"Using rapidminer as a crawler"

xtraplus Member Posts: 20  Maven
edited May 23 in Help
Hi

I would like to use RapidMiner as a web crawler. I want to give the program a list of URLs from a text file. RapidMiner should then go through the list and extract specific links from each URL, which should then be stored in another text file.

Can you do this with RapidMiner, please?

Answers

  • el_chief Member Posts: 63  Maven
    I bet if you searched YouTube you could find something...
  • xtraplus Member Posts: 20  Maven
    Thanks. Your XPath Google Docs video is great. Unfortunately Google allows only 50 XML imports.
  • el_chief Member Posts: 63  Maven
    I also have some RapidMiner crawling videos.

    I am investigating Selenium + Chrome to allow AJAX/JavaScript scraping too.
  • xtraplus Member Posts: 20  Maven
    Do you know a program where I can do importXML(Url, xpathquery) like in Google Docs, but with unlimited imports?

    Can you do this in MS Excel?
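    For reference, the importXML(Url, xpathquery) idea can be approximated outside Google Docs without any import quota. A minimal Python sketch (the helper names are illustrative, and ElementTree supports only a subset of XPath):

```python
import urllib.request
import xml.etree.ElementTree as ET

def import_xml(xml_text, xpath):
    """Rough stand-in for Google Docs' importXML: apply an XPath query
    to an XML document and return the matching elements' text.
    ElementTree supports only a subset of XPath."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(xpath)]

def import_xml_from_url(url, xpath):
    """Fetch a URL and query its body -- with no 50-import quota."""
    with urllib.request.urlopen(url) as resp:
        return import_xml(resp.read(), xpath)
```

    Note that real-world HTML usually needs an HTML-tolerant parser rather than a strict XML one.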
  • el_chief Member Posts: 63  Maven
  • xtraplus Member Posts: 20  Maven
    Great stuff! How do I make the web crawler crawl a list of URLs?
  • xtraplus Member Posts: 20  Maven
    I am trying to follow this one:

    http://rapid-i.com/rapidforum/index.php?action=printpage;topic=2753.0

    How do I read my URL list file into an example set?
  • xtraplus Member Posts: 20  Maven
    ok

    Read Document --> Documents to Data --> Loop Examples
  • xtraplus Member Posts: 20  Maven
    I did this:

    read in a document --> documents to Data --> extract macro --> Loop Examples


    This is the underlying code:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.006">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
        <process expanded="true" height="386" width="748">
          <operator activated="true" class="text:read_document" compatibility="5.1.001" expanded="true" height="60" name="Read Document" width="90" x="45" y="30">
            <parameter key="file" value="C:\Users\Home\Desktop\info.txt"/>
          </operator>
          <operator activated="true" class="web:crawl_web" compatibility="5.1.000" expanded="true" height="60" name="Crawl Web" width="90" x="447" y="165">
            <list key="crawling_rules"/>
            <parameter key="output_dir" value="C:\Users\Home\Desktop\DATA"/>
            <parameter key="extension" value="html"/>
            <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0"/>
          </operator>
          <operator activated="true" class="text:documents_to_data" compatibility="5.1.001" expanded="true" height="76" name="Documents to Data" width="90" x="179" y="30">
            <parameter key="text_attribute" value="string"/>
            <parameter key="add_meta_information" value="false"/>
            <parameter key="datamanagement" value="double_array"/>
          </operator>
          <operator activated="true" class="extract_macro" compatibility="5.1.006" expanded="true" height="60" name="Extract Macro" width="90" x="313" y="30">
            <parameter key="macro" value="macro"/>
          </operator>
          <operator activated="true" class="loop_examples" compatibility="5.1.006" expanded="true" height="76" name="Loop Examples" width="90" x="447" y="30">
            <parameter key="iteration_macro" value="macro"/>
            <process expanded="true" height="380" width="691">
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_example set" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Documents to Data" from_port="example set" to_op="Extract Macro" to_port="example set"/>
          <connect from_op="Extract Macro" from_port="example set" to_op="Loop Examples" to_port="example set"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="198"/>
        </process>
      </operator>
    </process>

    Is it correct so far?

    How do you connect the web crawler into this chain, please?
  • jgarcia Member Posts: 4 Contributor I
    The difficulty seems to be:
    - The "Get Pages" operator accepts a list of links as reference but doesn't save to file.
    - The "Crawl Web" operator saves pages to files but only accepts a URL as a fixed parameter.

    Could any of the more advanced users help out?

    Regards,
    Joao G.
  • colo Member Posts: 236  Guru
    Hi,

    so far, the process setup does not make much sense.

    You have to be aware of what you want to do. If you want to start with a single URL, automatically catch the links in the document, and follow them to a maximum traversal depth, you need a crawler. Otherwise, if you already have a complete list of the URLs you want to retrieve, you don't need to crawl and should use the links to retrieve the corresponding websites directly.

    Since you mention a list of URLs, I guess you don't need the "Crawl Web" operator; "Get Page" or "Get Pages" are more appropriate in this case. I don't know in which format the links are stored in your file. It would be easy if you had the URLs in table form, as in CSV or XLS files: those can simply be read as example sets. If you have them as single lines in an ordinary text file, you have to convert them to build a proper example set (please post an example from the list for further advice). After that you can use the "Get Pages" operator to retrieve the entire webpage for each URL in your list (the HTML code will be added as an attribute).
    If you want to write the websites to files (as Joao mentioned), I would suggest "Get Page" instead, since you have to use a loop anyway.
    After converting the URLs to an example set, just add "Loop Examples" and go to the inner process of this operator (double-click it). Here you need "Extract Macro" to get the current URL. Add a "Get Page" operator (be sure to check the execution order of the operators; "Extract Macro" has to come first) and use the extracted macro value as the URL parameter. The operator delivers a single document, which can be written to disk via the "Write Document" operator.
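    Outside RapidMiner, the loop described above (get the current URL, retrieve the page, write the document to disk) can be sketched in a few lines of plain Python; the file-naming helper and function names below are purely illustrative:

```python
import os
import re
import urllib.request

def url_to_filename(url):
    """Turn a URL into a safe file name,
    e.g. 'http://www.double.de/C' -> 'www.double.de_C.html'."""
    name = re.sub(r"^https?://", "", url.strip())
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("_")
    return name + ".html"

def fetch_all(url_file, out_dir):
    """The equivalent of Loop Examples + Extract Macro + Get Page +
    Write Document: one retrieved page per URL, written to out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    with open(url_file) as f:
        for url in filter(None, (line.strip() for line in f)):
            with urllib.request.urlopen(url) as resp:
                html = resp.read()
            with open(os.path.join(out_dir, url_to_filename(url)), "wb") as out:
                out.write(html)
```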

    This is just an outline of how to solve the task. If you need further help, please post your URL list (or an extract of it) and let us know which parts are still unclear.

    Regards
    Matthias

    P.S. If you use the "Modify" option in your first post, you don't have to add a new one every 10 minutes ;)
  • xtraplus Member Posts: 20  Maven
    Thank you colo. This was very helpful.

    I did:

    Read Excel (followed the import wizard) --> Loop examples  

    Inside the Loop examples I did:

    exa --> Extract macro --> Get Pages --> exa.

    This is the corresponding code, and I get green lights for both:
     
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.006">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
       <process expanded="true" height="380" width="691">
         <operator activated="true" class="read_excel" compatibility="5.1.006" expanded="true" height="60" name="Read Excel" width="90" x="45" y="165">
           <parameter key="excel_file" value="C:\Users\Home\Desktop\test.xls"/>
           <parameter key="imported_cell_range" value="A1:A2"/>
           <parameter key="first_row_as_names" value="false"/>
           <list key="annotations">
             <parameter key="0" value="Name"/>
             <parameter key="1" value="Name"/>
           </list>
           <parameter key="locale" value="English"/>
           <list key="data_set_meta_data_information">
             <parameter key="0" value="http://www\.google\.de/.true.attribute_value.attribute"/>
           </list>
           <parameter key="read_not_matching_values_as_missings" value="false"/>
         </operator>
         <operator activated="true" class="loop_examples" compatibility="5.1.006" expanded="true" height="76" name="Loop Examples" width="90" x="246" y="165">
           <parameter key="iteration_macro" value="url"/>
           <process expanded="true" height="398" width="709">
             <operator activated="true" class="extract_macro" compatibility="5.1.006" expanded="true" height="60" name="Extract Macro" width="90" x="179" y="30">
               <parameter key="macro" value="extract"/>
             </operator>
             <operator activated="true" class="web:retrieve_webpages" compatibility="5.1.000" expanded="true" height="60" name="Get Pages" width="90" x="380" y="30">
               <parameter key="link_attribute" value="%{extract}"/>
               <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0"/>
               <parameter key="accept_cookies" value="all"/>
             </operator>
             <connect from_port="example set" to_op="Extract Macro" to_port="example set"/>
             <connect from_op="Extract Macro" from_port="example set" to_op="Get Pages" to_port="Example Set"/>
             <connect from_op="Get Pages" from_port="Example Set" to_port="example set"/>
             <portSpacing port="source_example set" spacing="36"/>
             <portSpacing port="sink_example set" spacing="0"/>
             <portSpacing port="sink_output 1" spacing="0"/>
           </process>
         </operator>
         <connect from_op="Read Excel" from_port="output" to_op="Loop Examples" to_port="example set"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
       </process>
     </operator>
    </process>
    With "Get Page" I could just use "Write Document", but I don't know how to use "Get Page" inside "Loop Examples". It has no input connection and just one output port. I tried to connect Extract Macro --> exa and Get Page --> out, but that did not seem to give me green lights.

    And with "Get Pages" I get green lights, but I don't know how to get files after the "Loop Examples".

    My starting point is a URL file with one URL per cell (in .xls) or per line (in .txt):

    http://www.double.de
    http://www.singel.de
    http://www.tripple.de

    And I would like to retrieve all pages of these URLs with depth 1:

    http://www.double.de
    http://www.double.de/C
    http://www.singel.de
    http://www.singel.de/A
    http://www.tripple.de
    http://www.tripple.de/8
    ...

    Probably I will need the crawler for that. But if I were able to use "Get Page", I believe I could use "Crawl Web" as well. True?
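    For reference, depth 1 just means: fetch each start URL, collect the links on that page, and fetch those pages as well. The link-collection step can be sketched with Python's standard html.parser (the class and function names are illustrative):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect href targets of <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    """The depth-1 step: all URLs linked from one retrieved page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```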
  • colo Member Posts: 236  Guru
    Hi,

    you are right, you need to crawl if you want to follow links to a certain depth (in the case of depth 1 you could also extract them via XPath or regular expressions, but the crawler is more comfortable). And you are also right in your assumption that "Crawl Web" and "Get Page" have to be included in the process in a similar way. If you use "Get Pages" you don't need the loop; you should connect it to "Read Excel" directly. But this is just for clarification, since your list of URLs is not complete (sub-pages are not contained).

    It seems that you are not really aware of the macro concept. If you set "Extract Macro" to macro type "data_value", you get the value of a defined attribute for a single example. The column is addressed by the parameter "attribute name" and the row by "example index". You could set the index to a fixed value such as "1", but the loop provides a macro which is automatically increased for each example considered. You can choose a name for this control variable via the "iteration macro" parameter of "Loop Examples" (the default is "example"). If you want to use a macro/variable value somewhere, just type %{macro_name}. I built a small example to illustrate this (you should be able to see how to include the crawler operator and how to feed a URL to it). The first operator, "Subprocess", just generates some artificial data as it might be delivered by "Read Excel" (in your case, replace it by the "Read Excel" operator again).
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.008">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
       <process expanded="true" height="380" width="691">
         <operator activated="false" class="read_excel" compatibility="5.1.008" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
           <parameter key="excel_file" value="C:\Users\Home\Desktop\test.xls"/>
           <parameter key="imported_cell_range" value="A1:A2"/>
           <parameter key="first_row_as_names" value="false"/>
           <list key="annotations">
             <parameter key="0" value="Name"/>
             <parameter key="1" value="Name"/>
           </list>
           <parameter key="locale" value="English"/>
           <list key="data_set_meta_data_information">
             <parameter key="0" value="http://www\.google\.de/.true.attribute_value.attribute"/>
           </list>
           <parameter key="read_not_matching_values_as_missings" value="false"/>
         </operator>
         <operator activated="true" class="subprocess" compatibility="5.1.008" expanded="true" height="76" name="Subprocess" width="90" x="45" y="120">
           <process expanded="true" height="607" width="773">
             <operator activated="true" class="generate_nominal_data" compatibility="5.1.008" expanded="true" height="60" name="Generate Nominal Data" width="90" x="45" y="30">
               <parameter key="number_examples" value="3"/>
               <parameter key="number_of_attributes" value="1"/>
               <parameter key="number_of_values" value="1"/>
             </operator>
             <operator activated="true" class="select_attributes" compatibility="5.1.008" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="30">
               <parameter key="attribute_filter_type" value="single"/>
               <parameter key="attribute" value="att1"/>
               <parameter key="include_special_attributes" value="true"/>
             </operator>
             <operator activated="true" class="rename" compatibility="5.1.008" expanded="true" height="76" name="Rename" width="90" x="179" y="165">
               <parameter key="old_name" value="att1"/>
               <parameter key="new_name" value="url"/>
               <list key="rename_additional_attributes"/>
             </operator>
             <operator activated="true" class="set_data" compatibility="5.1.008" expanded="true" height="76" name="Set Data" width="90" x="313" y="165">
               <parameter key="example_index" value="1"/>
               <parameter key="attribute_name" value="url"/>
               <parameter key="value" value="http://google.com"/>
               <list key="additional_values"/>
             </operator>
             <operator activated="true" class="set_data" compatibility="5.1.008" expanded="true" height="76" name="Set Data (2)" width="90" x="447" y="165">
               <parameter key="example_index" value="2"/>
               <parameter key="attribute_name" value="url"/>
               <parameter key="value" value="http://microsoft.com"/>
               <list key="additional_values"/>
             </operator>
             <operator activated="true" class="set_data" compatibility="5.1.008" expanded="true" height="76" name="Set Data (3)" width="90" x="581" y="165">
               <parameter key="example_index" value="3"/>
               <parameter key="attribute_name" value="url"/>
               <parameter key="value" value="http://mozilla.org"/>
               <list key="additional_values"/>
             </operator>
             <connect from_op="Generate Nominal Data" from_port="output" to_op="Select Attributes" to_port="example set input"/>
             <connect from_op="Select Attributes" from_port="example set output" to_op="Rename" to_port="example set input"/>
             <connect from_op="Rename" from_port="example set output" to_op="Set Data" to_port="example set input"/>
             <connect from_op="Set Data" from_port="example set output" to_op="Set Data (2)" to_port="example set input"/>
             <connect from_op="Set Data (2)" from_port="example set output" to_op="Set Data (3)" to_port="example set input"/>
             <connect from_op="Set Data (3)" from_port="example set output" to_port="out 1"/>
             <portSpacing port="source_in 1" spacing="0"/>
             <portSpacing port="sink_out 1" spacing="0"/>
             <portSpacing port="sink_out 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="loop_examples" compatibility="5.1.008" expanded="true" height="94" name="Loop Examples" width="90" x="246" y="30">
           <process expanded="true" height="398" width="709">
             <operator activated="true" class="delay" compatibility="5.1.008" expanded="true" height="76" name="Delay" width="90" x="45" y="30">
               <parameter key="delay" value="fixed"/>
             </operator>
             <operator activated="true" class="extract_macro" compatibility="5.1.008" expanded="true" height="60" name="Extract Macro" width="90" x="179" y="30">
               <parameter key="macro" value="website_url"/>
               <parameter key="macro_type" value="data_value"/>
               <parameter key="attribute_name" value="url"/>
               <parameter key="example_index" value="%{example}"/>
             </operator>
             <operator activated="true" class="web:crawl_web" compatibility="5.1.000" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="165">
               <parameter key="url" value="%{website_url}"/>
               <list key="crawling_rules"/>
               <parameter key="max_pages" value="100"/>
               <parameter key="max_depth" value="1"/>
               <parameter key="domain" value="server"/>
               <parameter key="max_page_size" value="500"/>
             </operator>
             <connect from_port="example set" to_op="Delay" to_port="through 1"/>
             <connect from_op="Delay" from_port="through 1" to_op="Extract Macro" to_port="example set"/>
             <connect from_op="Extract Macro" from_port="example set" to_port="example set"/>
             <connect from_op="Crawl Web" from_port="Example Set" to_port="output 1"/>
             <portSpacing port="source_example set" spacing="0"/>
             <portSpacing port="sink_example set" spacing="0"/>
             <portSpacing port="sink_output 1" spacing="0"/>
             <portSpacing port="sink_output 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="append" compatibility="5.1.008" expanded="true" height="76" name="Append" width="90" x="447" y="30"/>
         <connect from_op="Subprocess" from_port="out 1" to_op="Loop Examples" to_port="example set"/>
         <connect from_op="Loop Examples" from_port="output 1" to_op="Append" to_port="example set 1"/>
         <connect from_op="Append" from_port="merged set" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
    The "Delay" operator is optional and might be useful if you connect to pages on the same server multiple times (to avoid firing HTTP requests in rapid succession and perhaps getting banned in return).
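    The same politeness idea can be sketched as a small per-host throttle in Python; the one-second minimum interval and the class name are illustrative, and the clock/sleep hooks are injectable only to make the sketch testable:

```python
import time
from urllib.parse import urlparse

class PerHostThrottle:
    """Wait before a request if the same host was contacted too recently."""
    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock      # injectable, so the sketch is testable
        self.sleep = sleep
        self.last_seen = {}     # host -> time of the last request

    def wait(self, url):
        host = urlparse(url).netloc
        now = self.clock()
        last = self.last_seen.get(host)
        if last is not None and now - last < self.min_interval:
            self.sleep(self.min_interval - (now - last))
        self.last_seen[host] = self.clock()
```

    Calling throttle.wait(url) before each HTTP request then plays the role of the "Delay" operator inside the loop.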

    Regards
    Matthias
  • xtraplus Member Posts: 20  Maven
    Hi Matthias,

    Thank you very much. It does run, but there is now a new problem: storing a page always overwrites the page that was stored before, so I get only one page in the end. When the crawler finds several pages to store from the current URL, it does store multiple files. But across several URLs, each finding overwrites the old one.

    In your code I changed the URLs in "Set Data" and ran it with two simple crawling rules. The process finishes quickly; the log says it stored two pages, but you get only one page, because the first one gets overwritten.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.006">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
        <process expanded="true" height="380" width="691">
          <operator activated="false" class="read_excel" compatibility="5.1.006" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
            <parameter key="excel_file" value="C:\Users\Home\Desktop\test.xls"/>
            <parameter key="imported_cell_range" value="A1:A2"/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations">
              <parameter key="0" value="Name"/>
              <parameter key="1" value="Name"/>
            </list>
            <parameter key="locale" value="English"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="http://www\.google\.de/.true.attribute_value.attribute"/>
            </list>
            <parameter key="read_not_matching_values_as_missings" value="false"/>
          </operator>
          <operator activated="true" class="subprocess" compatibility="5.1.006" expanded="true" height="76" name="Subprocess" width="90" x="45" y="120">
            <process expanded="true" height="607" width="773">
              <operator activated="true" class="generate_nominal_data" compatibility="5.1.006" expanded="true" height="60" name="Generate Nominal Data" width="90" x="45" y="30">
                <parameter key="number_examples" value="3"/>
                <parameter key="number_of_attributes" value="1"/>
                <parameter key="number_of_values" value="1"/>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="5.1.006" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="30">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="att1"/>
                <parameter key="include_special_attributes" value="true"/>
              </operator>
              <operator activated="true" class="rename" compatibility="5.1.006" expanded="true" height="76" name="Rename" width="90" x="179" y="165">
                <parameter key="old_name" value="att1"/>
                <parameter key="new_name" value="url"/>
                <list key="rename_additional_attributes"/>
              </operator>
              <operator activated="true" class="set_data" compatibility="5.1.006" expanded="true" height="76" name="Set Data" width="90" x="313" y="165">
                <parameter key="example_index" value="1"/>
                <parameter key="attribute_name" value="url"/>
                <parameter key="value" value="http://www.philosophischergarten.de/ "/>
                <list key="additional_values"/>
              </operator>
              <operator activated="true" class="set_data" compatibility="5.1.006" expanded="true" height="76" name="Set Data (2)" width="90" x="447" y="165">
                <parameter key="example_index" value="2"/>
                <parameter key="attribute_name" value="url"/>
                <parameter key="value" value="http://www.ganuba.de/ "/>
                <list key="additional_values"/>
              </operator>
              <operator activated="true" class="set_data" compatibility="5.1.006" expanded="true" height="76" name="Set Data (3)" width="90" x="581" y="165">
                <parameter key="example_index" value="3"/>
                <parameter key="attribute_name" value="url"/>
                <parameter key="value" value="http://www.stone.ag/jwa/de/home.jsp "/>
                <list key="additional_values"/>
              </operator>
              <connect from_op="Generate Nominal Data" from_port="output" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_op="Rename" to_port="example set input"/>
              <connect from_op="Rename" from_port="example set output" to_op="Set Data" to_port="example set input"/>
              <connect from_op="Set Data" from_port="example set output" to_op="Set Data (2)" to_port="example set input"/>
              <connect from_op="Set Data (2)" from_port="example set output" to_op="Set Data (3)" to_port="example set input"/>
              <connect from_op="Set Data (3)" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="loop_examples" compatibility="5.1.006" expanded="true" height="94" name="Loop Examples" width="90" x="246" y="30">
            <process expanded="true" height="398" width="709">
              <operator activated="true" class="delay" compatibility="5.1.006" expanded="true" height="76" name="Delay" width="90" x="45" y="30"/>
              <operator activated="true" class="extract_macro" compatibility="5.1.006" expanded="true" height="60" name="Extract Macro" width="90" x="179" y="30">
                <parameter key="macro" value="website_url"/>
                <parameter key="macro_type" value="data_value"/>
                <parameter key="attribute_name" value="url"/>
                <parameter key="example_index" value="%{example}"/>
              </operator>
              <operator activated="true" class="web:crawl_web" compatibility="5.1.000" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="165">
                <parameter key="url" value="%{website_url}"/>
                <list key="crawling_rules">
                  <parameter key="follow_link_with_matching_url" value=".+imp.+"/>
                  <parameter key="store_with_matching_url" value=".+imp.+"/>
                </list>
                <parameter key="output_dir" value="C:\Users\Home\Desktop\Sites"/>
                <parameter key="extension" value="html"/>
                <parameter key="max_pages" value="100"/>
                <parameter key="max_depth" value="1"/>
                <parameter key="domain" value="server"/>
                <parameter key="max_page_size" value="500"/>
              </operator>
              <connect from_port="example set" to_op="Delay" to_port="through 1"/>
              <connect from_op="Delay" from_port="through 1" to_op="Extract Macro" to_port="example set"/>
              <connect from_op="Extract Macro" from_port="example set" to_port="example set"/>
              <connect from_op="Crawl Web" from_port="Example Set" to_port="output 1"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_example set" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="5.1.006" expanded="true" height="76" name="Append" width="90" x="447" y="30"/>
          <connect from_op="Subprocess" from_port="out 1" to_op="Loop Examples" to_port="example set"/>
          <connect from_op="Loop Examples" from_port="output 1" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    I think something is wrong with the variables in the iteration macro.
  • colo Member Posts: 236  Guru
    Hi xtraplus,

    there is nothing wrong with the macros. This just works as it should, but not as you intended ;)
    The crawler receives the same output directory parameter setting for every loop execution, so of course the files are overwritten while looping. To avoid this, you have to set some iteration-specific value. I extended the process with an example for this: I extracted the domain as a specific property and used it as a subfolder for the file output. This is again done by using a macro (a second "Extract Macro" operator) and appending the macro value to the "output dir" parameter of the "Crawl Web" operator.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.008">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
       <process expanded="true" height="380" width="691">
         <operator activated="false" class="read_excel" compatibility="5.1.008" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
           <parameter key="excel_file" value="C:\Users\Home\Desktop\test.xls"/>
           <parameter key="imported_cell_range" value="A1:A2"/>
           <parameter key="first_row_as_names" value="false"/>
           <list key="annotations">
             <parameter key="0" value="Name"/>
             <parameter key="1" value="Name"/>
           </list>
           <parameter key="locale" value="English"/>
           <list key="data_set_meta_data_information">
             <parameter key="0" value="http://www\.google\.de/.true.attribute_value.attribute"/>
           </list>
           <parameter key="read_not_matching_values_as_missings" value="false"/>
         </operator>
         <operator activated="true" class="subprocess" compatibility="5.1.008" expanded="true" height="76" name="Subprocess" width="90" x="45" y="120">
           <process expanded="true" height="607" width="773">
             <operator activated="true" class="generate_nominal_data" compatibility="5.1.008" expanded="true" height="60" name="Generate Nominal Data" width="90" x="45" y="30">
               <parameter key="number_examples" value="3"/>
               <parameter key="number_of_attributes" value="1"/>
               <parameter key="number_of_values" value="1"/>
             </operator>
             <operator activated="true" class="select_attributes" compatibility="5.1.008" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="30">
               <parameter key="attribute_filter_type" value="single"/>
               <parameter key="attribute" value="att1"/>
               <parameter key="include_special_attributes" value="true"/>
             </operator>
             <operator activated="true" class="rename" compatibility="5.1.008" expanded="true" height="76" name="Rename" width="90" x="179" y="165">
               <parameter key="old_name" value="att1"/>
               <parameter key="new_name" value="url"/>
               <list key="rename_additional_attributes"/>
             </operator>
             <operator activated="true" class="set_data" compatibility="5.1.008" expanded="true" height="76" name="Set Data" width="90" x="313" y="165">
               <parameter key="example_index" value="1"/>
               <parameter key="attribute_name" value="url"/>
               <parameter key="value" value="http://www.philosophischergarten.de/ "/>
               <list key="additional_values"/>
             </operator>
             <operator activated="true" class="set_data" compatibility="5.1.008" expanded="true" height="76" name="Set Data (2)" width="90" x="447" y="165">
               <parameter key="example_index" value="2"/>
               <parameter key="attribute_name" value="url"/>
               <parameter key="value" value="http://www.ganuba.de/ "/>
               <list key="additional_values"/>
             </operator>
             <operator activated="true" class="set_data" compatibility="5.1.008" expanded="true" height="76" name="Set Data (3)" width="90" x="581" y="165">
               <parameter key="example_index" value="3"/>
               <parameter key="attribute_name" value="url"/>
               <parameter key="value" value="http://www.stone.ag/jwa/de/home.jsp "/>
               <list key="additional_values"/>
             </operator>
             <connect from_op="Generate Nominal Data" from_port="output" to_op="Select Attributes" to_port="example set input"/>
             <connect from_op="Select Attributes" from_port="example set output" to_op="Rename" to_port="example set input"/>
             <connect from_op="Rename" from_port="example set output" to_op="Set Data" to_port="example set input"/>
             <connect from_op="Set Data" from_port="example set output" to_op="Set Data (2)" to_port="example set input"/>
             <connect from_op="Set Data (2)" from_port="example set output" to_op="Set Data (3)" to_port="example set input"/>
             <connect from_op="Set Data (3)" from_port="example set output" to_port="out 1"/>
             <portSpacing port="source_in 1" spacing="0"/>
             <portSpacing port="sink_out 1" spacing="0"/>
             <portSpacing port="sink_out 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="text:generate_extract" compatibility="5.1.001" expanded="true" height="60" name="Generate Extract" width="90" x="179" y="120">
           <parameter key="source_attribute" value="url"/>
           <parameter key="query_type" value="Regular Expression"/>
           <list key="string_machting_queries"/>
           <list key="regular_expression_queries">
             <parameter key="url_domain" value="https?://(.*?)/"/>
           </list>
           <list key="regular_region_queries"/>
           <list key="xpath_queries"/>
           <list key="namespaces"/>
           <list key="index_queries"/>
         </operator>
         <operator activated="true" class="loop_examples" compatibility="5.1.008" expanded="true" height="94" name="Loop Examples" width="90" x="380" y="30">
           <process expanded="true" height="398" width="709">
             <operator activated="true" class="delay" compatibility="5.1.008" expanded="true" height="76" name="Delay" width="90" x="45" y="30"/>
             <operator activated="true" class="extract_macro" compatibility="5.1.008" expanded="true" height="60" name="Extract Macro" width="90" x="179" y="30">
               <parameter key="macro" value="website_url"/>
               <parameter key="macro_type" value="data_value"/>
               <parameter key="attribute_name" value="url"/>
               <parameter key="example_index" value="%{example}"/>
             </operator>
             <operator activated="true" class="extract_macro" compatibility="5.1.008" expanded="true" height="60" name="Extract Macro (2)" width="90" x="313" y="30">
               <parameter key="macro" value="domain"/>
               <parameter key="macro_type" value="data_value"/>
               <parameter key="attribute_name" value="url_domain"/>
               <parameter key="example_index" value="%{example}"/>
             </operator>
             <operator activated="true" class="execute_program" compatibility="5.1.008" expanded="true" height="76" name="Execute Program" width="90" x="447" y="30">
               <parameter key="command" value="cmd.exe /c &quot;md C:\Users\Home\Desktop\Sites\%{domain}&quot;"/>
             </operator>
             <operator activated="true" class="web:crawl_web" compatibility="5.1.000" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="120">
               <parameter key="url" value="%{website_url}"/>
               <list key="crawling_rules">
                 <parameter key="follow_link_with_matching_url" value=".+imp.+"/>
                 <parameter key="store_with_matching_url" value=".+imp.+"/>
               </list>
               <parameter key="output_dir" value="C:\Users\Home\Desktop\Sites\%{domain}"/>
               <parameter key="extension" value="html"/>
               <parameter key="max_pages" value="5"/>
               <parameter key="max_depth" value="1"/>
               <parameter key="domain" value="server"/>
               <parameter key="max_page_size" value="500"/>
             </operator>
             <connect from_port="example set" to_op="Delay" to_port="through 1"/>
             <connect from_op="Delay" from_port="through 1" to_op="Extract Macro" to_port="example set"/>
             <connect from_op="Extract Macro" from_port="example set" to_op="Extract Macro (2)" to_port="example set"/>
             <connect from_op="Extract Macro (2)" from_port="example set" to_op="Execute Program" to_port="through 1"/>
             <connect from_op="Execute Program" from_port="through 1" to_port="example set"/>
             <connect from_op="Crawl Web" from_port="Example Set" to_port="output 1"/>
             <portSpacing port="source_example set" spacing="0"/>
             <portSpacing port="sink_example set" spacing="0"/>
             <portSpacing port="sink_output 1" spacing="72"/>
             <portSpacing port="sink_output 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="append" compatibility="5.1.008" expanded="true" height="76" name="Append" width="90" x="514" y="30"/>
         <connect from_op="Subprocess" from_port="out 1" to_op="Generate Extract" to_port="Example Set"/>
         <connect from_op="Generate Extract" from_port="Example Set" to_op="Loop Examples" to_port="example set"/>
         <connect from_op="Loop Examples" from_port="output 1" to_op="Append" to_port="example set 1"/>
         <connect from_op="Append" from_port="merged set" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
    Edit: I forgot that the output directories for the crawler have to exist before the files are saved; otherwise no data is written to disk. I created them with the "Execute Program" operator, but the command shown is only valid on Windows. If you are working with another OS, you have to adapt it.
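    For other operating systems, one option is to let "Execute Program" call a tiny script instead of `cmd.exe`. The sketch below (the base path is illustrative, and passing the domain as a command-line argument is my assumption about how you would wire it up) shows a cross-platform way to create the folder:

    ```python
    import os
    import sys

    # Illustrative base path; adjust to match the crawler's output_dir.
    base_dir = os.path.join(os.path.expanduser("~"), "Desktop", "Sites")

    # Domain passed as the first command-line argument (e.g. the %{domain} macro).
    domain = sys.argv[1] if len(sys.argv) > 1 else "example.com"

    # makedirs also creates missing parent folders; exist_ok=True avoids
    # an error if the folder is already there, so the loop can rerun safely.
    os.makedirs(os.path.join(base_dir, domain), exist_ok=True)
    ```

    This keeps the process portable: only the command string in "Execute Program" changes, not the rest of the loop.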

    Best regards
    Matthias
  • xtraplusxtraplus Member Posts: 20  Maven
    Hi Matthias,

    great job. It took some while till I understood your program.

    EDIT:

    It still does not work quite right. There is something wrong with your regular expression. When the input URL is:

    http://www.abc.de/    (with a slash at the end)

    then it works perfectly,

    but when the slash at the end is missing, the "Execute Program" operator fails with an error message:

    Process 'cmd.exe /c "md C:\Users\Home\Desktop\Sites\?"' exited with error code 1.

    So I think you have to change the regular expression somehow.

    Regards
    Ben
  • colocolo Member Posts: 236  Guru
    Hi Ben,

    extracting the domain with regular expressions was just a quick example to show one way to generate specific folders. You can use anything else you want (maybe a counting macro variable). I wanted to leave some work to you ;), but if you want to use the domain regex, try this:
    https?://(.*?)(/|$)
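    The difference between the two expressions is easy to check with a quick sketch (Python here, purely to illustrate; these patterns behave the same in the Java regex engine RapidMiner uses). The original pattern needs a literal `/` to terminate the lazy group, while the fixed one also accepts the end of the string:

    ```python
    import re

    # Original pattern: the lazy group must be followed by a slash,
    # so a URL without a trailing slash yields no match at all.
    old_pattern = r"https?://(.*?)/"

    # Fixed pattern: end-of-string is an alternative terminator.
    new_pattern = r"https?://(.*?)(/|$)"

    for url in ["http://www.abc.de/", "http://www.abc.de"]:
        old_match = re.search(old_pattern, url)
        new_match = re.search(new_pattern, url)
        print(url,
              old_match.group(1) if old_match else None,  # None for the slashless URL
              new_match.group(1))                          # www.abc.de in both cases
    ```

    With the fixed pattern the `%{domain}` macro is filled either way, so the `md` command no longer receives an empty value.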
    Regards
    Matthias