"Cut Document II Crawling"

Flake · September 2011

Hi there, I noticed there is another post about cutting documents, raised by Roberto and answered by Matthias.

However, the first part of my problem is a little different from that post, and I believe it is an even easier one for people who know how to solve it.

Questions:

1. I will retrieve a web page, e.g. the Terms of Service page of Google. I want to put each paragraph into its own row in the output Excel file. I am not familiar with regular expressions and the like, so please help me here.

2. Does RM support crawling the Internet, say, finding hundreds of pages returned by the search keyword "Terms of Service"?

Thanks in advance.

colo · September 2011

Hi Flake,

let's see if I can answer the second Cut Document topic as well.

If you want to get each paragraph (or some other HTML element) out of a website, I would probably prefer using XPath rather than writing regular expressions. The expression //h:p will find every paragraph at any depth (h is the default namespace prefix for HTML elements):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.011">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
    <process expanded="true" height="145" width="413">
      <operator activated="true" class="web:get_webpage" compatibility="5.1.002" expanded="true" height="60" name="Get Page" width="90" x="45" y="30">
        <parameter key="url" value="http://microsoft.com"/>
        <parameter key="random_user_agent" value="true"/>
        <list key="query_parameters"/>
      </operator>
      <operator activated="true" class="text:cut_document" compatibility="5.1.001" expanded="true" height="60" name="Cut Document" width="90" x="179" y="30">
        <parameter key="query_type" value="XPath"/>
        <list key="string_machting_queries"/>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries"/>
        <list key="xpath_queries">
          <parameter key="paragraph" value="//h:p"/>
        </list>
        <list key="namespaces"/>
        <list key="index_queries"/>
        <process expanded="true" height="607" width="763">
          <connect from_port="segment" to_port="document 1"/>
          <portSpacing port="source_segment" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="5.1.001" expanded="true" height="76" name="Documents to Data" width="90" x="313" y="30">
        <parameter key="text_attribute" value="segment"/>
      </operator>
      <connect from_op="Get Page" from_port="output" to_op="Cut Document" to_port="document"/>
      <connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
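
For readers who want to see what //h:p selects outside RapidMiner, here is a minimal Python sketch using only the standard library. It assumes the page is already well-formed XHTML (real pages usually need cleaning first), and the XHTML snippet and the function name cut_paragraphs are made up for illustration; only the h prefix and the //h:p expression come from the process above.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML snippet standing in for a fetched, cleaned-up page.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div><p>First paragraph.</p></div>
    <p>Second <b>paragraph</b>.</p>
    <ul><li>A list item</li></ul>
  </body>
</html>"""

# Bind the prefix "h" to the XHTML namespace, mirroring the h: in //h:p.
NAMESPACES = {"h": "http://www.w3.org/1999/xhtml"}


def cut_paragraphs(xhtml):
    """Return the text of every <p> element at any depth, in document order."""
    root = ET.fromstring(xhtml)
    # ".//h:p" is how ElementTree spells the XPath "//h:p".
    matches = root.findall(".//h:p", NAMESPACES)
    # itertext() also collects text inside nested tags such as <b>.
    return ["".join(p.itertext()).strip() for p in matches]


paragraphs = cut_paragraphs(XHTML)
for number, text in enumerate(paragraphs, start=1):
    print(number, text)
```

Each returned string corresponds to one segment, so writing one string per row gives the "one paragraph per row" layout asked for in the question. The list item is not returned because it is not a p element.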

RapidMiner provides the "Crawl Web" operator for crawling, but it is very slow when checking keywords within the document content. Alternative crawlers (e.g. HTTrack, Heritrix) may perform much better. Maybe someday an advanced crawler will replace the current implementation. There are one or two older topics with discussions about this.

Regards
Matthias

P.S. Please consider posting questions like this in the "Problems and Support Forum". In my opinion the forum's description is closer to many of the topics created here.

Flake · September 2011

Dear Matthias,

Many thanks for your help! It works for my purpose with a few simple tweaks.

Below is my process. What I added are the steps that remove the HTML tags and extract only the text. But I ran into a problem: my solution generates several empty rows, so I had to add a Remove Duplicates operator to get rid of them.

However, because I am still learning to use RM, I believe I didn't do it in the best way.

If you are interested, could you give some suggestions on how to improve it?

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.011">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
    <process expanded="true" height="554" width="1217">
      <operator activated="true" class="web:get_webpage" compatibility="5.1.003" expanded="true" height="60" name="Get Page" width="90" x="45" y="120">
        <parameter key="url" value="http://www.microsoft.com/about/legal/en/us/IntellectualProperty/Copyright/Default.aspx"/>
        <parameter key="random_user_agent" value="true"/>
        <list key="query_parameters"/>
      </operator>
      <operator activated="true" class="text:cut_document" compatibility="5.1.002" expanded="true" height="60" name="Cut Document" width="90" x="246" y="120">
        <parameter key="query_type" value="XPath"/>
        <list key="string_machting_queries"/>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries"/>
        <list key="xpath_queries">
          <parameter key="paragraph" value="//h:p"/>
          <parameter key="list" value="//h:li"/>
        </list>
        <list key="namespaces"/>
        <list key="index_queries"/>
        <process expanded="true" height="673" width="1293">
          <connect from_port="segment" to_port="document 1"/>
          <portSpacing port="source_segment" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.1.002" expanded="true" height="94" name="Process Documents" width="90" x="514" y="120">
        <parameter key="create_word_vector" value="false"/>
        <parameter key="add_meta_information" value="false"/>
        <parameter key="keep_text" value="true"/>
        <process expanded="true" height="673" width="1293">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.003" expanded="true" height="60" name="Extract Content" width="90" x="447" y="30"/>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="remove_duplicates" compatibility="5.1.011" expanded="true" height="76" name="Remove Duplicates" width="90" x="648" y="120">
        <parameter key="attribute" value="text"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="generate_id" compatibility="5.1.011" expanded="true" height="76" name="Generate ID" width="90" x="782" y="120"/>
      <operator activated="true" class="write_excel" compatibility="5.1.011" expanded="true" height="60" name="Write Excel" width="90" x="983" y="210">
        <parameter key="excel_file" value="D:\Desktop\documents.xls"/>
      </operator>
      <connect from_op="Get Page" from_port="output" to_op="Cut Document" to_port="document"/>
      <connect from_op="Cut Document" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Remove Duplicates" to_port="example set input"/>
      <connect from_op="Remove Duplicates" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Write Excel" to_port="input"/>
      <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

colo · September 2011

Hi Flake,

this looks good to me. I would probably prefer "Filter Examples" to get rid of the empty rows instead of using "Remove Duplicates", but this isn't really important.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.011">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
    <process expanded="true" height="589" width="835">
      <operator activated="true" class="web:get_webpage" compatibility="5.1.002" expanded="true" height="60" name="Get Page" width="90" x="45" y="30">
        <parameter key="url" value="http://www.microsoft.com/about/legal/en/us/IntellectualProperty/Copyright/Default.aspx"/>
        <parameter key="random_user_agent" value="true"/>
        <list key="query_parameters"/>
      </operator>
      <operator activated="true" class="text:cut_document" compatibility="5.1.001" expanded="true" height="60" name="Cut Document" width="90" x="179" y="30">
        <parameter key="query_type" value="XPath"/>
        <list key="string_machting_queries"/>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries"/>
        <list key="xpath_queries">
          <parameter key="paragraph" value="//h:p"/>
          <parameter key="list" value="//h:li"/>
        </list>
        <list key="namespaces"/>
        <list key="index_queries"/>
        <process expanded="true" height="589" width="30">
          <connect from_port="segment" to_port="document 1"/>
          <portSpacing port="source_segment" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.1.001" expanded="true" height="94" name="Process Documents" width="90" x="313" y="30">
        <parameter key="create_word_vector" value="false"/>
        <parameter key="add_meta_information" value="false"/>
        <parameter key="keep_text" value="true"/>
        <process expanded="true" height="589" width="567">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.002" expanded="true" height="60" name="Extract Content" width="90" x="45" y="30"/>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="5.1.011" expanded="true" height="76" name="Filter Examples" width="90" x="447" y="120">
        <parameter key="condition_class" value="attribute_value_filter"/>
        <parameter key="parameter_string" value="text != \w*"/>
      </operator>
      <operator activated="false" class="remove_duplicates" compatibility="5.1.011" expanded="true" height="76" name="Remove Duplicates" width="90" x="447" y="30">
        <parameter key="attribute" value="text"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="generate_id" compatibility="5.1.011" expanded="true" height="76" name="Generate ID" width="90" x="581" y="30"/>
      <operator activated="true" class="write_excel" compatibility="5.1.011" expanded="true" height="60" name="Write Excel" width="90" x="715" y="30">
        <parameter key="excel_file" value="D:\Desktop\documents.xls"/>
      </operator>
      <connect from_op="Get Page" from_port="output" to_op="Cut Document" to_port="document"/>
      <connect from_op="Cut Document" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Write Excel" to_port="input"/>
      <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
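
The filter condition above uses the pattern \w*. As a side note, here is a small Python sketch of how such a pattern behaves if it has to match the whole value, compared with \s*. Whether "Filter Examples" evaluates the condition exactly this way is an assumption that is not confirmed in this thread; the sample rows and function names are made up for illustration.

```python
import re

# Hypothetical rows as they might look after the HTML tags were stripped.
rows = ["Copyright notice.", "", "   ", "Trademarks", "\n"]


def drop_blank_rows(values):
    """Keep values that contain at least one non-whitespace character."""
    blank = re.compile(r"\s*")
    return [value for value in values if not blank.fullmatch(value)]


def drop_word_only_rows(values):
    """Keep values that are not made up entirely of word characters."""
    word_only = re.compile(r"\w*")
    return [value for value in values if not word_only.fullmatch(value)]


kept_blank_filter = drop_blank_rows(rows)
kept_word_filter = drop_word_only_rows(rows)

print(kept_blank_filter)
print(kept_word_filter)
```

Under whole-string matching, \w* removes the empty string but also a single-word row such as "Trademarks", and it keeps rows that contain only spaces. If that matters for your data, a whitespace pattern like \s* may be the safer choice; it is worth checking against a few real rows.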

Since you are using more than one cut expression for the "Cut Document" operator, you may want to know which query an example came from. If so, you can activate "add meta information" for "Process Documents" and identify the source by looking at the attribute query_key (most of the other attributes can be filtered out with "Select Attributes"). If you don't need this information, you're already fine.

You have some options for rearranging the operator chain a bit (e.g. putting the HTML removal inside "Cut Document", or putting "Cut Document" inside "Process Documents"), but this doesn't really change anything. If I had created such a process, it would probably look the same.
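
The idea of tracing each row back to the query that produced it can be sketched in plain Python as follows. This is only an illustration under assumptions: the XHTML snippet and the function name cut_with_query_key are invented, and the order of rows here (all results of one query, then the next) is not necessarily the order RapidMiner uses. The query names "paragraph" and "list" and the attribute name query_key are taken from the process and the explanation above.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML snippet with one paragraph and two list items.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>Intro text.</p>
    <ul><li>First item</li><li>Second item</li></ul>
  </body>
</html>"""

NAMESPACES = {"h": "http://www.w3.org/1999/xhtml"}

# Same query names as in the process; ".//" is ElementTree's form of "//".
QUERIES = {"paragraph": ".//h:p", "list": ".//h:li"}


def cut_with_query_key(xhtml, queries):
    """Return one row per matched element, labelled with the query's name."""
    root = ET.fromstring(xhtml)
    rows = []
    for key, path in queries.items():
        for element in root.findall(path, NAMESPACES):
            text = "".join(element.itertext()).strip()
            rows.append({"query_key": key, "text": text})
    return rows


rows = cut_with_query_key(XHTML, QUERIES)
for row in rows:
    print(row["query_key"], "|", row["text"])
```

With the label stored next to the text, rows that came from //h:li can later be separated from rows that came from //h:p by filtering on query_key.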

Regards
Matthias
