
"Cut Document II Crawling"

FlakeFlake Member Posts: 13 Contributor II
edited June 2019 in Help
Hi there, I noticed there is another post about cutting documents, raised by Roberto and answered by Matthias.

However, the first part of my problem is a bit different from that post, and I believe it is an even easier one for anyone who knows how to solve it.

Questions:

1. I will retrieve a web page, e.g. the Terms of Service page of Google. I want to put each paragraph into a row of the output Excel file. I am not familiar with regular expressions and the like, so please help me here.

2. Does RM support crawling the Internet, say, to find hundreds of pages returned by the search keyword "Terms of Service"?

Thanks in advance.

Answers

  • colocolo Member Posts: 236 Maven
    Hi Flake,

Let's see if I can answer the second cut document topic as well ;)

If you want to get each paragraph (or some other HTML element) out of a website, I would prefer using XPath rather than writing regular expressions. The expression //h:p will find every paragraph at any depth (h is the default namespace prefix for HTML elements):
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.011">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
       <process expanded="true" height="145" width="413">
         <operator activated="true" class="web:get_webpage" compatibility="5.1.002" expanded="true" height="60" name="Get Page" width="90" x="45" y="30">
           <parameter key="url" value="http://microsoft.com"/>
           <parameter key="random_user_agent" value="true"/>
           <list key="query_parameters"/>
         </operator>
         <operator activated="true" class="text:cut_document" compatibility="5.1.001" expanded="true" height="60" name="Cut Document" width="90" x="179" y="30">
           <parameter key="query_type" value="XPath"/>
           <list key="string_machting_queries"/>
           <list key="regular_expression_queries"/>
           <list key="regular_region_queries"/>
           <list key="xpath_queries">
             <parameter key="paragraph" value="//h:p"/>
           </list>
           <list key="namespaces"/>
           <list key="index_queries"/>
           <process expanded="true" height="607" width="763">
             <connect from_port="segment" to_port="document 1"/>
             <portSpacing port="source_segment" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="text:documents_to_data" compatibility="5.1.001" expanded="true" height="76" name="Documents to Data" width="90" x="313" y="30">
           <parameter key="text_attribute" value="segment"/>
         </operator>
         <connect from_op="Get Page" from_port="output" to_op="Cut Document" to_port="document"/>
         <connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
         <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
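Outside RapidMiner, the same XPath idea can be sketched with Python's standard library; the XHTML snippet below is a hypothetical stand-in for a fetched page, and the h: prefix is bound to the XHTML namespace just as RapidMiner does by default:

```python
import xml.etree.ElementTree as ET

# Minimal XHTML snippet standing in for a fetched page (illustrative content).
xhtml = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>First paragraph.</p>
    <div><p>Nested paragraph.</p></div>
  </body>
</html>"""

root = ET.fromstring(xhtml)
# Bind the prefix "h" to the XHTML namespace, mirroring RapidMiner's default.
ns = {"h": "http://www.w3.org/1999/xhtml"}
# ".//h:p" matches <p> elements at any depth, like the XPath //h:p.
paragraphs = [p.text for p in root.findall(".//h:p", ns)]
print(paragraphs)  # ['First paragraph.', 'Nested paragraph.']
```

Note that real-world HTML is rarely well-formed XML, which is one reason a dedicated extraction step (or operator) is preferable to hand-rolled parsing.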
RapidMiner provides the "Crawl Web" operator for crawling, but it is very slow when checking keywords within document content. Some alternative crawlers (e.g. HTTrack, Heritrix) may perform much better. Maybe someday an advanced crawler will replace the current implementation; there are one or two older topics with discussions about this.
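The keyword-matching part of such a crawl can be sketched in plain Python with the standard library's html.parser; the page content and keyword below are hypothetical, and a real crawler would fetch pages and follow the collected links recursively:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values whose anchor text contains a keyword."""
    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if self.keyword in "".join(self._text).lower():
                self.links.append(self._href)
            self._href = None

# Hypothetical page content standing in for a crawled document.
page = '<a href="/tos">Terms of Service</a> <a href="/about">About</a>'
collector = LinkCollector("terms of service")
collector.feed(page)
print(collector.links)  # ['/tos']
```

This only illustrates the matching step; politeness (robots.txt, rate limiting) and deduplication of visited URLs are what make a real crawler more involved.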

    Regards
    Matthias

    P.S. Please consider posting questions like this in the "Problems and Support Forum". In my opinion the forum's description is closer to many of the topics created here.
  • FlakeFlake Member Posts: 13 Contributor II
    Dear Matthias,

Many thanks for your help! It works for my purpose with a few simple tweaks. :)

Below is my process. What I added are steps to remove the HTML tags and extract only the text. But I ran into problems: my solution generates several empty rows, so I had to add a Remove Duplicates operator to remove them.

However, since I am still learning RM, I don't believe I did it in the best way.

    If you are interested, could you give some suggestions on how to improve here?
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.011">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
        <process expanded="true" height="554" width="1217">
          <operator activated="true" class="web:get_webpage" compatibility="5.1.003" expanded="true" height="60" name="Get Page" width="90" x="45" y="120">
            <parameter key="url" value="http://www.microsoft.com/about/legal/en/us/IntellectualProperty/Copyright/Default.aspx"/>
            <parameter key="random_user_agent" value="true"/>
            <list key="query_parameters"/>
          </operator>
          <operator activated="true" class="text:cut_document" compatibility="5.1.002" expanded="true" height="60" name="Cut Document" width="90" x="246" y="120">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="paragraph" value="//h:p"/>
              <parameter key="list" value="//h:li"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true" height="673" width="1293">
              <connect from_port="segment" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.1.002" expanded="true" height="94" name="Process Documents" width="90" x="514" y="120">
            <parameter key="create_word_vector" value="false"/>
            <parameter key="add_meta_information" value="false"/>
            <parameter key="keep_text" value="true"/>
            <process expanded="true" height="673" width="1293">
              <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.003" expanded="true" height="60" name="Extract Content" width="90" x="447" y="30"/>
              <connect from_port="document" to_op="Extract Content" to_port="document"/>
              <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="remove_duplicates" compatibility="5.1.011" expanded="true" height="76" name="Remove Duplicates" width="90" x="648" y="120">
            <parameter key="attribute" value="text"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="5.1.011" expanded="true" height="76" name="Generate ID" width="90" x="782" y="120"/>
          <operator activated="true" class="write_excel" compatibility="5.1.011" expanded="true" height="60" name="Write Excel" width="90" x="983" y="210">
            <parameter key="excel_file" value="D:\Desktop\documents.xls"/>
          </operator>
          <connect from_op="Get Page" from_port="output" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_op="Remove Duplicates" to_port="example set input"/>
          <connect from_op="Remove Duplicates" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Write Excel" to_port="input"/>
          <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • colocolo Member Posts: 236 Maven
    Hi Flake,

This looks good to me. I would probably prefer "Filter Examples" to get rid of the empty rows instead of using "Remove Duplicates", but this isn't really important.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.011">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
       <process expanded="true" height="589" width="835">
         <operator activated="true" class="web:get_webpage" compatibility="5.1.002" expanded="true" height="60" name="Get Page" width="90" x="45" y="30">
           <parameter key="url" value="http://www.microsoft.com/about/legal/en/us/IntellectualProperty/Copyright/Default.aspx"/>
           <parameter key="random_user_agent" value="true"/>
           <list key="query_parameters"/>
         </operator>
         <operator activated="true" class="text:cut_document" compatibility="5.1.001" expanded="true" height="60" name="Cut Document" width="90" x="179" y="30">
           <parameter key="query_type" value="XPath"/>
           <list key="string_machting_queries"/>
           <list key="regular_expression_queries"/>
           <list key="regular_region_queries"/>
           <list key="xpath_queries">
             <parameter key="paragraph" value="//h:p"/>
             <parameter key="list" value="//h:li"/>
           </list>
           <list key="namespaces"/>
           <list key="index_queries"/>
           <process expanded="true" height="589" width="30">
             <connect from_port="segment" to_port="document 1"/>
             <portSpacing port="source_segment" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="text:process_documents" compatibility="5.1.001" expanded="true" height="94" name="Process Documents" width="90" x="313" y="30">
           <parameter key="create_word_vector" value="false"/>
           <parameter key="add_meta_information" value="false"/>
           <parameter key="keep_text" value="true"/>
           <process expanded="true" height="589" width="567">
             <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.002" expanded="true" height="60" name="Extract Content" width="90" x="45" y="30"/>
             <connect from_port="document" to_op="Extract Content" to_port="document"/>
             <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="filter_examples" compatibility="5.1.011" expanded="true" height="76" name="Filter Examples" width="90" x="447" y="120">
           <parameter key="condition_class" value="attribute_value_filter"/>
           <parameter key="parameter_string" value="text != \w*"/>
         </operator>
         <operator activated="false" class="remove_duplicates" compatibility="5.1.011" expanded="true" height="76" name="Remove Duplicates" width="90" x="447" y="30">
           <parameter key="attribute" value="text"/>
           <parameter key="include_special_attributes" value="true"/>
         </operator>
         <operator activated="true" class="generate_id" compatibility="5.1.011" expanded="true" height="76" name="Generate ID" width="90" x="581" y="30"/>
         <operator activated="true" class="write_excel" compatibility="5.1.011" expanded="true" height="60" name="Write Excel" width="90" x="715" y="30">
           <parameter key="excel_file" value="D:\Desktop\documents.xls"/>
         </operator>
         <connect from_op="Get Page" from_port="output" to_op="Cut Document" to_port="document"/>
         <connect from_op="Cut Document" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
         <connect from_op="Process Documents" from_port="example set" to_op="Filter Examples" to_port="example set input"/>
         <connect from_op="Filter Examples" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
         <connect from_op="Generate ID" from_port="example set output" to_op="Write Excel" to_port="input"/>
         <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
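The filtering step itself is simple to picture outside RapidMiner; a minimal Python sketch of dropping the empty rows left over after HTML stripping (sample data is hypothetical):

```python
# Hypothetical extracted segments; some are empty after HTML tag removal.
segments = ["Paragraph one.", "", "   ", "List item."]

# Keep only rows containing non-whitespace text, mirroring the
# "Filter Examples" step that replaces Remove Duplicates above.
non_empty = [s for s in segments if s.strip()]
print(non_empty)  # ['Paragraph one.', 'List item.']
```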
Since you are using more than one cut expression in the "Cut Document" operator, you may want to know which query an example came from. If you are interested in this, activate "add meta information" for "Process Documents" and identify the source by looking at the attribute query_key (many of the other attributes can be filtered out using "Select Attributes"). If you don't need this information, you're already fine.

You have some options for changing the operator chaining a bit (e.g. putting the HTML removal inside "Cut Document", or "Cut Document" inside "Process Documents"), but this doesn't really change anything. If I had created such a process, it would probably look much the same.

    Regards
    Matthias
