[SOLVED] Extracting webpage content to CSV rows

scepxko Member Posts: 15 Maven
edited March 2019 in Help
Hi everyone.
Old & (very) rusty RapidMiner fan needs a hint! ;)

* I have a single webpage containing information I want to export into a CSV file.
* At the end of the process, I'm expecting 3 columns (name, address, URL).
* With my current flow, I get a single column containing all the names in the first rows, then all the addresses, then all the URLs...

Here's the flow (RapidMiner 5.3, but I get the same result with 9.2):

Thank you!
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:read_document" compatibility="5.3.002" expanded="true" height="60" name="Read Document" width="90" x="45" y="30">
        <parameter key="file" value="E:\Rapidminer\Expert.htm"/>
        <parameter key="extract_text_only" value="false"/>
        <parameter key="use_file_extension_as_type" value="false"/>
        <parameter key="encoding" value="UTF-8"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="94" name="Process Documents" width="90" x="180" y="30">
        <parameter key="create_word_vector" value="false"/>
        <parameter key="add_meta_information" value="false"/>
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <operator activated="true" class="text:cut_document" compatibility="5.3.002" expanded="true" height="60" name="Cut Document (3)" width="90" x="45" y="30">
            <list key="string_machting_queries">
              <parameter key="url" value="&lt;a href=&quot;.&quot;&gt;&lt;span"/>
              <parameter key="title" value="&lt;span class=&quot;title&quot;&gt;.&lt;/span&gt;"/>
              <parameter key="address" value="&lt;span class=&quot;address&quot;&gt;.&lt;/span&gt;"/>
            </list>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="link" value="//h:a[@class=&quot;PinImage ImgLink&quot;]/@href"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true">
              <connect from_port="segment" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Cut Document (3)" to_port="document"/>
          <connect from_op="Cut Document (3)" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.3.015" expanded="true" height="76" name="Select Attributes" width="90" x="315" y="30">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="|address|url|title"/>
      </operator>
      <operator activated="true" class="write_excel" compatibility="5.3.015" expanded="true" height="76" name="Write Excel" width="90" x="450" y="30">
        <parameter key="excel_file" value="E:\Rapidminer\expert.xls"/>
      </operator>
      <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Write Excel" to_port="input"/>
      <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>




Best Answer

  • scepxko Member Posts: 15 Maven
    edited February 2019 Solution Accepted
    Thank you for reminding me about splitting and merging using IDs!
    I found a workable (if inelegant) solution that does the job:

    Read the page -> Multiply -> one Cut Document + Documents to Data + Generate ID branch for each element I wanted, then Join the branches on the generated ID, in the order (1+2)+3.

    Here is my working solution:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.015">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:read_document" compatibility="5.3.002" expanded="true" height="60" name="Read Document" width="90" x="45" y="75">
            <parameter key="file" value="E:\Rapidminer\Expert.htm"/>
            <parameter key="extract_text_only" value="false"/>
            <parameter key="use_file_extension_as_type" value="false"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="5.3.015" expanded="true" height="112" name="Multiply (2)" width="90" x="179" y="75"/>
          <operator activated="true" class="text:cut_document" compatibility="5.3.002" expanded="true" height="60" name="Cut Document (5)" width="90" x="313" y="30">
            <list key="string_machting_queries">
              <parameter key="title" value="&lt;span class=&quot;title&quot;&gt;.&lt;/span&gt;"/>
            </list>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="link" value="//h:a[@class=&quot;PinImage ImgLink&quot;]/@href"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true">
              <connect from_port="segment" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:documents_to_data" compatibility="5.3.002" expanded="true" height="76" name="Documents to Data (5)" width="90" x="447" y="30">
            <parameter key="text_attribute" value="text2"/>
            <parameter key="add_meta_information" value="false"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="5.3.015" expanded="true" height="76" name="Generate ID (7)" width="90" x="581" y="30"/>
          <operator activated="true" class="text:cut_document" compatibility="5.3.002" expanded="true" height="60" name="Cut Document (6)" width="90" x="313" y="210">
            <list key="string_machting_queries">
              <parameter key="url" value="&lt;a href=&quot;.&quot;&gt;&lt;span"/>
            </list>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="link" value="//h:a[@class=&quot;PinImage ImgLink&quot;]/@href"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true">
              <connect from_port="segment" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:documents_to_data" compatibility="5.3.002" expanded="true" height="76" name="Documents to Data (4)" width="90" x="447" y="210">
            <parameter key="text_attribute" value="text1"/>
            <parameter key="add_meta_information" value="false"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="5.3.015" expanded="true" height="76" name="Generate ID" width="90" x="581" y="210"/>
          <operator activated="true" class="text:cut_document" compatibility="5.3.002" expanded="true" height="60" name="Cut Document (7)" width="90" x="313" y="120">
            <list key="string_machting_queries">
              <parameter key="address" value="&lt;span class=&quot;address&quot;&gt;.&lt;/span&gt;"/>
            </list>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="link" value="//h:a[@class=&quot;PinImage ImgLink&quot;]/@href"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true">
              <connect from_port="segment" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:documents_to_data" compatibility="5.3.002" expanded="true" height="76" name="Documents to Data (6)" width="90" x="447" y="120">
            <parameter key="text_attribute" value="text3"/>
            <parameter key="add_meta_information" value="false"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="5.3.015" expanded="true" height="76" name="Generate ID (6)" width="90" x="581" y="120"/>
          <operator activated="true" class="join" compatibility="5.3.015" expanded="true" height="76" name="Join (3)" width="90" x="715" y="75">
            <list key="key_attributes"/>
          </operator>
          <operator activated="true" class="join" compatibility="5.3.015" expanded="true" height="76" name="Join (4)" width="90" x="782" y="210">
            <list key="key_attributes"/>
          </operator>
          <operator activated="true" class="write_excel" compatibility="5.3.015" expanded="true" height="76" name="Write Excel" width="90" x="916" y="210">
            <parameter key="excel_file" value="E:\Rapidminer\expert.xls"/>
          </operator>
          <connect from_op="Read Document" from_port="output" to_op="Multiply (2)" to_port="input"/>
          <connect from_op="Multiply (2)" from_port="output 1" to_op="Cut Document (5)" to_port="document"/>
          <connect from_op="Multiply (2)" from_port="output 2" to_op="Cut Document (7)" to_port="document"/>
          <connect from_op="Multiply (2)" from_port="output 3" to_op="Cut Document (6)" to_port="document"/>
          <connect from_op="Cut Document (5)" from_port="documents" to_op="Documents to Data (5)" to_port="documents 1"/>
          <connect from_op="Documents to Data (5)" from_port="example set" to_op="Generate ID (7)" to_port="example set input"/>
          <connect from_op="Generate ID (7)" from_port="example set output" to_op="Join (3)" to_port="left"/>
          <connect from_op="Cut Document (6)" from_port="documents" to_op="Documents to Data (4)" to_port="documents 1"/>
          <connect from_op="Documents to Data (4)" from_port="example set" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Join (4)" to_port="right"/>
          <connect from_op="Cut Document (7)" from_port="documents" to_op="Documents to Data (6)" to_port="documents 1"/>
          <connect from_op="Documents to Data (6)" from_port="example set" to_op="Generate ID (6)" to_port="example set input"/>
          <connect from_op="Generate ID (6)" from_port="example set output" to_op="Join (3)" to_port="right"/>
          <connect from_op="Join (3)" from_port="join" to_op="Join (4)" to_port="left"/>
          <connect from_op="Join (4)" from_port="join" to_op="Write Excel" to_port="input"/>
          <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
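
For readers who want to follow the logic of this process outside RapidMiner, here is a minimal Python sketch of the same idea: extract each field in its own pass, number the matches in page order, then join the three results on that number. The HTML snippet, the `extract_with_id` helper, and the regular expressions are illustrative assumptions modeled on the string-matching queries in the process; they are not taken from the actual Expert.htm page.

```python
import re

# Hypothetical page fragment; the real Expert.htm markup is not shown in the thread.
html = (
    '<a href="http://example.com/a"><span class="title">Alice</span></a>'
    '<span class="address">1 Main St</span>'
    '<a href="http://example.com/b"><span class="title">Bob</span></a>'
    '<span class="address">2 Oak Ave</span>'
)


def extract_with_id(pattern, text):
    """One extraction branch: find every match and number it from 1 in page order."""
    return {i: value for i, value in enumerate(re.findall(pattern, text), start=1)}


# One branch per field, mirroring the three Cut Document + Generate ID branches.
titles = extract_with_id(r'<span class="title">(.*?)</span>', html)
addresses = extract_with_id(r'<span class="address">(.*?)</span>', html)
urls = extract_with_id(r'<a href="(.*?)"><span', html)

# Inner join on the generated ID, in the same (title + address) + url order.
shared_ids = sorted(titles.keys() & addresses.keys() & urls.keys())
rows = [(titles[i], addresses[i], urls[i]) for i in shared_ids]

for row in rows:
    print(row)
```

Like the process, this sketch pairs fields purely by position, so it likely only gives correct rows when every record carries all three fields in the same order. A record with a missing address, for example, would shift every later address up by one row.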




Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Filter Examples by Range should let you split this into 3 separate datasets (one for names, one for addresses, one for URLs). Then add an index to each one using Generate ID, and combine them back together with Merge Attributes (you'll need the free Operator Toolbox extension for this last step).
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
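
The split, index, and merge steps suggested in this answer can be illustrated with a short Python sketch. It assumes the single stacked column holds equally sized blocks (all names, then all addresses, then all URLs); the `unstack_column` name and the sample values are hypothetical and only stand in for the RapidMiner operators mentioned above.

```python
def unstack_column(values, n_fields):
    """Turn one stacked column into rows of n_fields values each."""
    if n_fields <= 0 or len(values) % n_fields != 0:
        raise ValueError("column length must be a multiple of a positive n_fields")
    block = len(values) // n_fields
    # Split by range: one contiguous slice per field.
    blocks = [values[k * block:(k + 1) * block] for k in range(n_fields)]
    # Index and merge: the position inside each block acts as the shared ID.
    return [tuple(b[i] for b in blocks) for i in range(block)]


# Hypothetical stacked output: names first, then addresses, then URLs.
stacked = [
    "Alice", "Bob",
    "1 Main St", "2 Oak Ave",
    "http://example.com/a", "http://example.com/b",
]
rows = unstack_column(stacked, 3)

for name, address, url in rows:
    print(name, address, url)
```

The length check is there because the approach depends on each field occurring the same number of times; if the blocks differ in size, the ranges would need to be set by hand.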