"Extract data from a website to Excel-file"

currantcurrant Member Posts: 14 Contributor II
edited May 2019 in Help
Hi!

I want to extract some data from the following website:
http://ec.europa.eu/sanco_pesticides/public/index.cfm?event=substance.selection

I want to generate a list in Excel with an ID-column and a pesticide name-column. Both should be extracted from the website, section "2) Select Pesticide".

As I am a RM beginner, I did not manage to cut the html-document and generate the two columns... The only thing I was able to do, was to get the web page on my computer.

Could anyone give me some hints I could start with?

Thanks in advance!
Currant
Tagged:

Answers

  • colocolo Member Posts: 236 Maven
    Hu Currant,

    this seems to be an easy task. After retrieving the website I would suggest to use XPath since the desired data is structured within a HTML select element. If you want to deal with multiple XPath or RegEx matches there is only one interesting operator: "Cut Document" (contained in the Text Processing extension).

    Here a little working example. You just have to add the id and convert the collection into a table ("Documents to Data").
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
        <process expanded="true" height="145" width="279">
          <operator activated="true" class="web:get_webpage" compatibility="5.1.000" expanded="true" height="60" name="Get Page" width="90" x="45" y="30">
            <parameter key="url" value="http://ec.europa.eu/sanco_pesticides/public/index.cfm?event=substance.selection"/>
            <parameter key="random_user_agent" value="true"/>
            <list key="query_parameters"/>
          </operator>
          <operator activated="true" class="text:cut_document" compatibility="5.1.001" expanded="true" height="60" name="Cut Document" width="90" x="179" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="content" value="//h:select[@id='optionsList1']/h:option/text()"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true" height="607" width="747">
              <connect from_port="segment" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Get Page" from_port="output" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Best regards
    Matthias
  • currantcurrant Member Posts: 14 Contributor II
    Dear All,

    Now, I managed to the html-files on my computer and I could extract the html-information from the file. See script below.

    Code:
    <process version="5.1.006">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
        <process expanded="true" height="283" width="435">
          <operator activated="true" class="text:read_document" compatibility="5.1.001" expanded="true" height="60" name="Read Document" width="90" x="45" y="30">
            <parameter key="file" value="C:\temp\Bentazon.htm"/>
            <parameter key="extract_text_only" value="false"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.1.001" expanded="true" height="94" name="Process Documents" width="90" x="179" y="30">
            <process expanded="true" height="296" width="429">
              <operator activated="true" class="text:extract_information" compatibility="5.1.001" expanded="true" height="60" name="Extract Information" width="90" x="45" y="30">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="Namen" value="//h:h1[@style='text-align:left']/text()"/>
                  <parameter key="Animal" value="//h:div[@class='col60']/h:ul/h:li[2]/text()"/>
                </list>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <operator activated="true" class="store" compatibility="5.1.006" expanded="true" height="60" name="Store" width="90" x="179" y="30">
                <parameter key="repository_entry" value="RapidMiner/RDs"/>
              </operator>
              <connect from_port="document" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_op="Store" to_port="input"/>
              <connect from_op="Store" from_port="through" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Now, my question is:

    How can I export the attributes "Namen" and "Animal" to Excel?
    (It's probably an easy job ...  but I could not work it out...

    currant
  • Marco_BoeckMarco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering
    Hi,

    you can select the desired attributes via the "Select Attributes" operator, just select subset as the attribute filter type and select the desired attributes. Then put in a "Write Excel" operator, specify the file, and voilà.

    Your modified process (you might have to change the selected attributes):
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.008">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
       <process expanded="true" height="283" width="547">
         <operator activated="true" class="text:read_document" compatibility="5.1.001" expanded="true" height="60" name="Read Document" width="90" x="45" y="30">
           <parameter key="file" value="C:\temp\Bentazon.htm"/>
           <parameter key="extract_text_only" value="false"/>
         </operator>
         <operator activated="true" class="text:process_documents" compatibility="5.1.001" expanded="true" height="94" name="Process Documents" width="90" x="179" y="30">
           <process expanded="true" height="296" width="550">
             <operator activated="true" class="text:extract_information" compatibility="5.1.001" expanded="true" height="60" name="Extract Information" width="90" x="45" y="30">
               <parameter key="query_type" value="XPath"/>
               <list key="string_machting_queries"/>
               <list key="regular_expression_queries"/>
               <list key="regular_region_queries"/>
               <list key="xpath_queries">
                 <parameter key="Namen" value="//h:h1[@style='text-align:left']/text()"/>
                 <parameter key="Animal" value="//h:div[@class='col60']/h:ul/h:li[2]/text()"/>
               </list>
               <list key="namespaces"/>
               <list key="index_queries"/>
             </operator>
             <operator activated="true" class="store" compatibility="5.1.008" expanded="true" height="60" name="Store" width="90" x="179" y="30">
               <parameter key="repository_entry" value="RapidMiner/RDs"/>
             </operator>
             <connect from_port="document" to_op="Extract Information" to_port="document"/>
             <connect from_op="Extract Information" from_port="document" to_op="Store" to_port="input"/>
             <connect from_op="Store" from_port="through" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="select_attributes" compatibility="5.1.008" expanded="true" height="76" name="Select Attributes" width="90" x="313" y="30">
           <parameter key="attribute_filter_type" value="subset"/>
           <parameter key="attributes" value="|Namen|Animal"/>
         </operator>
         <operator activated="true" class="write_excel" compatibility="5.1.008" expanded="true" height="60" name="Write Excel" width="90" x="447" y="30">
           <parameter key="excel_file" value="C:\export.xls"/>
         </operator>
         <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
         <connect from_op="Process Documents" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
         <connect from_op="Select Attributes" from_port="example set output" to_op="Write Excel" to_port="input"/>
         <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
    Regards,
    Marco
  • currantcurrant Member Posts: 14 Contributor II
    Now it works!

    Thanx a lot for your support!

    currant
Sign In or Register to comment.