Read HTML Table - Extension Operator

btibertbtibert Member, University Professor Posts: 146 Guru
edited September 2019 in Help
I was expecting the following URL to parse properly:

https://www.hockey-reference.com/leagues/NHL_2019_skaters.html

However, the operator did not find any tables on the page.  The tutorial process does properly parse tables from wikipedia, but fails on the page above.
That said, this is my go-to reference for my students as the tables are easily parsed in R and Python.  For example:

import pandas as pd
tables = pd.read_html("https://www.hockey-reference.com/leagues/NHL_2019_skaters.html")
skaters = tables[0]
skaters.head().

Yes, there has to be some cleanup on the columns and data types, but that is part of the exercise and why I like using this reference.  I figured it would be even more powerful as a training exercise in RM given the amount of data prep that is necessary.

Any helps or tips on how to configure this operator would be much appreciated!

Tagged:

Best Answer

Answers

  • MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    tagging @ey
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • eyey Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 21 RM Research
    Dear btibert 

    The "Read HTML Table/Tables" operator is built on a set of rules that allow it to extract data from HTML tables. Unfortunately, as seen from some other questions, very often, the visual html interface looks simpler than its html source code. The tables in the page you are referring to, are structured in a rather complex manner, which is not a standard table (as considered by the operator). The operator is using a classifier that was trained by data (a subset of common crawl corpus) and works best with more straight-forward tables (non-nested header, no column spanning, clear definition of row and data items in the table body without further nesting, etc.). Hence, in this case the operator is not successful. Also, the "Extract Structured Data" operator, which we wrote to extract data tables from sites which use schema.org standard to encode microdata (completely or incompletely), is not of much use here. The reasons are 1) there is no use of schema.org tags and 2) the dom based parsing used by this operator requries a hook to an encapsulating node in the DOM tree, which is not used in any standard manner i.e. the classes or itemtype for the table elements (header and rows) are missing.

    This leaves us to the last option, which is not ideal given the processing time involved, but I tried it anyway. Save the page as PDF, then try the "Read PDF Tables" operator on the PDF file. Try to tune extraction criteria so empty and no-row tables are ignored. This operator is using image detection techniqus from Tabula Java (and we plan to update it further) to extract data from detected tables. I am attaching the zip file containing IO objects (representing ExampleSets), which you can unzip and paste in your local repository. This operator was able to extract all data from the tables, but certain tables are appended in 1 ExampleSet, which you can split of course. This may not be an ideal solution but depending on your circumstances and context, it still provides one solution.

    Thanks for your feedback. We have been thinking on incorporating some recent libraries and will continue to evaluate how to improve web table extraction in near future.

    Best Regards,
    Edwin Yaqub
  • MarcoBarradasMarcoBarradas Administrator, Employee, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
    @btibert Maybe this example can help you. 
    I now it is not as simple as on Python o R but this is how you may extract the information when the other operators fail
    I'm not the best with it but hope it works.
    <?xml version="1.0" encoding="UTF-8"?><process version="9.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.3.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="313" y="34">
            <parameter key="url" value="https://www.hockey-reference.com/leagues/NHL_2019_skaters.html"/>
            <parameter key="random_user_agent" value="true"/>
            <parameter key="connection_timeout" value="10000"/>
            <parameter key="read_timeout" value="10000"/>
            <parameter key="follow_redirects" value="true"/>
            <parameter key="accept_cookies" value="all"/>
            <parameter key="cookie_scope" value="thread"/>
            <parameter key="request_method" value="GET"/>
            <list key="query_parameters"/>
            <list key="request_properties"/>
            <parameter key="override_encoding" value="false"/>
            <parameter key="encoding" value="SYSTEM"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="8.2.000" expanded="true" height="103" name="Extract_Information" width="90" x="447" y="34">
            <parameter key="create_word_vector" value="false"/>
            <parameter key="vector_creation" value="TF-IDF"/>
            <parameter key="add_meta_information" value="true"/>
            <parameter key="keep_text" value="false"/>
            <parameter key="prune_method" value="none"/>
            <parameter key="prune_below_percent" value="3.0"/>
            <parameter key="prune_above_percent" value="30.0"/>
            <parameter key="prune_below_rank" value="0.05"/>
            <parameter key="prune_above_rank" value="0.95"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
            <process expanded="true">
              <operator activated="true" class="text:cut_document" compatibility="8.2.000" expanded="true" height="68" name="Cut Rows" width="90" x="380" y="34">
                <parameter key="query_type" value="Regular Region"/>
                <list key="string_machting_queries"/>
                <parameter key="attribute_type" value="Nominal"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries">
                  <parameter key="Row" value="&lt;tr &gt;.&lt;/tr&gt;"/>
                </list>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
                <parameter key="ignore_CDATA" value="true"/>
                <parameter key="assume_html" value="true"/>
                <list key="index_queries"/>
                <list key="jsonpath_queries"/>
                <process expanded="true">
                  <operator activated="true" class="text:extract_information" compatibility="8.2.000" expanded="true" height="68" name="Get_Data" width="90" x="380" y="34">
                    <parameter key="query_type" value="Regular Expression"/>
                    <list key="string_machting_queries"/>
                    <parameter key="attribute_type" value="Nominal"/>
                    <list key="regular_expression_queries">
                      <parameter key="Rank" value="&lt;.*data-stat=&quot;ranker&quot;.*&gt;(\d+)+&lt;/th&gt;"/>
                      <parameter key="Player" value="&lt;.*data-stat=&quot;player&quot;.*&gt;([\w\s]+)&lt;/a&gt;&lt;/td&gt;&lt;td.*data-stat=&quot;age&quot; &gt;"/>
                      <parameter key="Age" value="&lt;td.*data-stat=&quot;age&quot;.*&gt;(\d+)&lt;/td&gt;&lt;td.*&quot;team_id&quot;.*&gt;"/>
                    </list>
                    <list key="regular_region_queries">
                      <parameter key="Rank" value="&lt;\.*data-stat=&quot;ranker&quot;\.*&gt;.&lt;/th&gt;"/>
                      <parameter key="Player" value="&lt;\.*data-stat=&quot;player&quot;\.*&gt;.&lt;/th&gt;"/>
                    </list>
                    <list key="xpath_queries"/>
                    <list key="namespaces"/>
                    <parameter key="ignore_CDATA" value="true"/>
                    <parameter key="assume_html" value="true"/>
                    <list key="index_queries"/>
                    <list key="jsonpath_queries"/>
                  </operator>
                  <connect from_port="segment" to_op="Get_Data" to_port="document"/>
                  <connect from_op="Get_Data" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_segment" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <connect from_port="document" to_op="Cut Rows" to_port="document"/>
              <connect from_op="Cut Rows" from_port="documents" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Get Page" from_port="output" to_op="Extract_Information" to_port="documents 1"/>
          <connect from_op="Extract_Information" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    


  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    edited September 2019
    hi @btibert ...perhaps this would be of interest if you are looking at NHL data? It is a LOT more granular (i.e. more fun!) than that stats page.

    https://community.rapidminer.com/discussion/44904/using-the-nhl-api-to-analyze-pro-ice-hockey-data-part-1

    Scott

  • btibertbtibert Member, University Professor Posts: 146 Guru
    Oh wow, thanks for the notes everyone.  I will give it a review.  As you might be able tell, I am now fully kicking the tires relative to my course and where I might be able to provide exercises.  The examples above show how truly powerful (and flexible) the tool is.  Thanks!
  • btibertbtibert Member, University Professor Posts: 146 Guru
    @sgenzer
    Thats a great link!  I have been playing around with the API for a while (in R and Python) so thats absolutely fantastic to see it here in this tool too!
Sign In or Register to comment.