read data from html tables on web pages

Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,226   Unicorn
edited November 2018 in Help

There are many pages on the web that contain useful data in the form of simple html tables.  Here's an example:

https://en.wikipedia.org/wiki/List_of_metropolitan_areas_of_the_United_States

 

I know how to use RapidMiner to retrieve this data automatically in html form using "get page" and store it as a document, and I even know how to do this iteratively if a set of related pages are required.  I also have some familiarity with how to manipulate documents, but what I really want is to extract the information in the html table into a usable example set in RapidMiner.  Is there any relatively simple way of doing the following:  

  1. collect the table column headers and use them as attribute names
  2. collect each data row from the table and store it as an example
  3. identify and set the appropriate data type for each resulting attribute

 

It seems like it would be an incredibly useful operator that did all this automatically -  "HTML table to data" or something similar.  I'm fairly certain that such an operator doesn't exist (yet), but I'm not even sure of the collection of existing operators that would be required to do all of the above.  Any ideas @mschmitz or @Thomas_Ott ?

 

Thanks.

 

Brian T.
Lindon Ventures 
Data Science Consulting from Certified RapidMiner Experts

Best Answer

  • Telcontar120Telcontar120 Posts: 1,226   Unicorn
    Solution Accepted
    This is now solved by the Read HTML table operator.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts

Answers

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761   Unicorn

    Hi @Telcontar120,

     

    Unfortunately there is no HTML table to Example set operator. I wish there was because there is much good info trapped in tables around the Interwebz.  The only work around I can see is using Xpath to extract the tables from their respective <div> and <td>, <tr> tags. 

     

    @mschmitz any other ideas?

  • mschmitzmschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,113  RM Data Scientist

    What about HTML to XML and then Read XML?

     

    ~Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,226   Unicorn

    I believe that is basically the xpath route that @Thomas_Ott is suggesting--which unfortunately involves a lot of manual fiddling with xpath expressions.  Plus I think you would actually need two separate streams, one to extract the column titles and apply those as attribute names, and another to extract the actual data for the examples.  So it's all possible, just not easy (and probably not for the faint of heart either).  Well, I know what I will be suggesting next in the product ideas forum :-)

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • robinrobin Member Posts: 87   Unicorn

    But how would you do that step of reding the html into xml?

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761   Unicorn

    The Text Mining extension has an HTML to XML operator. Give that a try.

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,226   Unicorn

    @Thomas_Ott If anyone comes up with an elegant solution to this, I am still interested.  My experience is that the HTML to XML is still quite messy.  In the meantime, my preferred workaround is using Google Sheets to read the html table from the URL (they have a very handy function for exactly that, much like what I was suggesting for RapidMiner).  Then I can export that sheet as a csv and read it into RapidMiner.  It would be even nicer if RapidMiner had the ability to dynamically read Google Sheets, but I haven't figured that out yet either...

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761   Unicorn

    Hmm, I recently did a proof of concept where I imported a RM webservice into Google Sheets and managed to score data in the Google Sheet by passing to to the Server. 

     

    I wonder if there's a way to cut out that export step you're using.

     

Sign In or Register to comment.