read data from html tables

Status: New
by ‎08-30-2016 10:06 AM

There are many pages on the web that contain useful data in the form of simple html tables.  Here's an example:



RapidMiner can retrieve pages automatically in html form using "Get Page" and store them as documents, and it can even do this iteratively if a set of related pages is required.  But what users often want is to extract the information in the html table into a usable example set in RapidMiner.  So an operator should be created that does the following:

  1. collect the table column headers and use them as attribute names
  2. collect each data row from the table and store it as an example
  3. identify and set the appropriate data type for each resulting attribute
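
To make the three steps concrete, here is a rough sketch in plain Python (standard library only) of what such an operator would have to do. It is not RapidMiner code; the names `TableParser`, `infer_type`, `html_table_to_examples` and the sample table are made up for illustration, and it assumes a single, non-nested table whose first row holds the headers and whose rows all have the same number of cells.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Hypothetical helper: collects the cell text of every <tr> into rows."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built (None outside <tr>)
        self._cell = None  # text pieces of the current cell (None outside a cell)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <th> or <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def infer_type(values):
    """Step 3: try integer first, then real, and fall back to nominal."""
    for caster, name in ((int, "integer"), (float, "real")):
        try:
            for v in values:
                caster(v)
            return name, caster
        except ValueError:
            continue
    return "nominal", str


def html_table_to_examples(html):
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return [], {}, []
    header, *data = parser.rows          # step 1: first row gives attribute names
    columns = list(zip(*data))           # column-wise view used for type guessing
    types, casters = {}, []
    for name, column in zip(header, columns):
        type_name, caster = infer_type(column)
        types[name] = type_name
        casters.append(caster)
    # Step 2: every remaining row becomes one example with converted values.
    examples = [
        {name: cast(value) for name, cast, value in zip(header, casters, row)}
        for row in data
    ]
    return header, types, examples


# Made-up sample table, for illustration only.
SAMPLE = """
<table>
  <tr><th>city</th><th>population</th><th>area_km2</th></tr>
  <tr><td>Springfield</td><td>30000</td><td>12.5</td></tr>
  <tr><td>Shelbyville</td><td>45000</td><td>20</td></tr>
</table>
"""

header, types, examples = html_table_to_examples(SAMPLE)
print(header)
print(types)
print(examples)
```

A real operator would also need to deal with things this sketch ignores, such as several tables on one page, merged cells, missing values and date formats.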

An operator that did all this automatically, "HTML table to data" or something similar, seems like it would be incredibly useful.  In theory this could be similar to the read csv operator, with a small wizard to identify the table, select the columns, set the data types, etc.


P.S.  I know this is technically feasible using a series of xpath expressions in the read xml operator, but after I consulted some RM product experts in the general studio forum, the consensus was that, under current options, it is still a multi-step process that requires good knowledge of xpath.  So adding a single operator that does all the parsing, renaming, etc., would be a significant improvement.
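
To show why the xpath route feels multi-step, here is a small sketch of the same idea outside RapidMiner, using Python's `xml.etree.ElementTree` and its limited xpath subset. The page content and the `id="prices"` attribute are invented for illustration, and it assumes the page is well-formed XML (real-world html often is not and would need cleaning first).

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page used only for illustration.
XHTML = """<html><body>
<table id="prices">
  <tr><th>item</th><th>price</th></tr>
  <tr><td>apple</td><td>0.5</td></tr>
  <tr><td>pear</td><td>0.75</td></tr>
</table>
</body></html>"""

root = ET.fromstring(XHTML)

# Expression 1: locate the table itself.
table = root.find(".//table[@id='prices']")

# Expression 2: pull the header cells to use as attribute names.
headers = [th.text for th in table.findall("./tr/th")]

# Expressions 3 and 4: select the data rows, then the cells within each row.
rows = [[td.text for td in tr.findall("./td")] for tr in table.findall("./tr[td]")]

# Renaming step: pair each value with its attribute name.
records = [dict(zip(headers, row)) for row in rows]
print(records)
```

Even in this tiny case there are separate expressions for the table, the headers, the rows and the cells, and all values still come out as text, so the data types would have to be set in yet another step.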


This sounds a little like a simple version of how Diffbot works.  Have you tried out their service?

I agree having this as a simple operator would be pretty handy. 




Hi all,


My colleague Edwin Yaqub recently developed an extension for this use case. Maybe you can check it out and give us some feedback?


Here is the link to the RapidMiner blog post. The extension is available on the marketplace.



Best regards,