Read data from HTML tables
There are many pages on the web that contain useful data in the form of simple HTML tables. Here's an example:
RapidMiner can already retrieve a page as HTML with the "Get Page" operator and store it as a document, and it can do this iteratively when a set of related pages is required. What users often want, though, is to extract the information in the HTML table into a usable example set in RapidMiner. So an operator should be created that does the following:
- collect the table column headers and use them as attribute names
- collect each data row from the table and store it as an example
- identify and set the appropriate data type for each resulting attribute
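To make the three steps above concrete, here is a rough sketch in plain Python (standard library only) of what such an operator would have to do. It is not RapidMiner code. The names `TableParser`, `infer_type`, `html_table_to_examples` and the sample table are all made up for illustration, and it assumes the page holds a single simple table whose first row is the header.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects cell text row by row (assumes one simple, non-nested table)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being filled, or None when outside <tr>
        self._cell = None  # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def infer_type(values):
    """Return int, float or str: the narrowest type that fits every value."""
    for caster in (int, float):
        try:
            for value in values:
                caster(value)
            return caster
        except ValueError:
            continue
    return str


def html_table_to_examples(html):
    """Header row -> attribute names, data rows -> typed examples (dicts)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return [], []
    header, *data = parser.rows
    # Skip ragged rows rather than guessing how to align them.
    data = [row for row in data if len(row) == len(header)]
    casters = [infer_type([row[i] for row in data]) for i in range(len(header))]
    examples = [
        {name: cast(value) for name, cast, value in zip(header, casters, row)}
        for row in data
    ]
    return header, examples


# Hypothetical sample data, invented for this sketch.
SAMPLE_HTML = """
<table>
  <tr><th>city</th><th>population</th><th>area_km2</th></tr>
  <tr><td>Springfield</td><td>30000</td><td>12.5</td></tr>
  <tr><td>Shelbyville</td><td>25000</td><td>10</td></tr>
</table>
"""

header, examples = html_table_to_examples(SAMPLE_HTML)
print(header)
print(examples)
```

A real operator would also need to deal with multiple or nested tables, colspan/rowspan, dates, and missing values, which is where a small wizard would help.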
An operator that did all this automatically ("HTML Table to Data" or something similar) would be incredibly useful. In theory it could work like the Read CSV operator, with a small wizard to identify the table and its columns, set the data types, etc.
P.S. I know this is technically feasible using a series of XPath expressions in the Read XML operator. But I consulted some RM product experts in the general Studio forum, and the consensus is that under the current options it is still a multi-step process that requires a good knowledge of XPath. So a single operator that did all the parsing, renaming, etc. would be a significant improvement.
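For anyone wondering what the multi-step XPath route looks like, here is a small illustration in Python rather than in RapidMiner's Read XML operator, so treat it only as an analogy. It uses the limited XPath subset of `xml.etree.ElementTree`, assumes the page is well-formed XHTML (real pages often are not), and the table id `prices` and its contents are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML page invented for this sketch.
XHTML = """
<html><body>
  <table id="prices">
    <tr><th>item</th><th>price</th></tr>
    <tr><td>apple</td><td>1.20</td></tr>
    <tr><td>pear</td><td>0.95</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(XHTML)

# Step 1: one expression to locate the right table.
table = root.find(".//table[@id='prices']")

# Step 2: another expression for the column headers.
names = [th.text for th in table.findall("./tr/th")]

# Step 3: another for the data rows (only rows that contain <td> cells).
rows = [[td.text for td in tr.findall("./td")] for tr in table.findall("./tr[td]")]

print(names)
print(rows)
```

Even in this tiny case you need a separate expression per step, and every value still comes back as text, so renaming and type conversion remain extra steps.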