RapidMiner can already retrieve a page as HTML with the "Get Page" operator and store it as a document, and it can do this iteratively when a set of related pages is required. What users often want, though, is to extract the data in an HTML table into a usable example set in RapidMiner. An operator should therefore be created that does the following:
collect the table column headers and use them as attribute names
collect each data row from the table and store it as an example
identify and set the appropriate data type for each resulting attribute
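
To make the request concrete, here is a rough sketch of those three steps in plain Python, outside RapidMiner. It is only an illustration of the behaviour I have in mind, not how the operator would have to be built. The names `TableParser`, `infer_type` and `html_table_to_examples` are made up for this post, and the sketch assumes the page holds a single, non-nested, rectangular table whose header cells are `<th>` elements in the first row.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects header and data cells (assumes one simple, non-nested table)."""

    def __init__(self):
        super().__init__()
        self.headers = []        # text of every <th> cell, used as attribute names
        self.rows = []           # one list of <td> texts per data row
        self._row = None         # cells of the <tr> currently being read
        self._cell = None        # text fragments of the cell currently being read
        self._is_header = False  # True while inside a <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []
            self._is_header = tag == "th"

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            text = "".join(self._cell).strip()
            self._cell = None
            if self._is_header:
                self.headers.append(text)
            else:
                self._row.append(text)
        elif tag == "tr" and self._row is not None:
            if self._row:  # the header row has no <td> cells, so it is skipped
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def infer_type(values):
    """Return int, float or str: the first type that parses every value."""
    for candidate in (int, float):
        try:
            for value in values:
                candidate(value)
            return candidate
        except ValueError:
            continue
    return str


def html_table_to_examples(html):
    """Return (attribute names, attribute types, list of example dicts)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    columns = list(zip(*parser.rows))  # transpose rows into columns
    types = [infer_type(column) for column in columns]
    examples = [
        {name: cast(value) for name, cast, value in zip(parser.headers, types, row)}
        for row in parser.rows
    ]
    return parser.headers, types, examples
```

Calling `html_table_to_examples(page_html)` on the text of a retrieved page would give back the attribute names, one guessed type per attribute, and one dictionary per example. The type guessing here only distinguishes integers, reals and text; dates, nominal values and missing cells are exactly the kind of thing a wizard would let the user correct.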
An operator that did all of this automatically, called "HTML Table to Data" or something similar, would be incredibly useful. It could work much like the Read CSV operator, with a small wizard to identify the table and its columns, set the data types, and so on.
P.S. I know this is technically feasible using a series of XPath expressions in the Read XML operator. However, after I consulted some RM product experts in the general Studio forum, the consensus was that under the current options it is still a multi-step process that requires good knowledge of XPath. A single operator that did all the parsing, renaming, etc. would therefore be a significant improvement.