The Web Table Extraction Operator

Thomas_Ott · March 2017

By: Edwin Yaqub, PhD

Within the RapidMiner Research team, I’m developing extensions for data enrichment and extraction as part of my work on the research project DS4DM (Data Search for Data Mining, http://ds4dm.de), with the aim of helping data mining processes produce better results. Today we released the ‘Web Table Extraction’ extension on the Marketplace, and here is an introduction to it.

Problem: Data scientists often need to read data from web pages. For instance, Wikipedia offers many data tables that could be put to use, but fine-grained scraping approaches get complicated for ordinary users because they often require regular-expression-based parsing and extraction of data from a web page’s content.

Solution: To ease this task, the ‘Web Table Extraction’ extension offers a convenient alternative that extracts data tables from Wiki-like websites and converts them to RapidMiner example sets.

You simply provide the URL of the web page, e.g. [1], to the ‘Read HTML Table’ operator and execute the process. Bingo! The operator has extracted 9 data tables as example sets in the blink of an eye.

Image 1.png Read HTML Table Results
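For readers who want to see the general idea outside RapidMiner, here is a minimal Python sketch that collects every table in an HTML string as a list of rows. It is not the operator's implementation (the operator also uses a classification model to decide which tables contain data); the names TableParser, extract_tables and SAMPLE_HTML are hypothetical, and the sample values are invented placeholders.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings.

    Simplification: nested tables and rowspan/colspan are not handled.
    """

    def __init__(self):
        super().__init__()
        self.tables = []    # all tables found so far
        self._row = None    # row being filled, or None outside <tr>
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return all tables found in the given HTML string."""
    parser = TableParser()
    parser.feed(html)
    return parser.tables


# Invented placeholder page with two small tables.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Value</th></tr>
  <tr><td>A</td><td>1,200</td></tr>
  <tr><td>B</td><td>950</td></tr>
</table>
<table>
  <tr><th>Country</th><th>Share</th></tr>
  <tr><td>A</td><td>10.5</td></tr>
</table>
"""

tables = extract_tables(SAMPLE_HTML)
for number, table in enumerate(tables, start=1):
    print(f"Table {number}: header={table[0]}, data rows={len(table) - 1}")
```

To try it on a live page, the HTML string would first have to be downloaded, for example with urllib.request from the standard library.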

Example: Now that we have an encyclopedia at our disposal, let us work through a simple example. One of the tables on [1] gives the GDP (Gross Domestic Product) values for past years and projections for the future. GDP is a measure of a country’s economic activity. Another table on the same page gives us GDP per capita, which can be interpreted as the productivity of a country’s workforce or its affluence. I’d like to see how these values change between 2015 and 2020. I’m also curious to see if affluence relates to obesity levels. For the latter, we can use the BMI data at this web page [2].

Thanks to the ‘Read HTML Table’ operator, we now have the tables as example sets. Next, we apply an inner join on the GDP, GDP per capita and BMI tables using the Country attribute. Here is a snapshot of the RapidMiner process for this (the process file is attached as well):

Image 2.png Extract HTML Table Process
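As a rough illustration of what the inner join on Country does, here is a small Python sketch. It is an assumption-laden stand-in for the Join operator, not its implementation: inner_join is a hypothetical helper, the rows are invented placeholders, and the sketch assumes each country appears at most once in the right-hand table.

```python
def inner_join(left, right, key):
    """Join two lists of dict rows, keeping only rows whose key is in both.

    Assumes each key value occurs at most once in `right`.
    """
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Merge the attributes of both rows into one record.
            joined.append({**row, **match})
    return joined


# Invented placeholder rows, not real GDP or obesity figures.
gdp = [
    {"Country": "A", "2015_gdp": 100},
    {"Country": "B", "2015_gdp": 200},
]
per_capita = [
    {"Country": "A", "2015_per_capita": 10},
    {"Country": "B", "2015_per_capita": 20},
    {"Country": "C", "2015_per_capita": 30},
]
bmi = [
    {"Country": "A", "Obesity": 5.0},
    {"Country": "C", "Obesity": 7.5},
]

# Chain two joins, as in the process: GDP with GDP per capita, then with BMI.
merged = inner_join(inner_join(gdp, per_capita, "Country"), bmi, "Country")
print(merged)
```

Only country A survives here because it is the only one present in all three tables, which is the defining behaviour of an inner join.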

We perform basic pre-processing: we rename the numeric attributes to be descriptive and remove the commas from attribute values before applying the Guess Types operator, which assigns integer and real data types to our attributes so we can process them. Finally, we filter the data down to the six attributes of interest.
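The pre-processing steps can be sketched in Python as follows. This is only an approximation under stated assumptions: guess_value converts each value on its own, whereas Guess Types decides per attribute, and the raw header names, the RENAME and KEEP mappings and the sample row are hypothetical placeholders.

```python
def guess_value(text):
    """Strip thousands separators, then try int, then float; else keep the text."""
    cleaned = text.replace(",", "")
    for cast in (int, float):
        try:
            return cast(cleaned)
        except ValueError:
            pass
    return text


def preprocess(rows, rename, keep):
    """Rename attributes, convert values, and keep only the listed attributes."""
    result = []
    for row in rows:
        renamed = {rename.get(name, name): guess_value(value)
                   for name, value in row.items()}
        result.append({name: renamed[name] for name in keep})
    return result


# Hypothetical raw headers mapped to descriptive attribute names.
RENAME = {"2015": "2015_gdp", "2020": "2020_gdp"}
KEEP = ["Country", "2015_gdp", "2020_gdp"]

# Invented placeholder row; all values arrive as strings from the HTML table.
raw_rows = [{"Country": "A", "2015": "1,200", "2020": "1,500.5", "Rank": "1"}]

clean_rows = preprocess(raw_rows, RENAME, KEEP)
print(clean_rows)
```

Removing the commas first matters because a string such as "1,200" cannot be parsed as a number until the thousands separator is gone.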

A picture is worth a thousand words

The Results view of RapidMiner Studio provides an Advanced Charts module, which is excellent for visualizing our dataset. We drag the attribute 2015_gdp onto the domain dimension (the x-axis) and the attributes 2015_per_capita and 2020_per_capita onto a Numerical axis; these now appear on the left vertical axis. Next, we drag the 2020_gdp attribute onto a new Numerical axis, which makes it appear on the right vertical axis. We use Country as the Color dimension and, yes, you guessed it, we use Obesity as the Size dimension: the higher the obesity percentage, the bigger the plotted symbol.

This multi-series plot provides insights at a glance. The squares show how the GDP of countries compares between 2015 and 2020. The vertical lift between the triangles and the circles shows how the per capita income is projected to increase from 2015 to 2020. Japan’s growth is the highest among the industrialized nations. Assuming obesity levels stay the same, we see that a highly affluent nation like the US has the highest obesity (33.7%), but again Japan provides a counterexample (3.3%). We also see that less affluent nations can have high obesity. Based on these quick data-driven insights, we can now consider other attributes, perhaps related to culture, eating or work habits, to understand the causes of obesity.

Image 3.png Obesity Chart

Conclusion

In this post, you learned how the new ‘Web Table Extraction’ extension helps you conveniently extract data tables from Wiki-like pages. You also learned how originally disparate data can be unified in RapidMiner and displayed as a multi-series visualization using the Advanced Charts module. To try it out yourself, download the extension from the Marketplace and then run the attached process below. Have fun!

References

[1] https://en.wikipedia.org/wiki/BRIC

[2] https://en.wikipedia.org/wiki/List_of_countries_by_Body_Mass_Index_(BMI)

Telcontar120 · March 2017

Edwin, thanks, this is terrific functionality! It is something that I have been looking forward to for a long time in RapidMiner and it will make web mining much easier.

In doing some early testing, I have noticed one problem with the operator, which is that it sometimes fails because of duplicate attribute names. Are you able to modify the code to automatically rename duplicate or blank columns on the back-end to avoid this problem? In some cases, it seems to work ok, but in others it does not. See for example this page, which fails to load the tables: https://www.bullionvault.com/audit.do

Thanks again for this wonderful addition to the RapidMiner toolkit!

ey · March 2017

Hi Telcontar120,

Are you using the latest version? The current version is 0.1.4, which can handle duplicate attribute names.

I would like to add that currently the operator is designed for Wiki-like pages only, although in this case it did retrieve the tables from your provided link.

Cheers,

Ed

screenshot:

Telcontar120 · March 2017

Ed, terrific! I must have been using the prior version, because once I updated via the Marketplace, it worked properly. Thanks for the quick response. Brian

ey · March 2017

Hi Brian,

You're welcome and good to know it helps.

Cheers,

Ed

tatsuho · February 2018

Hi, Edwin,

I have found one problem with this wonderful functionality.

When an HTML table has a header and only one row of data, the Read HTML Table operator fails to extract the table.

I am using the latest version.

I would appreciate it if you could fix it.

Thanks in advance,

Tatsuya

ey · February 2018

Hi Tatsuya,

Thanks for the feedback, but ignoring one-row tables is intentional. The classification model used by this operator was trained on a large amount of HTML table data, and when the overall corpus is considered, a table with a single tuple does not qualify as data-containing, even though this may sound a bit odd. This is the current state of affairs, but we might create exceptions in the future depending on feasibility.

Thanks and Best Regards,

Edwin

Telcontar120 · February 2018

If there is only one row of data then it probably isn't hard to replicate that data manually in RapidMiner anyways, just one operator and not much more work than the "read html table" one!
