The Web Table Extraction Operator
By: Edwin Yaqub, PhD
Within the RapidMiner Research team, I’m developing extensions for data enrichment and extraction as part of my work on the research project DS4DM (Data Search for Data Mining, http://ds4dm.de), with the aim of helping data mining processes produce better results. Today we released the ‘Web Table Extraction’ extension on the Marketplace, and here is an introduction to it.
Problem: Data scientists are often confronted with situations where data must be read from web pages. For instance, Wikipedia offers many data tables that could be put to use, but fine-grained scraping approaches get complicated for ordinary users because they often require regular-expression-based parsing and extraction of data from a web page’s content.
Solution: To ease this task, the ‘Web Table Extraction’ extension offers a convenient alternative: it extracts data tables from Wiki-like websites and converts them to RapidMiner example sets.
You simply provide the URL of a web page to the ‘Read HTML Table’ operator and execute the process. Bingo! In our example, the operator extracted 9 data tables as example sets in the blink of an eye.
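To give a feel for what pulling tables out of a page involves, here is a minimal Python sketch using only the standard library. It is not how the ‘Read HTML Table’ operator is implemented; the names `TableExtractor`, `extract_tables` and `SAMPLE_PAGE` are hypothetical, the sample HTML is made up, and the sketch assumes non-nested tables with explicit closing tags.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects every <table> in a page as a list of rows (lists of cell strings)."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._rows = None   # rows of the table currently being read
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside a cell is kept; everything else is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def extract_tables(html):
    """Return all tables found in the given HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


# Made-up page with two small tables (placeholder values, not real data).
SAMPLE_PAGE = """
<html><body>
<table><tr><th>Country</th><th>Value</th></tr>
<tr><td>A</td><td>1,234</td></tr></table>
<p>Some text between the tables.</p>
<table><tr><th>Country</th><th>Share</th></tr>
<tr><td>A</td><td>12.5</td></tr></table>
</body></html>
"""

tables = extract_tables(SAMPLE_PAGE)
print(len(tables), "tables found")
```

Each extracted table is a plain list of rows, with the header row first, which is roughly the shape an example set takes before attribute names and types are assigned.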
Read HTML Table Results
Example: Now that we have an encyclopedia at our disposal, let us use a simple example. One of the tables on our source page gives the GDP (Gross Domestic Product) values for past years and projections for the future. GDP is a measure of a country’s economic activity. Another table on the same page gives us GDP per capita, which can be interpreted as the productivity of a country’s workforce or as its affluence. I’d like to see how these values change between 2015 and 2020. I’m also curious to see whether affluence relates to obesity levels. For the latter, we can use the BMI data from a second web page.
Thanks to the ‘Read HTML Table’ operator, we now have the tables as example sets. Next, we apply an inner join to the GDP, GDP per capita and BMI tables using the Country attribute. Here is a snapshot of the RapidMiner process (the process file is attached as well):
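For readers who think in code, the effect of chaining two inner joins on Country can be sketched in Python as follows. This is an illustration of the join semantics only, not the RapidMiner Join operator; `inner_join` is a hypothetical helper, the values are placeholders rather than real statistics, and the sketch assumes each country appears at most once in the right-hand table.

```python
def inner_join(left, right, key):
    """Join two lists of dict rows, keeping only rows whose key occurs in both."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Merge the attributes of both rows into one record.
            joined.append({**row, **match})
    return joined


# Placeholder rows, not real statistics.
gdp = [{"Country": "A", "gdp": 100}, {"Country": "B", "gdp": 200}]
per_capita = [
    {"Country": "A", "per_capita": 10},
    {"Country": "B", "per_capita": 20},
    {"Country": "C", "per_capita": 30},
]
bmi = [{"Country": "B", "obesity": 5.0}, {"Country": "C", "obesity": 7.5}]

# Join GDP with GDP per capita, then join the result with the BMI table.
merged = inner_join(inner_join(gdp, per_capita, "Country"), bmi, "Country")
print(merged)
```

Only country B survives here because it is the only one present in all three tables, which is exactly why an inner join can shrink the dataset when the sources do not cover the same countries.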
Extract HTML Table Process
We perform basic pre-processing: we rename the numeric attributes to make them descriptive and remove commas from the attribute values before applying the Guess Types operator, which assigns integer and real data types to our attributes so we can process them. Finally, we select the six attributes of interest.
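The same pre-processing idea can be sketched in Python. This is only an approximation under stated assumptions: `guess_value` and `preprocess` are hypothetical helpers, the input column names `2015`, `2020` and `Rank` are invented for the illustration, and types are guessed per value here, which may differ from how the Guess Types operator decides.

```python
def guess_value(text):
    """Return an int or float if the text parses as a number, else the text unchanged."""
    cleaned = text.replace(",", "").strip()  # drop thousands separators
    for cast in (int, float):
        try:
            return cast(cleaned)
        except ValueError:
            continue
    return text


def preprocess(rows, rename, keep):
    """Rename attributes, convert numeric strings, and keep only the chosen attributes."""
    result = []
    for row in rows:
        renamed = {rename.get(name, name): guess_value(value) for name, value in row.items()}
        result.append({name: renamed[name] for name in keep})
    return result


# Placeholder row, not real statistics.
raw = [{"Country": "A", "2015": "1,234", "2020": "1,500.5", "Rank": "1"}]
clean = preprocess(
    raw,
    rename={"2015": "2015_gdp", "2020": "2020_gdp"},
    keep=["Country", "2015_gdp", "2020_gdp"],
)
print(clean)
```

Removing the commas first matters because a value such as `1,234` would otherwise be treated as text rather than as a number.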
A picture is worth a thousand words
The Results view of RapidMiner Studio provides an Advanced Charts module, which is excellent for visualizing our dataset. We drag the attribute 2015_gdp onto the domain dimension (the x-axis) and drag the attributes 2015_per_capita and 2020_per_capita onto a Numerical axis; these now appear on the left vertical axis. Next, we drag the 2020_gdp attribute onto a new Numerical axis, which makes it appear on the right vertical axis. We use Country as the Color dimension and, yes, you guessed it, Obesity as the Size dimension; hence, the higher the obesity percentage, the bigger the data point.
This multi-series plot provides insights at a glance. The squares show how the GDP of countries compares between 2015 and 2020. The vertical lift between the triangles and the circles shows how the per capita income is projected to increase from 2015 to 2020. Japan’s growth is the highest among the industrialized nations. Assuming obesity levels stay the same, we see that a highly affluent nation like the US has the highest obesity (33.7%), but Japan again provides a counterexample (3.3%). We also see that less affluent nations can have high obesity. Based on these quick data-driven insights, we can now consider other attributes, perhaps related to culture, eating or work habits, to understand the causes of obesity.
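The vertical lift read off the chart can also be computed directly. The Python sketch below assumes rows that carry the 2015_per_capita and 2020_per_capita attributes; `per_capita_lift` is a hypothetical helper and the numbers are placeholders, not the values from the tables used in this post.

```python
def per_capita_lift(rows):
    """Map each country to its absolute and relative per capita change, 2015 to 2020."""
    lifts = {}
    for row in rows:
        base = row["2015_per_capita"]
        lift = row["2020_per_capita"] - base
        lifts[row["Country"]] = {"absolute": lift, "percent": 100 * lift / base}
    return lifts


# Placeholder rows, not real statistics.
rows = [
    {"Country": "A", "2015_per_capita": 100, "2020_per_capita": 125},
    {"Country": "B", "2015_per_capita": 200, "2020_per_capita": 220},
]

lifts = per_capita_lift(rows)
# Country with the largest relative increase.
fastest = max(lifts, key=lambda country: lifts[country]["percent"])
print(fastest, lifts[fastest])
```

Whether the absolute or the relative change is the better measure of growth depends on the question; the chart shows the absolute lift, while the percentage makes countries of different income levels comparable.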
In this post, you learned how the new ‘Web Table Extraction’ extension helps you conveniently extract data tables from Wiki-like pages. You also learned how originally disparate data can be unified in RapidMiner and displayed as a multi-series visualization using the Advanced Charts module. To try it out yourself, download the extension from the Marketplace and then run the attached process below. Have fun!