At RapidMiner Research, we have just released updates to several extensions developed under the DS4DM research project. Here are the highlights of these updates.
Web Table Extraction Extension
The new version is 0.1.6. In this version, the ‘Read HTML Table’ operator can load HTML documents from a local file path in addition to a web URL. This is helpful when dealing with large numbers of HTML files that may have been collected through web crawling. Once the HTML tables have been retrieved and converted into ExampleSets, the operator can also guess the numeric data types of attributes.
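To make the idea of loading from a path or URL and then guessing numeric types concrete, here is a minimal Python sketch. It is not the operator's implementation: the function names (`load_html`, `read_html_table`, `guess_type`), the type labels, and the rule "a column is numeric if every non-empty cell parses as a number" are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def load_html(source):
    """Read HTML from a web URL or a local file path (hypothetical helper)."""
    if source.startswith(("http://", "https://")):
        with urlopen(source) as resp:
            return resp.read().decode("utf-8", errors="replace")
    with open(source, encoding="utf-8", errors="replace") as fh:
        return fh.read()


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> row in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # row being built, or None outside <tr>
        self._cell = None   # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def guess_type(values):
    """Return 'integer', 'real', or 'nominal' for a column of strings.

    Assumed rule: empty cells count as missing, and a column is numeric
    only if every remaining cell parses as a number.
    """
    present = [v for v in values if v != ""]
    if not present:
        return "nominal"
    for name, cast in (("integer", int), ("real", float)):
        try:
            for v in present:
                cast(v)
            return name
        except ValueError:
            continue
    return "nominal"


def read_html_table(html):
    """Map each header name of the first row to a guessed column type."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return {}
    header, *body = parser.rows
    columns = list(zip(*body))
    return {name: guess_type(col) for name, col in zip(header, columns)}
```

Calling `read_html_table(load_html(path_or_url))` would then work the same way for a crawled file on disk and for a live page. A real implementation would also need to handle multiple tables per page, locale-specific number formats, and rows of unequal length.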
Spreadsheet Table Extraction Extension
The new version is 0.2.1. In this version, the following updates are available:
PDF Table Extraction Extension
The new version is 0.1.4. This also adds type guessing to the ‘Read PDF Table’ operator.
Data Search for Data Mining Extension
The new version is 0.1.2. This update includes various enhancements, the most notable of which are in the ‘Translate’ operator. The extension provides a Search-Join mechanism through the joint use of the ‘Data Search’, ‘Translate’ and ‘Fuse’ operators. Translate picks out the tables that have schema and instance matches for the new attribute you want to discover and integrate into your original (query) table. Before fusion is performed, the discovered tables are converted to the schema of the query table. This requires statistical measures of interest to be defined at the cell level and the table level for the new attributes. In this update, we added metrics that define “trust” in the new data, based on the similarity and dissimilarity of the data discovered by the Data Search operator. To this end, the following trust and mistrust measures have been added:
Other metrics include Coverage and Trust (please refer to the earlier post for more details). The figure below shows the distributions of these metrics in the Control Panel view of the Translate operator.
This update paves the way for performing data fusion not just at the data level (using Voting, Clustered Voting, Intersection, etc.) but also at a more advanced meta-data level, for example by optimizing over multiple objectives.
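As a rough illustration of what data-level fusion by voting means, the Python sketch below picks, for each row of the query table, the value proposed most often by the discovered tables. It is a simplified stand-in rather than the ‘Fuse’ operator's actual logic: the function name `fuse_by_voting`, the input layout, and the tie-breaking rule are assumptions.

```python
from collections import Counter


def fuse_by_voting(candidates):
    """Pick the most frequent non-missing value for each row key.

    `candidates` maps a row key of the query table to the list of values
    proposed by the discovered tables; None marks a missing value.
    Assumed tie-break: the value that was proposed first wins.
    """
    fused = {}
    for key, values in candidates.items():
        counts = Counter(v for v in values if v is not None)
        fused[key] = counts.most_common(1)[0][0] if counts else None
    return fused
```

For example, if three tables propose a population for Berlin and two of them agree, the agreed value is kept; a row for which no table offers a value stays missing. Per-table trust metrics could in principle be used as vote weights, but that is a speculative extension and not something this post describes.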
The extensions are developed as part of the “Data Search for Data Mining (DS4DM)” project (website: http://ds4dm.com), which is sponsored by the German Federal Ministry of Education and Research (BMBF).
The Data Search for Data Mining, Release post. Web link: http://community.rapidminer.com/t5/Community-Blog/The-Data-Search-for-Data-Mining-Extension-Release/...