RapidMiner

RM Research
‎07-31-2017 10:30 AM

At RapidMiner Research, we just released updates of multiple extensions developed under the DS4DM research project. Here is a highlight of these updates.


 

Web Table Extraction Extension

 

The new version is 0.1.6. In this version, the ‘Read HTML Table’ operator can load HTML documents from a local file path in addition to a web URL. This is helpful when dealing with large numbers of HTML files that may have been collected through web crawling. Once the HTML tables are retrieved and converted into ExampleSets, the operator can also guess the numeric data type of attributes.
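The extension itself is a RapidMiner operator, but the two steps it performs here, reading a table out of an HTML document (from a path or a string) and guessing numeric attribute types, can be sketched in plain Python. This is an illustration of the idea only, using just the standard library, not the operator's actual implementation:

```python
import os
from html.parser import HTMLParser

class SimpleTableParser(HTMLParser):
    """Collects the rows of an HTML <table> as lists of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self._row.append("".join(self._buf).strip())
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._buf.append(data)

def guess_numeric(column):
    """Type guessing: a column becomes numeric if every non-empty cell parses."""
    try:
        return [float(v) if v else None for v in column]
    except ValueError:
        return column

def read_html_table(source):
    """Accept a local file path or raw HTML, mirroring the operator's new
    ability to read local files in addition to web URLs."""
    if os.path.exists(source):
        with open(source, encoding="utf-8") as f:
            html = f.read()
    else:
        html = source
    parser = SimpleTableParser()
    parser.feed(html)
    header, *data = parser.rows
    columns = [guess_numeric(list(col)) for col in zip(*data)]
    return header, [tuple(row) for row in zip(*columns)]
```

Feeding it a crawled file path or an HTML snippet yields a header plus typed rows; a column of digit strings comes back as floats while mixed text stays nominal.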

 

Spreadsheet Table Extraction Extension

 

The new version is 0.2.1. In this version, the following updates are available:

 

  • The ‘Read Google Spreadsheet’ operator now provides type guessing, so you can retrieve sheets from an online Google Spreadsheet document and directly process the numeric data.
  • The ‘Read Excel Online’ operator is a completely new operator added to the extension. It extends your reach to Excel Online spreadsheets. A dedicated blog post detailing the salient features of this operator is available here: Reading Excel files directly from your company's OneDrive

 

PDF Table Extraction Extension

 

The new version is 0.1.4. This version also adds type guessing to the ‘Read PDF Table’ operator.

 

Data Search for Data Mining Extension

 

The new version is 0.1.2. This update includes various enhancements, the most notable of which are in the ‘Translate’ operator. The extension provides a Search-Join mechanism through the joint usage of the ‘Data Search’, ‘Translate’ and ‘Fuse’ operators. Translate filters out tables that have a schema and instance match for the new attribute you want to discover and integrate into your original (query) table. Before fusion is performed, the discovered tables are converted to the schema of the query table. This requires statistical measures of interest to be defined at the cell level and the table level for the new attributes. In this update, we added metrics for defining “trust” in the new data, using similarity and dissimilarity measures on the data discovered by the Data Search operator. To this end, the following trust and mistrust measures have been added:

  • Levenstein Mistrust: the mean value of the Levenstein cross-distance over all non-empty cell values present in the discovered collection.
  • Jaro Winkler Trust: the mean value of the Jaro Winkler cross-distance over all non-empty cell values present in the discovered collection.
  • Fuzzy Trust: the mean value of the Fuzzy cross-distance over all non-empty cell values present in the discovered collection.
  • Missing Values: the number of empty values in a translated table.
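As a sketch of how such a cross-distance metric is defined (an illustration of the definition above, not the extension's actual code), the mistrust of a discovered column against the query values can be written as the mean pairwise edit distance over all non-empty cells:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def levenshtein_mistrust(query_values, discovered_values):
    """Mean cross-distance between all pairs of non-empty query and
    discovered cell values: the higher, the less the discovered data
    resembles the data you already have. Illustrative sketch only."""
    q = [v for v in query_values if v]
    d = [v for v in discovered_values if v]
    pairs = [(a, b) for a in q for b in d]
    return sum(levenshtein(a, b) for a, b in pairs) / len(pairs)
```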

Other metrics include Coverage and Trust (please refer to the earlier post for more details [1]). The figure below shows the distributions of these metrics in the Control Panel view of the Translate operator.

 

The Control Panel view of the Translate operator shows the list of translated tables and the distribution of statistical metrics such as Coverage, Ratio, Levenstein Mistrust, Jaro Winkler Trust, Fuzzy Trust and Missing Values to be used by the Fuse operator

This update paves the way to performing data fusion not just at the data level (by using Voting, Clustered Voting, Intersection, etc.) but also at an advanced metadata level, such as by optimizing on multiple objectives.
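For instance, data-level fusion by Voting simply picks, per cell, the most frequent non-missing value among the candidates that the translated tables provide. A minimal sketch of that strategy (illustrative only, not the Fuse operator's code):

```python
from collections import Counter

def fuse_by_voting(candidates):
    """Per-cell voting: return the most frequent non-missing candidate
    value, or None if every candidate is missing."""
    votes = Counter(v for v in candidates if v is not None)
    return votes.most_common(1)[0][0] if votes else None
```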

 

Acknowledgments

The extensions are developed as part of the “Data Search for Data Mining (DS4DM)” project (website: http://ds4dm.com), which is sponsored by the German Federal Ministry of Education and Research (BMBF).

 

References

[1] The Data Search for Data Mining Extension Release post: http://community.rapidminer.com/t5/Community-Blog/The-Data-Search-for-Data-Mining-Extension-Release/...

  • DS4DM
  • extensions
  • Research
RM Research
‎07-31-2017 10:12 AM


Nowadays many corporate Excel files tend to be online. This seems quite natural when you consider the wide range of benefits: ease of sharing and collaboration, access from any PC with an Internet connection, user rights management and constant backups, just to name a few. Wouldn't it be great to include data from those files directly in your Data Mining process, even without downloading the whole file? Now you can! Introducing the new Read Excel Online Operator. It is part of the latest update of the Spreadsheet Table Extraction extension (download link).

Example Excel Online Sheet

Above you can see an Excel sheet stored in a OneDrive for Business instance. Its filename is 'Customer Data.xlsx', it contains one sheet called 'Sheet1', and note that the 4th column has not been given a name yet. Now let's access this sheet. Open up RapidMiner and search the Marketplace (found in the Extensions menu) for Spreadsheet Table Extraction. After the installation is complete, you will find a new Operator, called Read Excel Online, in your Operators view. Drag it into your process and fill in the parameters. For the sheet shown above it will look like this:

Example process extracting data from OneDrive

The File Path needed is the relative path in your OneDrive for Business. Unfortunately, Microsoft's API does not yet support private OneDrive accounts. As you can see, my Excel sheet is located directly in the topmost folder of my OneDrive for Business. If I moved it into a folder named 'Data', I would need to provide 'Data/Customer Data.xlsx' as the File Path. Note that no leading slash is used.

The Sheet Name is set by default to Microsoft's current default sheet name. Make sure that you provide the name of the sheet you intend to access. If you don't provide a Cell Range, the whole data-containing range will be used; leading rows and columns that are empty will be skipped. A Cell Range is provided using the A1 notation used in Excel. Selecting all cells from the 2nd column (B) and the 3rd row up to the 8th column (H) and the 10th row would require providing 'B3:H10' as the Cell Range.
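Translating an A1-style Cell Range into row and column indices is a small exercise; the following sketch (illustrative, not the operator's code) shows what 'B3:H10' means in zero-based indices:

```python
import re

def parse_a1_range(cell_range):
    """Convert an Excel A1-style range like 'B3:H10' into zero-based,
    end-inclusive (row_start, row_end, col_start, col_end) indices."""
    def col_index(letters):
        # Column letters are a base-26 number: A=1, ..., Z=26, AA=27, ...
        n = 0
        for ch in letters.upper():
            n = n * 26 + (ord(ch) - ord("A") + 1)
        return n - 1
    m = re.fullmatch(r"([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)", cell_range)
    if m is None:
        raise ValueError(f"not an A1 range: {cell_range!r}")
    c1, r1, c2, r2 = m.groups()
    return int(r1) - 1, int(r2) - 1, col_index(c1), col_index(c2)
```

So 'B3:H10' selects rows 2 through 9 and columns 1 through 7 in zero-based terms, exactly the B-to-H, 3-to-10 block described above.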

Since your OneDrive for Business is an internal resource, you also need to verify that you have access to the Excel sheet. For this you need a so-called authentication token. You can get one by visiting the Microsoft Graph Explorer and logging in with your OneDrive credentials (often equivalent to your Microsoft account, e.g. Office 365). After logging in, copy the URL from the address bar into the Auth Token field and the Operator will extract the token information automatically.
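Extracting a token from a pasted URL can be sketched as below. Note that the parameter name access_token is an assumption on my part (it matches the OAuth implicit flow, where the token travels in the URL fragment); the operator's actual parsing of the Graph Explorer URL may differ:

```python
from urllib.parse import urlparse, parse_qs

def extract_token(url):
    """Look for an OAuth token in the query string or fragment of a pasted
    URL. The parameter name 'access_token' is an assumption (implicit-flow
    style); the operator's real URL parsing may differ."""
    parts = urlparse(url)
    for raw in (parts.query, parts.fragment):
        params = parse_qs(raw)
        if "access_token" in params:
            return params["access_token"][0]
    return None
```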

Executing the process results in the ExampleSet shown below:

Resulting ExampleSet from the extraction process

 

As you can see, all data was extracted successfully. Empty cells were filled with missing values, and the 4th column, which had no name, was given a generic name (here: Attribute 4). And the best part about it: the file was never completely downloaded to your machine. Only the data present in the cells you wanted to access was sent to your machine. So you are now able to extract only the data of interest when, for example, facing a large Excel document with many sheets, without creating lots of temporary versions of the file on your machine.

In the rare case that your Excel sheet only contains data without any column names, check the No column names given option to generate generic attribute names instead of using the first data row as column names.
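Put together, the header handling described above (generic names such as 'Attribute 4' for unnamed columns, missing values for empty cells, and the No column names given switch) can be sketched like this; the function is an illustration of the behavior, not the operator's code:

```python
def finalize_columns(header, rows, no_column_names=False):
    """Sketch of the header handling (illustrative only):
    - no_column_names=True: treat the first row as data and generate
      generic names 'Attribute 1', 'Attribute 2', ...
    - otherwise: use the first row as the header, replacing empty header
      cells with generic names.
    Empty data cells become None (missing values)."""
    width = max(len(r) for r in [header] + rows)
    if no_column_names:
        rows = [header] + rows
        names = [f"Attribute {i + 1}" for i in range(width)]
    else:
        padded_header = header + [""] * (width - len(header))
        names = [h if h else f"Attribute {i + 1}"
                 for i, h in enumerate(padded_header)]
    data = [[c if c != "" else None for c in r] + [None] * (width - len(r))
            for r in rows]
    return names, data
```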

Advanced Integration

If you have documents stored in Microsoft SharePoint sites, you can use the SharePoint Connector extension (download link) to obtain a list of existing files and load them directly into your process. Have a look at this article explaining the usage of the extension, and extract the IDs of the desired files and of the site itself as described there. Using this information, you can provide the IDs via macros to the Read Excel Online Operator to directly access the data from your SharePoint, even without downloading the files. Checking the Read from SharePoint parameter enables you to enter the SharePoint Site ID as well as the SharePoint File ID.

Read from SharePoint option

An example process using both extensions for direct access to SharePoint data in Excel Sheets is given below:

<process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="sharepoint_connector:list_files_sharepoint" compatibility="0.1.000" expanded="true" height="68" name="List SharePoint Files" width="90" x="246" y="238">
        <parameter key="SharePoint Site" value="SharepointTesting"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.5.003" expanded="true" height="103" name="Filter Examples" width="90" x="581" y="238">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="filename.contains.\.xlsx"/>
        </list>
      </operator>
      <operator activated="true" class="extract_macro_from_annotation" compatibility="7.5.003" expanded="true" height="68" name="Extract Macro from Annotation" width="90" x="782" y="238">
        <parameter key="macro" value="siteId"/>
        <parameter key="annotation" value="sharepointSiteId"/>
      </operator>
      <operator activated="true" class="concurrency:loop_values" compatibility="7.5.003" expanded="true" height="82" name="Loop Values" width="90" x="983" y="238">
        <parameter key="attribute" value="sharepointId"/>
        <parameter key="iteration_macro" value="fileId"/>
        <parameter key="enable_parallel_execution" value="false"/>
        <process expanded="true">
          <operator activated="true" class="spreadsheet_table_extraction:read_excel_online" compatibility="0.2.001" expanded="true" height="68" name="Read Excel Online" width="90" x="447" y="85">
            <parameter key="Read from SharePoint" value="true"/>
            <parameter key="SharePoint Site ID" value="%{siteId}"/>
            <parameter key="SharePoint File ID" value="%{fileId}"/>
          </operator>
          <connect from_op="Read Excel Online" from_port="example set" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
          <description align="center" color="orange" colored="true" height="201" resized="true" width="223" x="381" y="29">Obtaining the SharePoint site and file IDs from previously set macros&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;Make sure to check the Sheet Name and to provide a valid token.</description>
        </process>
      </operator>
      <connect from_op="List SharePoint Files" from_port="example set" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Extract Macro from Annotation" to_port="object"/>
      <connect from_op="Extract Macro from Annotation" from_port="object" to_op="Loop Values" to_port="input 1"/>
      <connect from_op="Loop Values" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <description align="left" color="orange" colored="false" height="322" resized="true" width="403" x="100" y="38">#1 Login to https://developer.microsoft.com/en-us/graph/graph-explorer and copy the URL from your browsers address bar into the &amp;quot;Auth Token&amp;quot; field.&lt;br&gt;#2 Go to your SharePoint and choose a site you want to access.&lt;br&gt;#3 Fill in the parameters 'SharePoint URL' with the URL you used to access your SharePoint and fill in the parameter 'SharePoint Site' with the name of the site you want to access.</description>
      <description align="center" color="orange" colored="false" height="322" resized="true" width="219" x="513" y="37">Select only entries containing Excel Sheets (ending with .xlsx).</description>
      <description align="center" color="orange" colored="true" height="321" resized="true" width="185" x="737" y="37">Extract the SharePoint site ID from the annotated ExampleSet and provide it as a macro called 'siteId'</description>
      <description align="center" color="orange" colored="true" height="321" resized="true" width="199" x="928" y="37">Loop over all remaining entries (only the ones being Excel Sheets) to access the SharePoint file IDs from the 'sharepointId' column and providing it via a macro called 'fileId'</description>
    </process>
  </operator>
</process>

 

Happy Mining,

Philipp for the RapidMiner Research Team

 

Acknowledgments

The extensions are developed as part of the “Data Search for Data Mining (DS4DM)” project (website: http://ds4dm.com), which is sponsored by the German Federal Ministry of Education and Research (BMBF).

  • DS4DM
  • extensions
  • Research
RM Research
‎07-31-2017 09:49 AM

Companies and organizations often store and share information via Microsoft SharePoint sites. They are a great way of collecting and sharing information around a given topic. Many sites therefore contain lots of office documents and files in other formats. Integrating this information into a Data Mining process often involves manually searching through sites and folders as well as downloading files by hand. This isn't fast, nor simple. Therefore we created the SharePoint Connector extension to speed things up. You can download it through the RapidMiner Marketplace. It consists of the List SharePoint Files Operator, which creates a list of all available files and folders, and the Download from SharePoint Operator, which downloads files of interest.


 

Below you can see the document section of a SharePoint site created for a little demonstration. This site groups together a project folder and a few documents in varying file formats.

Demo SharePoint Site

The first step in integrating your SharePoint data into your Data Mining process is to find out what the SharePoint URL of your company or organization is. Just have a look at your browser's address bar and extract it along with your site's name. Both are underlined in the picture above. Now enter this information into the List SharePoint Files Operator, which comes with the SharePoint Connector extension, as shown in the picture below.

List SharePoint Files Operator configuration

Since your SharePoint site is an internal resource, you also need to verify that you have access to the information. For this you need a so-called authentication token. You can get one by visiting the Microsoft Graph Explorer and logging in with your SharePoint credentials (often equivalent to your Microsoft account, e.g. Office 365). After logging in, copy the URL from the address bar into the Auth Token field and the Operator will extract the token information automatically.

 

If you now run the process, an ExampleSet is created that contains information about the files stored in the site you accessed. Below you can see the result of scanning my demonstration SharePoint site shown at the beginning of this post. The author and lastModifiedBy columns are redacted for this post.

Result view containing all files and folders found in the site

You get information about the filename, its location within the site (path), a URL for downloading it manually, the author's name, the creation date and time (creationDateTime), the person who modified it last (lastModifiedBy), the date and time of the last change (lastModificationDateTime), a unique sharepointId, and whether the entry is a folder or not. The Operator always scans files at the given folder level. If you need to dig deeper, you can use the information derived above together with the Scan specific folder parameter to search for files and folders in a subfolder.

 

With this information you can, for example, filter out all entries created by a given author or of a desired file format in order to download them. To do so, you can add the Filter Examples operator, or any other Operator, to create a more specific list of the files you want to download. Providing this list to the Download from SharePoint Operator enables you to download all files to the destination defined in the Download Path parameter, or to continue working on them by using the collection of files provided at its output port. An example process using this filtering is shown below and is provided as a tutorial process that comes with the Download from SharePoint Operator.

File download and integration

To continue using the files directly in your process you can, for example, use the Loop Collection Operator to handle each file and use one of RapidMiner's many reading Operators to extract the data into your process. Don't worry, you don't need to provide the Auth Token to the Download from SharePoint Operator again. It is stored alongside the ExampleSet (as an annotation), so you don't need to handle it again. But if you store the ExampleSet in your repository and want to download files later, your token might expire. Hence the Operator offers an option to set a new token. Again, you can just provide the URL obtained after logging into the Microsoft Graph Explorer.
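In plain terms, the filtering step keeps only the non-folder entries that match your criteria. A sketch over a list of dicts whose keys mirror the listing columns described above (illustrative only; in RapidMiner you would use Filter Examples on the ExampleSet):

```python
def filter_files(listing, extension=None, author=None):
    """Keep only non-folder entries matching the given extension and/or
    author. Each entry is a dict with keys mirroring the listing columns
    (filename, author, folder, ...)."""
    matches = []
    for entry in listing:
        if entry.get("folder"):
            continue  # skip folders, we only want downloadable files
        if extension and not entry["filename"].lower().endswith(extension.lower()):
            continue
        if author and entry["author"] != author:
            continue
        matches.append(entry)
    return matches
```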

 

Happy Mining,

Philipp for the RapidMiner Research Team

 

Acknowledgments

The extensions are developed as part of the “Data Search for Data Mining (DS4DM)” project (website: http://ds4dm.com), which is sponsored by the German Federal Ministry of Education and Research (BMBF).

  • DS4DM
  • extensions
  • Research