RapidMiner

RM Research
‎10-17-2017 10:06 AM
741 Views
2 Comments


Dear Community,

 

This is the first of a series of blog posts planned for the new Time Series Extension (Marketplace).

 

We, the RapidMiner Research Team, are pleased to announce a complete rebuild of the already popular Series Extension. The existing Series Extension was received with much excitement, with advanced and cool features such as windowing, FFTs, lagging, and much more. However, we have also received your feedback about how difficult it is to use, and we are therefore replacing the old Series Extension step by step with a completely new product.

 

The alpha version is now available in the Marketplace (Time Series Extension), and updates will be published as additional features are added.

 

The new Time Series Extension will add more functionality, tutorial processes, and improved usability, including the possibility to forecast time series values using a (currently simple) ARIMA model.
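For readers curious what an ARIMA-style forecast does under the hood, here is a toy Python sketch, not the extension's implementation: an AR(1) model, the simplest special case of ARIMA, fitted by ordinary least squares and iterated forward.

```python
# Toy AR(1) forecast: x[t] = c + phi * x[t-1] + noise.
# This only illustrates the idea behind ARIMA forecasting,
# not the model used by the Time Series Extension.

def fit_ar1(series):
    """Estimate c and phi by least squares on consecutive pairs."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    phi = (sum((a - mx) * (b - my) for a, b in zip(x, y))
           / sum((a - mx) ** 2 for a in x))
    c = my - phi * mx
    return c, phi

def forecast(series, steps, c, phi):
    """Iterate the fitted recurrence forward from the last observation."""
    preds, last = [], series[-1]
    for _ in range(steps):
        last = c + phi * last
        preds.append(last)
    return preds

level = [176.3, 176.4, 176.2, 176.5, 176.4, 176.6]  # made-up lake levels
c, phi = fit_ar1(level)
print(forecast(level, 3, c, phi))
```

A real ARIMA model adds differencing and a moving-average component, but the fit-then-iterate structure is the same.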

In addition, the new Time Series Extension will come with a set of Time Series data samples and several template processes to explore.

 

In addition, we are focusing on publishing this extension with even better documentation to help users with time series analysis, including transformations, forecasting, feature extraction and more.

 

Figure 1: Sample process with documentation for the new Time Series Extension. The shown process is one of the template processes delivered by the extension.

Figure 2: Results of the above sample process. The surface level of Lake Huron is shown, as well as two kinds of filters and a forecast using an ARIMA model.

We plan to improve the extension over the next months and gradually include more features. We would also be glad to get your feedback as soon as possible, so feel free to post any bug, usability problem, feature request, or other feedback in the Product Feedback Area of the RapidMiner Community.

 

Stay tuned for the next blog post, with details about the features provided by version 0.1.2 of the Time Series Extension!

 

 

Time Series Extension Blog Posts:

 

01 Release of the Alpha Version 0.1.2

02 Features of Version 0.1.2

  • extensions
  • Research
  • Time Series
RM Research
‎07-31-2017 10:30 AM
185 Views
0 Comments

At RapidMiner Research, we just released updates of multiple extensions developed under the DS4DM research project. Here is a highlight of these updates.

 

Web Table Extraction Extension

 

The new version is 0.1.6. In this version, the ‘Read HTML Table’ operator can load HTML documents from a local file path in addition to a web URL. This is helpful when dealing with large numbers of HTML files, for example those collected through web crawling. Once the HTML tables are retrieved and converted into ExampleSets, the operator can also guess the numeric data type of attributes.

 

Spreadsheet Table Extraction Extension

 

The new version is 0.2.1. In this version, the following updates are available:

 

  • The ‘Read Google Spreadsheet’ operator provides type guessing so you can retrieve sheets from an online Google Spreadsheet document and directly process the numeric data.
  • The ‘Read Excel Online’ operator is a completely new operator added to the extension, extending your reach to Excel Online spreadsheets. A dedicated blog post detailing the salient features of this operator is available here: Reading Excel files directly from your company's OneDrive

 

PDF Table Extraction Extension

 

The new version is 0.1.4. This also adds type guessing to the ‘Read PDF Table’ operator.

 

Data Search for Data Mining Extension

 

The new version is 0.1.2. This update includes various enhancements, the most notable of which are in the ‘Translate’ operator. The extension provides a Search-Join mechanism through the joint usage of the ‘Data Search’, ‘Translate’ and ‘Fuse’ operators. Translate filters the discovered tables down to those with a schema and instance match for the new attribute you want to discover and integrate into your original (query) table. Before fusion is performed, the discovered tables are converted to the schema of the query table. This requires statistical measures of interest to be defined on the cell level and table level for the new attributes. In this update, we added metrics for defining “trust” in the new data by using similarity and dissimilarity for data discovered by the Data Search operator. To this end, the following trust and mistrust measures have been added:

  • Levenstein Mistrust: Mean value of Levenstein cross-distance for each non-empty cell value present in the discovered collection.
  • Jaro Winkler Trust: Mean value of Jaro Winkler cross-distance for each non-empty cell value present in the discovered collection.
  • Fuzzy Trust: Mean value of Fuzzy cross-distance for each non-empty cell value present in the discovered collection.
  • Missing Values: The number of empty values in a translated table.

Other metrics include Coverage and Trust (please refer to the earlier post for more details [1]). The figure below shows the distributions of these metrics on the Control Panel view of the Translate operator.

 

The Control Panel view of the Translate operator shows the list of translated tables and the distribution of statistical metrics such as Coverage, Ratio, Levenstein Mistrust, Jaro Winkler Trust, Fuzzy Trust and Missing Values to be used by the Fuse operator

This update paves the way to perform data fusion not just at the data level (using Voting, Clustered Voting, Intersection, etc.) but also at the metadata level, for example by optimizing on multiple objectives.

 

Acknowledgments

The extensions are developed as part of the “Data Search for Data Mining (DS4DM)” project (website: http://ds4dm.com), which is sponsored by the German Federal Ministry of Education and Research (BMBF).

 

References

[1] The Data Search for Data Mining, Release post, Web-link: http://community.rapidminer.com/t5/Community-Blog/The-Data-Search-for-Data-Mining-Extension-Release/...

  • DS4DM
  • extensions
  • Research
RM Research
‎07-31-2017 10:12 AM
242 Views
0 Comments

Nowadays many corporate Excel files tend to be online. This seems quite natural when you consider the wide range of benefits: ease of sharing and collaboration, access from any PC with an Internet connection, user rights management, and constant backups, just to name a few. Wouldn't it be great to include data from those files directly in your data mining process, without even downloading the whole file? Now you can! Introducing the new Read Excel Online operator. It is part of the latest update of the Spreadsheet Table Extraction extension (download link).

Example Excel Online Sheet

Above you can see an Excel sheet stored in a OneDrive for Business instance. Its filename is 'Customer Data.xlsx', it contains one sheet called 'Sheet1', and note that the 4th column has not been given a name yet. Now let's access this sheet. Open up RapidMiner and search the Marketplace (found in the Extensions menu) for Spreadsheet Table Extraction. After the installation is complete, you will find a new operator called Read Excel Online in your Operators view. Drag it into your process and fill in the parameters. For the sheet shown above it will look like this:

Example process extracting data from OneDrive

The File Path needed is the relative path in your OneDrive for Business. Unfortunately, Microsoft's API does not support private OneDrive accounts yet. As you can see, my Excel sheet is located directly in the topmost folder of my OneDrive for Business. If I moved it into a folder named 'Data', I would need to provide 'Data/Customer Data.xlsx' as the File Path. Note that no leading slash is used.

The Sheet Name defaults to Microsoft's current default sheet name; make sure that you provide the name of the sheet you intend to access. If you don't provide a Cell Range, the whole data-containing range will be used, and leading rows and columns that are empty will be skipped. A Cell Range is provided using the A1 notation used in Excel. Selecting all cells from the 2nd column (B) and the 3rd row up to the 8th column (H) and the 10th row would require providing 'B3:H10' as the Cell Range.
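To make the notation concrete, here is a small Python sketch of how an A1-style range can be converted into zero-based row and column indices. The helper is purely illustrative and is not part of the extension:

```python
import re

def a1_cell(cell):
    """Convert a cell like 'B3' to a zero-based (row, column) pair."""
    letters, digits = re.fullmatch(r"([A-Z]+)(\d+)", cell).groups()
    col = 0
    for ch in letters:              # base 26 with A=1, so 'B'=2, 'AA'=27
        col = col * 26 + (ord(ch) - ord("A") + 1)
    return int(digits) - 1, col - 1

def a1_range(rng):
    """Convert 'B3:H10' to ((start_row, start_col), (end_row, end_col))."""
    start, end = rng.split(":")
    return a1_cell(start), a1_cell(end)

print(a1_range("B3:H10"))  # ((2, 1), (9, 7))
```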

Since your OneDrive for Business is an internal resource, you also need to verify that you have access to the Excel sheet. For this you need a so-called authentication token. You can get one by visiting the Microsoft Graph Explorer and logging in with your OneDrive credentials (often equivalent to your Microsoft account, e.g. Office 365). After logging in, copy the URL from the address bar into the Auth Token field and the operator will extract the token information automatically.

Executing the process results in the ExampleSet shown below:

Resulting ExampleSet from the extraction process

 

As you can see, all data was extracted successfully. Empty cells were filled with missing values, and the 4th column, which had no name, was given a generic name (here: Attribute 4). And the best part: the file was never completely downloaded to your machine. Only the data present in the cells you wanted to access was sent to your machine. So you are now able to extract only the data of interest, for example when facing a large Excel document with many sheets, without creating lots of temporary copies of the file on your machine.

In the rare case that your Excel sheet contains only data without any column names, check the No column names given option to generate generic attribute names instead of using the first data row as column names.

Advanced Integration

If you have documents stored in Microsoft SharePoint sites, you can use the SharePoint Connector extension (download link) to obtain a list of existing files and load them directly into your process. Have a look at this article explaining the usage of the extension, and extract the IDs of the desired files and of the site itself as described there. Using this information, you can pass the IDs via macros to the Read Excel Online operator to access the data from your SharePoint directly, again without downloading the files. Checking the Read from SharePoint parameter enables you to enter the SharePoint Site ID as well as the SharePoint File ID.

Read from SharePoint option

An example process using both extensions for direct access to SharePoint data in Excel Sheets is given below:

<process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="sharepoint_connector:list_files_sharepoint" compatibility="0.1.000" expanded="true" height="68" name="List SharePoint Files" width="90" x="246" y="238">
        <parameter key="SharePoint Site" value="SharepointTesting"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.5.003" expanded="true" height="103" name="Filter Examples" width="90" x="581" y="238">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="filename.contains.\.xlsx"/>
        </list>
      </operator>
      <operator activated="true" class="extract_macro_from_annotation" compatibility="7.5.003" expanded="true" height="68" name="Extract Macro from Annotation" width="90" x="782" y="238">
        <parameter key="macro" value="siteId"/>
        <parameter key="annotation" value="sharepointSiteId"/>
      </operator>
      <operator activated="true" class="concurrency:loop_values" compatibility="7.5.003" expanded="true" height="82" name="Loop Values" width="90" x="983" y="238">
        <parameter key="attribute" value="sharepointId"/>
        <parameter key="iteration_macro" value="fileId"/>
        <parameter key="enable_parallel_execution" value="false"/>
        <process expanded="true">
          <operator activated="true" class="spreadsheet_table_extraction:read_excel_online" compatibility="0.2.001" expanded="true" height="68" name="Read Excel Online" width="90" x="447" y="85">
            <parameter key="Read from SharePoint" value="true"/>
            <parameter key="SharePoint Site ID" value="%{siteId}"/>
            <parameter key="SharePoint File ID" value="%{fileId}"/>
          </operator>
          <connect from_op="Read Excel Online" from_port="example set" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
          <description align="center" color="orange" colored="true" height="201" resized="true" width="223" x="381" y="29">Obtaining the SharePoint site and file IDs from previously set macros&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;Make sure to check the Sheet Name and to provide a valid token.</description>
        </process>
      </operator>
      <connect from_op="List SharePoint Files" from_port="example set" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Extract Macro from Annotation" to_port="object"/>
      <connect from_op="Extract Macro from Annotation" from_port="object" to_op="Loop Values" to_port="input 1"/>
      <connect from_op="Loop Values" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <description align="left" color="orange" colored="false" height="322" resized="true" width="403" x="100" y="38">#1 Login to https://developer.microsoft.com/en-us/graph/graph-explorer and copy the URL from your browsers address bar into the &amp;quot;Auth Token&amp;quot; field.&lt;br&gt;#2 Go to your SharePoint and choose a site you want to access.&lt;br&gt;#3 Fill in the parameters 'SharePoint URL' with the URL you used to access your SharePoint and fill in the parameter 'SharePoint Site' with the name of the site you want to access.</description>
      <description align="center" color="orange" colored="false" height="322" resized="true" width="219" x="513" y="37">Select only entries containing Excel Sheets (ending with .xlsx).</description>
      <description align="center" color="orange" colored="true" height="321" resized="true" width="185" x="737" y="37">Extract the SharePoint site ID from the annotated ExampleSet and provide it as a macro called 'siteId'</description>
      <description align="center" color="orange" colored="true" height="321" resized="true" width="199" x="928" y="37">Loop over all remaining entries (only the ones being Excel Sheets) to access the SharePoint file IDs from the 'sharepointId' column and providing it via a macro called 'fileId'</description>
    </process>
  </operator>
</process>

 

Happy Mining,

Philipp for the RapidMiner Research Team

 

Acknowledgments

The extensions are developed as part of the “Data Search for Data Mining (DS4DM)” project (website: http://ds4dm.com), which is sponsored by the German Federal Ministry of Education and Research (BMBF).

  • DS4DM
  • extensions
  • Research
RM Research
‎07-31-2017 09:49 AM
294 Views
1 Comment


Companies and organizations often store and share information via Microsoft SharePoint sites. They are a great way of collecting and sharing information around a given topic. Many sites therefore contain lots of office documents and files in other formats. Integrating this information into a data mining process often involves manually searching through sites and folders as well as downloading files by hand. This isn't fast or simple. Therefore we created the SharePoint Connector extension to speed things up. You can download it through the RapidMiner Marketplace. It consists of the List SharePoint Files operator, which creates a list of all available files and folders, and the Download from SharePoint operator, which downloads the files of interest.

 

Below you can see the document section of a SharePoint site created for a little demonstration. This site groups together a project folder and a few documents of varying file formats.

Demo SharePoint Site

The first step for integrating your SharePoint data into your data mining process is to find out what your company's or organization's SharePoint URL is. Just have a look at your browser's address bar and extract it along with your site's name; both are underlined in the picture above. Now enter this information into the List SharePoint Files operator that comes with the SharePoint Connector extension, as shown in the picture below.

List SharePoint Files Operator configuration

Since your SharePoint site is an internal resource, you also need to verify that you have access to the information. For this you need a so-called authentication token. You can get one by visiting the Microsoft Graph Explorer and logging in with your SharePoint credentials (often equivalent to your Microsoft account, e.g. Office 365). After logging in, copy the URL from the address bar into the Auth Token field and the operator will extract the token information automatically.

 

If you now run the process, an ExampleSet is created that contains information about the files stored in the site you accessed. Below you can see the result of scanning the demonstration SharePoint site shown at the beginning of this post. The author and lastModifiedBy columns are redacted for this post.

Result view containing all files and folders found in the site

You gain information about the filename, its location within the site (path), a URL for downloading it manually, the author's name, the creation date and time (creationDateTime), the person who last modified it (lastModifiedBy), the date and time of the last change (lastModificationDateTime), a unique sharepointId, and whether the entry is a folder or not. The operator always scans files at the given folder level. If you need to dig deeper, you can use the information derived above together with the Scan specific folder parameter to search for files and folders in a subfolder.

 

With this information you can, for example, filter out all entries created by a given author or of a desired file format in order to download them. To do so, add the Filter Examples operator (or any other operator) to create a more specific list of the files you want to download. Providing this list to the Download from SharePoint operator enables you to download all files to the destination defined in the Download Path parameter, or to continue working on them by using the collection of files provided at its output port. An example process using this filtering is shown below and is provided as a tutorial process that comes with the Download from SharePoint operator.

File download and integration

To continue using the files directly in your process, you can for example use the Loop Collection operator to handle each file and use one of RapidMiner's many reading operators to extract the data into your process. Don't worry, you don't need to provide the Auth Token to the Download from SharePoint operator again: it is stored alongside the ExampleSet (as an annotation), so you don't need to handle it again. But if you store the ExampleSet in your repository and want to download files later, your token might expire. Hence the operator offers an option to set a new token. Again, you can just provide the URL obtained after logging into the Microsoft Graph Explorer.

 

Happy Mining,

Philipp for the RapidMiner Research Team

 

Acknowledgments

The extensions are developed as part of the “Data Search for Data Mining (DS4DM)” project (website: http://ds4dm.com), which is sponsored by the German Federal Ministry of Education and Research (BMBF).

  • DS4DM
  • extensions
  • Research
RM Research
‎05-26-2017 03:26 AM
370 Views
0 Comments


We are currently working on some new data integration and enrichment operators to aid your data mining journey. Therefore, we are running a case study for testing our latest findings. Within this study you are given a short introduction with some guiding material to help you test some new RapidMiner operators. That’s it! Just test the operators in your current environment and tell us your findings and ideas.

 

Objectives to test

 

If you are interested, contact us via research@rapidminer.com and we will get in touch with you. You are asked to hand in your observations within four weeks.

 

Background

Finding the right data is a crucial step in every data mining project. Data is often distributed across different places and obtaining it might be difficult. Hence integrating various formats and sources is key. In the research project ‘Data Search for Data Mining’ we are investigating new ways of aiding this process by making data from previously unavailable sources easily available within RapidMiner. Possible sources are for example tables stored in PDFs or Google Spreadsheets.

And what about those data sets you already have but are unaware of? For that, we’re working together with the University of Mannheim to enrich data sets with existing data from internet and intranet sources in a (semi-)automatic way.

 

Happy Mining,

Philipp for the RapidMiner Research Team

  • Case Study
  • extensions
  • Research
RM Certified Expert
‎04-17-2017 08:08 AM
264 Views
0 Comments


One of the cool things about RapidMiner is the extension ecosystem. The default installation of RapidMiner Studio has a complete suite to do 90% of the ETL, modeling, and testing that you need to do on a daily basis. Sometimes you'll need that extra 10% to do something special, like text mining!
 
This is where extensions come in. Extensions are just that: they extend the capabilities of RapidMiner Studio and Server in some way. Want to do time series? Then download the Series extension. Want to do R or Python scripting? Then download either the R or Python Scripting extension.
 
There are close to 100 different extensions available at the RapidMiner Marketplace. Each one is easy to install from the RapidMiner Studio interface. Look for the "Extensions" pull down menu.
 
RapidMiner Supported
 
There are many extensions on the Marketplace. Some are made by RapidMiner, and others by third parties. RapidMiner supports (i.e., fixes problems with) a few of these extensions. To find out which, look for the blue check mark and the word "supported" next to the extension name. We'll fix bugs and make updates to those extensions.
 
Blue check mark means supported!

Then there are extensions without a blue check mark. Those extensions were built by third parties such as Basis Tech or Old World Computing, and are usually supported by those third parties.
 
I bring this up because sometimes Community members have an issue with a 3rd party extension. The best way to try and resolve this issue is to make a post in the Community and ping the 3rd party developer. This way both the Community and the 3rd party developer can learn about the problem and try to resolve it quickly.
 
  • extensions
  • Marketplace
RM Certified Expert
‎04-24-2017 09:02 AM
512 Views
0 Comments


By: Fabian Temme, PhD

 

A few weeks ago the RapidMiner Research Team published two new extensions to the Marketplace that are making a splash: the Operator Toolbox and Converters! We didn't stop there! Today I'm happy to announce the release of version 0.2.0 for both extensions! Here's a quick preview of the new enhancements you'll find!

 

New in the Operator Toolbox!

 

Introducing the Get Decision Tree Path.

 

Do you want to know why a Decision Tree classifies a specific Example the way it does? With this Operator, you can find out. It works rather similarly to the ‘Apply Model’ Operator: it takes a trained Decision Tree and an ExampleSet at its input port. But instead of calculating confidences for the Examples, the Operator creates a new Attribute holding the path the corresponding Example takes through the Decision Tree.
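To illustrate the idea, here is a small Python sketch that traces the path of an example through a nominal decision tree. The tree structure and attribute values below are invented for illustration (loosely modeled on the Golf data) and do not reflect RapidMiner's internal representation:

```python
# A node is {"attribute": name, "children": {value: subtree}}; a leaf is a label.
tree = {
    "attribute": "Outlook",
    "children": {
        "overcast": "yes",
        "sunny": {"attribute": "Humidity",
                  "children": {"high": "no", "normal": "yes"}},
        "rain": {"attribute": "Wind",
                 "children": {"true": "no", "false": "yes"}},
    },
}

def tree_path(node, example):
    """Return the decision path, e.g. 'Outlook = sunny -> Humidity = high -> no'."""
    steps = []
    while isinstance(node, dict):               # descend until we hit a leaf
        attr = node["attribute"]
        value = example[attr]
        steps.append(f"{attr} = {value}")
        node = node["children"][value]
    steps.append(node)                          # the leaf label
    return " -> ".join(steps)

print(tree_path(tree, {"Outlook": "sunny", "Humidity": "high"}))
```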

 

This example process applies the Operator on a Decision Tree trained on the Golf data sample.

Get Decision Tree Path Process

 Once the process executes, here are the results:

 

Get Decision Tree Path Result

 

 Introducing the Generate Date Series Operator

 

 

Do you need some date-time series data covering a specific time range and interval? Now you can get that from the new Generate Date Series Operator. You can specify the start and end date and the interval, from years down to milliseconds!
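Conceptually, the Operator generates something like the following Python sketch does (assumed behavior, not the extension's code): all timestamps from a start to an end date, stepped by a fixed interval.

```python
from datetime import datetime, timedelta

def date_series(start, end, step):
    """Return all datetimes from start to end (inclusive) in steps of `step`."""
    out, current = [], start
    while current <= end:
        out.append(current)
        current += step
    return out

# A daily series for the (leap) year 2012 has 366 entries.
daily_2012 = date_series(datetime(2012, 1, 1), datetime(2012, 12, 31),
                         timedelta(days=1))
print(len(daily_2012))  # 366
```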

 

Check out the statistics overview for a daily date series for the year 2012, created with the new Operator:

Get Date Series Operator Results

 

Introducing the Get Parameters Operator

 

Sometimes you want to retrieve all Parameters of an Operator in your process, for example to store all Parameters of the model trained inside an Optimize Operator, not only the ones that were optimized.

 

The new Get Parameters Operator enables you to do so. You specify the name of the Operator whose Parameters you want to extract and the Operator creates a new Parameter Set containing all Parameters of the specified Operator.

 

Here is an example of the Parameters of a Decision Tree Operator:

 

Parameter Set Operator results

 

The Parameter Set can now be stored in the Repository, written to a file with the Write Parameters Operator, or used by the Set Parameters Operator. You can even convert it into an ExampleSet using the new Parameter Set to ExampleSet Operator in version 0.2.0 of the Converters extension.

 

New in the Converters! 

 

Introducing the Parameter Set to ExampleSet Operator

 

This new Operator converts a given Parameter Set (coming from an Optimize Operator or for example from the Get Parameters Operator in the Operator Toolbox Extension) into an Example Set.

 

The resulting ExampleSet contains one Attribute for each Parameter in the Parameter Set. You can even let the Operator try to estimate the type of each Parameter (Integer, Real, Nominal).
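A plausible guessing rule, sketched in Python (this is an assumption about the general approach, not the Operator's actual code), tries the narrowest type first:

```python
def guess_type(value):
    """Classify a parameter's string value as Integer, Real, or Nominal."""
    try:
        int(value)
        return "Integer"
    except ValueError:
        pass
    try:
        float(value)
        return "Real"
    except ValueError:
        return "Nominal"

print([guess_type(v) for v in ["20", "0.25", "gain_ratio"]])
```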

 

This is the result of the converted Parameter Set of an Optimized Decision Tree:

 

Parameter to Exampleset Operator

 

And finally...

 

Introducing the Normalization to ExampleSet Operator

 

Have you ever wanted to get a Normalization model as an ExampleSet? Now you can! This Operator takes the preprocessing model and creates an ExampleSet with the corresponding Attributes from it.

 

Here are the results for each of the four Normalization methods (Z-transformation, range transformation, proportion transformation and interquartile range, respectively) on the Golf dataset:

 

Normalization to Exampleset Operator
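For reference, the four methods can be sketched with their standard textbook formulas (assumed here; the Operator's exact conventions, e.g. for quartile interpolation, may differ):

```python
def z_transform(xs):
    """(x - mean) / sample standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / (n - 1)) ** 0.5
    return [(x - mean) / std for x in xs]

def range_transform(xs, lo=0.0, hi=1.0):
    """Rescale linearly so min(xs) -> lo and max(xs) -> hi."""
    mn, mx = min(xs), max(xs)
    return [lo + (x - mn) * (hi - lo) / (mx - mn) for x in xs]

def proportion_transform(xs):
    """Each value as its share of the total."""
    total = sum(xs)
    return [x / total for x in xs]

def iqr_transform(xs):
    """(x - median) / interquartile range, with crude quartile positions."""
    s = sorted(xs)
    n = len(s)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    q1, q3 = s[n // 4], s[(3 * n) // 4]
    return [(x - median) / (q3 - q1) for x in xs]
```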

Thanks to everyone in the community for downloading our Operator Toolbox and Converters extensions. We welcome any comments or questions you have; just post them in the community forums! If you have any ideas for new operators for the Toolbox or Converters, just visit our Product Ideas forum and post your wish-list items there.

  • Converters
  • Decision Tree
  • extensions
  • parameters
  • Toolbox
RM Certified Expert
‎02-27-2017 09:32 AM
1263 Views
3 Comments


By: Martin Schmitz, PhD

 

As RapidMiner users, we are used to one-operator solutions. Want to add a PCA? Add the operator. Want to do an ensemble? Add the operator. Over time the RapidMiner ecosystem has evolved so that most tasks are easy to handle like this. However, doing data science every day, I have experienced a few things for which RapidMiner has no one-operator solution. How do we solve that?

 

In this case you can use the scripting interfaces, build a building block, or write your own extension. The extension might be the slowest way, but it has the clear benefit of making your results easily usable by others. Recently, I've joined forces with the RapidMiner Research Team, and we want to share our tools with you, the community. The result is two new extensions packed with new tools to make your life easier.

 

I am happy to introduce the Operator Toolbox and the Converter Extensions!

 

Generate Levenshtein Distance
In text analytics you often face the problem of misspelled words. One of the most common ways to find misspelled words is to use a distance between two words. The most frequently used distance measure is the Levenshtein distance, defined as the minimum number of single-character edits needed to transform one string into another.


Levenshtein.png
This can be used to generate a replacement dictionary.
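As a refresher, the standard dynamic-programming formulation of the Levenshtein distance looks like this (generic textbook code, not the Operator's implementation):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))              # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```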

 

Generate Phonetic Encoding


During text processing you might encounter the problem that words are spelled differently but pronounced the same way. Often you want to map these words to the same string. A good example is names like Jennie, Jenny and Jenni. Algorithms that perform this kind of encoding are called phonetic encoders. Scott Genzer posted a building block on our Community Portal to generate the Daitch-Mokotoff Soundex encoding. Driven by this, we created an operator that can use various algorithms to do this kind of encoding.

 

Phonetic - 1.png

 

A typical result is depicted above. The current version of the operator supports a broad range of algorithms, namely: BeiderMorse, Caverphone2, Cologne Phonetic, Double Metaphone, Metaphone, NYSIIS, Refined Soundex, and Soundex.
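To give a flavor of how phonetic encoders work, here is a compact sketch of classic American Soundex, one of the simpler algorithms in that list (standard textbook formulation, not the extension's code):

```python
# Map consonants to digit groups; vowels, h, w and y get no code.
CODES = {c: str(d) for d, group in
         enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], 1)
         for c in group}

def soundex(name):
    """Return the 4-character American Soundex code, e.g. 'Robert' -> 'R163'."""
    name = name.lower()
    prev = CODES.get(name[0], "")
    digits = []
    for ch in name[1:]:
        if ch in "hw":                 # h and w do not separate equal codes
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code                    # vowels reset the previous code
    return (name[0].upper() + "".join(digits) + "000")[:4]

print(soundex("Jennie"), soundex("Jenny"), soundex("Jenni"))  # J500 J500 J500
```

All three spellings of the name from the example above map to the same code, which is exactly what makes such encodings useful for matching.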

 

Tukey Test


"When is a value an outlier?" is one of the most frequently asked questions in anomaly detection. No matter whether you do univariate outlier detection on single attributes or use RapidMiner's Anomaly Detection extension to generate a multivariate score, you still need to define a threshold. A common technique for this is the Tukey test (or criterion). It results in an outlier flag as well as a confidence for each Example, and it can be applied to several Attributes at a time.

 

Tukey.png
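The classic Tukey criterion flags every value outside the interval [Q1 - 1.5·IQR, Q3 + 1.5·IQR]. Here is a minimal sketch (assuming the standard fences and a crude quartile rule; the Operator's exact computation may differ):

```python
def tukey_outliers(xs, k=1.5):
    """Flag values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    s = sorted(xs)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]   # crude quartile positions
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x < lower or x > upper for x in xs]

print(tukey_outliers([10, 12, 11, 13, 12, 11, 95]))  # only 95 is flagged
```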

 

Group Into Collection
This operator enables you to split an ExampleSet into several ExampleSets using a group-by. The result is a collection of ExampleSets. This can be used in combination with a Loop Collection to apply arbitrary functions per group. A possible example would be to find the last 3 transactions for each customer in transactional data.
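In generic Python the same idea looks like this (illustrative data; not the Operator itself): split the rows into one group per key, then process each group, here taking the last 3 transactions per customer:

```python
from collections import defaultdict

def group_into_collection(rows, key):
    """Split a list of dict rows into one list per distinct key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)

transactions = [
    {"customer": "A", "amount": 10}, {"customer": "B", "amount": 5},
    {"customer": "A", "amount": 20}, {"customer": "A", "amount": 30},
    {"customer": "A", "amount": 40}, {"customer": "B", "amount": 7},
]
last3 = {cust: rows[-3:] for cust, rows in
         group_into_collection(transactions, "customer").items()}
print([r["amount"] for r in last3["A"]])  # [20, 30, 40]
```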

 

 

 Loop Collection_cropped.png

 

  

 

Get Last Modifying Operator

If you dive a bit deeper into modeling, you might want to try different feature selection techniques and treat the choice as a parameter of your modeling process. This can be achieved using a Select Subprocess inside an Optimize Parameters Operator. To figure out which feature selection technique won, you would normally need to add at least one additional operator per method. To overcome this, this Operator makes it possible to extract the last modifying Operator for every object. This way you can more easily annotate which feature selection technique was the best.

 

Extracting PCA, Association Rules, ROC

 

The Converters extension lets you do a lot of things that our users have asked for. Want to extract those Association Rules? You can do that now. Want to extract PCA results into an ExampleSet table? You can do that now. Just check out the extension on the Marketplace to see all the neat things you can do.

  • extensions