By: Martin Schmitz, PhD
As RapidMiner users, we are used to one-operator solutions. Want to add a PCA? Add the operator. Want to build an ensemble? Add the operator. Over time, the RapidMiner ecosystem has evolved so that most tasks can be handled this way. However, doing data science every day, I have run into a few tasks for which RapidMiner has no one-operator solution. How do we solve that?
In these cases you can use the scripting interfaces, build a building block, or write your own extension. Writing an extension might be the slowest route, but it has the clear benefit of making your results easily usable by others. Recently, I joined forces with the RapidMiner Research Team, and we want to share our tools with you, the community. The result is two new extensions packed with tools that make your life easier.
Generate Levenshtein Distance
In text analytics you often face the problem of misspelled words. One of the most common ways to find them is to compute a distance between two words. The most frequently used measure is the Levenshtein distance, defined as the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into another.
This can be used to generate a replacement dictionary.
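To make the definition concrete, here is a minimal Python sketch of the classic dynamic-programming computation. It only illustrates the distance itself, not the operator's implementation; the function name `levenshtein` is chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to save memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

For a replacement dictionary, one plausible approach is to map each unknown word to the known word with the smallest distance, provided that distance stays below a threshold you pick.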
Generate Phonetic Encoding
During text processing you might encounter words that are spelled differently but pronounced the same way. Often you want to map these words to the same string. Good examples are names like Jennie, Jenny and Jenni. Algorithms that produce this kind of encoding are called phonetic encoders. Scott Genzer posted a building block on our Community Portal to generate the Daitch-Mokotoff Soundex encoding. Inspired by this, we created an operator that can apply various algorithms to do this kind of encoding.
A typical result is depicted above. The current version of the operator supports a broad range of algorithms, namely: BeiderMorse, Caverphone2, Cologne Phonetic, Double Metaphone, Metaphone, NYSIIS, Refined Soundex, and Soundex.
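To illustrate what a phonetic encoder does, here is a small Python sketch of the plain American Soundex algorithm, the simplest member of the list above. It is written from the textbook rules and is not the code the operator uses; the function name `soundex` is chosen for this example.

```python
def soundex(word: str) -> str:
    """Simplified American Soundex: first letter plus three digits."""
    # Consonant groups that share a digit; vowels, H, W and Y get no digit.
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for letter in letters:
            codes[letter] = digit

    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""

    result = [letters[0]]
    previous_code = codes.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        code = codes.get(letter, "")
        if code and code != previous_code:
            result.append(code)
        previous_code = code  # a vowel resets the code, so repeats count again
    # Pad with zeros and cut to the usual four characters.
    return ("".join(result) + "000")[:4]


for name in ("Jennie", "Jenny", "Jenni"):
    print(name, soundex(name))  # each name is encoded as J500
```

All three spellings collapse to the same code, which is exactly the mapping described above. The other algorithms follow the same idea with more elaborate, often language-specific, rules.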
"When is a value an outlier?" is one of the most frequently asked questions in anomaly detection. Whether you do univariate outlier detection on single attributes or use RapidMiner's Anomaly Detection extension to generate a multivariate score, you still need to define a threshold. A common technique for this is the Tukey Test (or criterion). It yields an outlier flag as well as a confidence for each example. It can also be applied to several attributes at a time.
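As a rough illustration of the criterion, the Python sketch below flags values that lie outside Tukey's fences, i.e. more than k times the interquartile range below the first quartile or above the third. The factor k = 1.5 is the conventional choice, and the quartile method used here is an assumption; the operator may compute quartiles differently, and its confidence value is not reproduced. The function name `tukey_outliers` is made up for this example.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return one boolean per value: True if it lies outside Tukey's fences."""
    # Quartiles via linear interpolation over the data (assumed method).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v < lower_fence or v > upper_fence for v in values]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
flags = tukey_outliers(data)
print([v for v, is_outlier in zip(data, flags) if is_outlier])  # [100]
```

Applying the same function to each attribute in turn gives a per-attribute flag, which mirrors using the test on several attributes at once.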
Group Into Collection
This operator lets you split an ExampleSet into several ExampleSets using a group-by. The result is a collection of ExampleSets. Combined with a Loop Collection, this allows you to apply arbitrary functions per group. A possible example would be to find the last 3 transactions for each customer in transactional data.
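The same split-then-loop pattern can be sketched in plain Python. This is only an analogy to the RapidMiner process, with rows represented as dictionaries; the helper names `group_into_collection` and `last_n` and the sample data are invented for this example.

```python
from collections import defaultdict


def group_into_collection(rows, key):
    """Split a list of rows into one list per distinct value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return list(groups.values())


def last_n(rows, n, order_by):
    """Return the n rows with the largest `order_by` value, in ascending order."""
    return sorted(rows, key=lambda row: row[order_by])[-n:]


# Hypothetical transactional data.
transactions = [
    {"customer": "A", "day": 1, "amount": 10.0},
    {"customer": "B", "day": 1, "amount": 5.0},
    {"customer": "A", "day": 2, "amount": 20.0},
    {"customer": "A", "day": 3, "amount": 30.0},
    {"customer": "A", "day": 4, "amount": 40.0},
    {"customer": "B", "day": 2, "amount": 15.0},
]

# The group step plays the role of Group Into Collection,
# the list comprehension plays the role of Loop Collection.
collection = group_into_collection(transactions, "customer")
last_three = [last_n(group, 3, "day") for group in collection]
for group in last_three:
    print(group[0]["customer"], [row["day"] for row in group])
```

Because the per-group function is an ordinary callable, it can be swapped for any other computation without touching the grouping step.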
Get Last Modifying Operator
If you dive a bit deeper into modelling, you might want to try different feature selection techniques and treat the choice as a parameter of your modelling process. This can be achieved using a Select Subprocess inside an Optimize Parameters operator. To figure out which feature selection technique has won, you would need to add at least one additional operator per method. To avoid this, you can extract the last modifying operator for any object. This makes it easier to annotate which feature selection technique was the best.
Extracting PCA, Association Rules, ROC
The Converter extension lets you do a lot of things that our users have asked for. Want to extract those Association Rules? You can do that now. Want to extract PCA results into an ExampleSet table? You can do that now. Just check out the extension on the Marketplace to see all the neat things you can do now.