Version 0.8 of the Operator Toolbox Extension: Replace Rare Values, Merge, Parametric Probability
We are glad that we could prepare a small (Pre-) Christmas present for all of you and release the new version 0.8.0 of the Operator Toolbox Extension. It brings some new operators as well as improvements to existing ones, all meant to make your real data science projects even more "fast and simple."
Without further introductions, these are the changes which come with the new version:
New Replace Rare Values operator
This new Blending Operator enables you to automatically replace rare values of nominal attributes. This can be helpful for attributes with many different strings, some of them frequent and some infrequent. The infrequent strings are often difficult for a machine learning algorithm to handle because no general rule can be learned from them. So our new Replace Rare Values operator replaces such infrequent strings with a generic, configurable string. Rare values can be defined by an absolute or relative threshold based on the number of occurrences in the data.
Figure 1 demonstrates the effect on a randomly generated direct mailing ExampleSet. The Replace Rare Values operator is applied to the name and zip code attributes and replaces all values which occur 10 times or fewer with the string "Other."
Figure 1: sample process showing the application of the Replace Rare Values operator on randomly-generated direct mailing data
The operator also provides a pre-processing model which can be used to apply the replacement to unseen data. In addition, this model can be grouped together with other pre-processing and/or machine learning models using the Group Models operator.
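The idea behind the threshold and the pre-processing model can be sketched in a few lines of Python. This is only an illustration and not the operator's implementation: the class `RareValueReplacer`, its parameters and the sample names are hypothetical, and mapping values that were never seen during fitting to the replacement string is an assumption of this sketch.

```python
from collections import Counter


class RareValueReplacer:
    """Hypothetical stand-in for the pre-processing model (not the operator's API)."""

    def __init__(self, threshold, relative=False, replacement="Other"):
        self.threshold = threshold
        self.relative = relative
        self.replacement = replacement
        self.frequent_values = set()

    def fit(self, values):
        counts = Counter(values)
        # A relative threshold is read as a fraction of all examples.
        limit = self.threshold * len(values) if self.relative else self.threshold
        # Values occurring `limit` times or fewer count as rare.
        self.frequent_values = {v for v, c in counts.items() if c > limit}
        return self

    def apply(self, values):
        # Assumption: values unknown to the model are replaced as well.
        return [v if v in self.frequent_values else self.replacement for v in values]


training = ["Smith"] * 12 + ["Jones"] * 11 + ["Zaphod", "Quill"]
model = RareValueReplacer(threshold=10).fit(training)
replaced = model.apply(training)                      # the two rare names become "Other"
unseen = model.apply(["Smith", "Quill", "Nguyen"])    # same rule on new data
```

Separating `fit` from `apply` mirrors the model idea: the set of frequent values is determined once on the training data and then reused unchanged on unseen data.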
New Merge operator
The new Merge operator is capable of merging ExampleSets together by appending all attributes to one ExampleSet. Similar to the Append operator, it has an input port extender where you can provide an arbitrary set of ExampleSets, Collections of ExampleSets, or both. The first Example is merged with the first Examples of all other ExampleSets, the second with the second, and so on. Note that this is different from the Join operator: no key attribute is used, which gives Merge the advantage of increased speed when merging many (and maybe large) ExampleSets.
Figure 2 shows the merge of three ExampleSets and the resulting ExampleSet:
Figure 2: sample process showing the merge of three ExampleSets and the resulting ExampleSet
The final merged ExampleSet cannot contain several attributes with the same name, so the Merge operator can be configured to handle duplicate attribute names. One option is to rename all attributes with the same name except the first one. Another option is to keep only the first of the duplicate attributes.
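The two ways of handling duplicate attribute names can be illustrated with a small Python sketch in which each ExampleSet is a dictionary from attribute name to column. The function `merge_columns`, the option names and the `_2`, `_3` suffix scheme are hypothetical; the announcement does not say how the operator names renamed attributes.

```python
def merge_columns(example_sets, on_duplicate="rename"):
    """Append the columns of several tables; each table is a dict name -> list."""
    if on_duplicate not in {"rename", "keep_first"}:
        raise ValueError(f"unknown duplicate handling: {on_duplicate}")
    merged = {}
    for table in example_sets:
        for name, column in table.items():
            if name not in merged:
                merged[name] = column
            elif on_duplicate == "rename":
                # Pick the first free suffixed name (hypothetical naming scheme).
                suffix = 2
                while f"{name}_{suffix}" in merged:
                    suffix += 1
                merged[f"{name}_{suffix}"] = column
            # on_duplicate == "keep_first": later duplicates are dropped.
    return merged


a = {"id": [1, 2], "age": [30, 41]}
b = {"id": [7, 8], "city": ["Bonn", "Kiel"]}
renamed = merge_columns([a, b], on_duplicate="rename")
first_only = merge_columns([a, b], on_duplicate="keep_first")
```

With "rename" both `id` columns survive under different names, while "keep_first" keeps only the `id` column of the first input.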
In addition, attributes with "special" roles have to be handled properly because the resulting merged ExampleSet can only have one "special" attribute per role. The default handling is that the first attribute with a special role keeps this role, while all other attributes with this role will be changed to regular attributes. Another option is to keep only the first attribute with a special role in the merged ExampleSet (all other attributes with the same special role in other ExampleSets are ignored). A third option is changing all special attributes to regular ones.
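The three options for special roles can be written down as a short Python sketch. Attributes are modelled as (name, role) pairs in merge order, with `None` standing for a regular attribute. The function `resolve_roles` and the mode names are hypothetical and only serve to make the three behaviours concrete.

```python
def resolve_roles(attributes, mode="first_keeps_role"):
    """attributes: list of (name, role) pairs; role None means regular."""
    if mode not in {"first_keeps_role", "keep_first_only", "all_regular"}:
        raise ValueError(f"unknown mode: {mode}")
    seen_roles = set()
    result = []
    for name, role in attributes:
        if role is None or mode == "all_regular":
            # Regular attributes pass through; "all_regular" strips every role.
            result.append((name, None))
        elif role not in seen_roles:
            # The first attribute with a given role always keeps it.
            seen_roles.add(role)
            result.append((name, role))
        elif mode == "first_keeps_role":
            # Later attributes with a taken role become regular.
            result.append((name, None))
        # mode == "keep_first_only": later attributes with a taken role are dropped.
    return result


attrs = [("id_a", "id"), ("age", None), ("id_b", "id"), ("target", "label")]
default = resolve_roles(attrs)
first_only = resolve_roles(attrs, mode="keep_first_only")
regular = resolve_roles(attrs, mode="all_regular")
```

In this example `id_b` becomes a regular attribute in the default mode, disappears with "keep_first_only", and all four attributes end up regular with "all_regular".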
The Merge operator also keeps all annotations of all input ExampleSets. However, just like attribute names, annotation names must be unique. For duplicate annotation names, the Merge operator again provides two options: renaming the duplicates or keeping only the first annotation.
Figure 3 demonstrates the usage of the new Merge operator for ExampleSets where some attribute names and some special roles occur twice:
Figure 3: Merging two ExampleSets where attribute names and special roles occur twice
If the input ExampleSets differ in size, the resulting ExampleSet has the size of the largest input ExampleSet. Attributes which originate from smaller ExampleSets will have missing values for the last examples.
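The size rule can be sketched in Python as follows, again with each ExampleSet as a dictionary of columns. The function `pad_and_merge` is hypothetical, missing values are represented by `None`, and attribute names are assumed to be unique across the inputs.

```python
def pad_and_merge(example_sets, missing=None):
    """Merge tables by row position; shorter columns are padded at the end."""
    longest = max(
        (len(column) for table in example_sets for column in table.values()),
        default=0,
    )
    merged = {}
    for table in example_sets:
        for name, column in table.items():
            # Row i of every input lands in row i of the result.
            merged[name] = list(column) + [missing] * (longest - len(column))
    return merged


names = {"name": ["Ann", "Bob", "Cy"]}
scores = {"score": [0.9]}
merged = pad_and_merge([names, scores])
rows = list(zip(*merged.values()))
```

Here the result has three rows, and the `score` attribute from the smaller input is missing in the last two.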
Improvements for the Parametric Probability Estimator operator
This operator makes an assumption about the underlying distribution, and we calculate the probability that this assumption is true using a Kolmogorov-Smirnov test. With this new release, you can define a threshold on the likelihood of a correct assumption. Attributes for which the assumption does not hold are not shown.
The probabilities for the assumption are also provided as an Attribute Weight vector for further use, e.g. with a Select by Weights operator.
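To make the thresholding idea concrete, here is a minimal Python sketch of a one-sample Kolmogorov-Smirnov test against a normal distribution, followed by a weight-based selection. It is not the operator's implementation: the function names, the choice of a normal distribution, the asymptotic p-value approximation with a small-sample correction, and the threshold of 0.05 are all assumptions. Note also that fitting mean and standard deviation on the same sample makes a plain KS p-value optimistic.

```python
import math
from statistics import NormalDist, fmean, stdev


def ks_normal_p_value(sample):
    """Approximate p-value of a KS test against a normal fitted to the sample."""
    n = len(sample)
    dist = NormalDist(fmean(sample), stdev(sample))
    ordered = sorted(sample)
    # D is the largest gap between the empirical CDF and the assumed CDF.
    d = max(
        max((i + 1) / n - dist.cdf(x), dist.cdf(x) - i / n)
        for i, x in enumerate(ordered)
    )
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0  # the series below converges slowly here; p is practically 1
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam) for k in range(1, 101)
    )
    return min(1.0, max(0.0, 2.0 * series))


def select_attributes(columns, threshold=0.05):
    """Return the p-values as weights and the attribute names passing the threshold."""
    weights = {name: ks_normal_p_value(values) for name, values in columns.items()}
    kept = [name for name, weight in weights.items() if weight >= threshold]
    return weights, kept


n = 50
columns = {
    "normal_like": [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)],
    "skewed": [0.0] * 45 + [100.0] * 5,
}
weights, kept = select_attributes(columns, threshold=0.05)
```

In this example only `normal_like` passes the threshold, and the `weights` dictionary plays the role of the Attribute Weight vector that a later selection step could use.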
- Improved documentation
- Create ExampleSet now uses nominal as its value type for text-input data