I often find myself using every available feature (i.e. attribute) in a data analysis and then running into what is sometimes called the "curse of dimensionality". Why? Simply put, a large number of features makes it harder for machine learning models to extract the underlying pattern I am searching for, and model performance suffers as a result. The solution to this problem is feature selection: keeping the features that carry information about the problem you are working on and removing the ones that do not.
RapidMiner provides some great out-of-the-box tools for feature selection, for example weighting operators such as Weight by Correlation or Weight by Information Gain. More advanced feature selection operators, such as Forward Selection and Backward Elimination, are also available. A more extensive overview of the possibilities of feature weighting in RapidMiner can be found in @mschmitz's excellent KB article: Feature Selection Part 1: Feature Weighting.
If you want to go into more detail and investigate the performance of feature selection itself, you may want to try out the dedicated Feature Selection Extension in the RapidMiner Marketplace. In this article I will showcase the usage of the extension and what you can do with it. The processes shown in this article are also attached, so feel free to test them on your own data set.
Note that "attributes" are often called "features" when discussing feature selection – they are the same thing. In this article I will use the word "feature" throughout. For more information (and more theoretical background) about feature selection and the Feature Selection Extension for RapidMiner, please have a look at Sangkyun Lee, Benjamin Schow, and Viswanath Sivakumar, Feature Selection for High-Dimensional Data w...
There are four main reasons to apply feature selection on a machine learning problem:
RapidMiner has several operators that weight the features of your data set. Figure 1 gives a short overview. The Select by Weights operator can then use these weights to reduce the number of features, so that you only work on the most relevant ones according to the chosen weighting method.
Any method that selects features by weighting is called a "filter method". Its disadvantage is that interactions between features are not taken into account. So-called "wrapper methods", on the other hand, search through candidate subsets of features and select the best-performing one. RapidMiner has several wrapper operators for feature selection, such as Forward Selection and Backward Elimination. These two operators iteratively add features to, or remove features from, the subset and train a model on each candidate subset. Model performance determines which subset is best. Note that wrapper methods are typically computationally expensive and carry a risk of overfitting.
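To make the wrapper idea more concrete, here is a minimal Python sketch of a greedy forward selection. It is not the implementation behind RapidMiner's Forward Selection operator; the function names, the `min_gain` stopping rule and the toy scoring function are assumptions made for illustration. In a real setting, the scoring function would train and validate a model on the given subset.

```python
from typing import Callable, FrozenSet, Iterable, List


def forward_selection(
    features: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
    min_gain: float = 0.0,
) -> List[str]:
    """Greedy forward selection: repeatedly add the feature that helps most."""
    remaining = list(features)
    selected: List[str] = []
    best_score = float("-inf")
    while remaining:
        # Score every subset obtained by adding one more feature.
        candidates = [(score(frozenset(selected + [f])), f) for f in remaining]
        top_score, top_feature = max(candidates)
        # Stop as soon as no single addition improves the score enough.
        if top_score - best_score <= min_gain:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected


def toy_score(subset: FrozenSet[str]) -> float:
    """Hypothetical stand-in for 'train a model on this subset and validate it'.

    Features a and b are mainly useful together, c is useful on its own,
    and d carries no information.
    """
    value = 0.0
    if "c" in subset:
        value += 0.3
    if {"a", "b"} <= subset:
        value += 0.5
    elif "a" in subset or "b" in subset:
        value += 0.1
    return value - 0.01 * len(subset)  # small cost per feature


chosen = forward_selection(["a", "b", "c", "d"], toy_score)
print(chosen)
```

Because the score is computed on whole subsets, the interaction between a and b is picked up, which a pure weighting approach would miss. The price is one model evaluation per candidate in every round.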
The Feature Selection Extension of RapidMiner offers additional algorithms which bridge the gap between the simple filter methods and the expensive wrapper methods. One example is the Minimum Redundancy Maximum Relevance (MRMR) feature selection algorithm. Its basic idea is to iteratively add the feature that is most relevant to the label and least redundant with the previously selected features. The algorithm is implemented by the Select by MRMR / CFS operator in the Feature Selection Extension.
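The greedy idea behind MRMR can be sketched in a few lines of Python. This is only an illustration, not the code of the Select by MRMR / CFS operator: it assumes absolute Pearson correlation as the measure for both relevance and redundancy and combines them by subtraction, and the names `pearson`, `mrmr` and the toy columns are made up for this example.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if one of the inputs is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def mrmr(columns: Dict[str, Sequence[float]], label: Sequence[float], k: int) -> List[str]:
    """Greedily pick k features by relevance minus mean redundancy."""
    relevance = {name: abs(pearson(col, label)) for name, col in columns.items()}
    selected: List[str] = []
    while len(selected) < min(k, len(columns)):
        best_name, best_value = None, float("-inf")
        for name, col in columns.items():
            if name in selected:
                continue
            # Mean absolute correlation with the features picked so far.
            redundancy = (
                sum(abs(pearson(col, columns[s])) for s in selected) / len(selected)
                if selected
                else 0.0
            )
            value = relevance[name] - redundancy
            if value > best_value:
                best_name, best_value = name, value
        selected.append(best_name)
    return selected


# Toy data: the label depends on x1 and x2; x1_noisy is almost a copy of x1.
x1 = [1, -1, 1, -1, 1, -1, 1, -1]
x2 = [1, 1, -1, -1, 1, 1, -1, -1]
columns = {"x1": x1, "x1_noisy": [1, -1, 1, -1, 1, -1, 1, -0.5], "x2": x2}
label = [2 * a + b for a, b in zip(x1, x2)]

selected = mrmr(columns, label, k=2)
by_relevance = sorted(
    columns, key=lambda n: abs(pearson(columns[n], label)), reverse=True
)[:2]
print(selected, by_relevance)
```

In this toy example, ranking by relevance alone keeps x1 and its near-copy, while the redundancy term makes the MRMR sketch prefer x2 as the second feature.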
It is crucial to balance the number of selected features. Removing too many might result in a loss of information, but keeping too many might not deliver the desired benefit. The right amount of reduction can be found by investigating the performance and the stability of the feature selection. Figure 2 shows a RapidMiner process that evaluates the stability and the performance of an MRMR feature selection algorithm applied to the Sonar sample data set:
The Loop operator loops over the number of selected features. In each of the 10 iterations, the Ensemble-FS operator is used on a different subset of the input data sample to evaluate the stability (also called robustness) of the feature selection. The 10 selected feature sets are then compared by calculating the so-called Jaccard index (here called robustness). For two sets, this index is the size of their intersection divided by the size of their union, so it describes how similar the different subsets are. A robustness value close to 1 indicates that the subsets do not differ much and the feature selection can be considered stable.
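As a rough illustration of the robustness measure, the following Python sketch computes the Jaccard index for pairs of selected feature sets and averages it over all pairs. Averaging over pairs is an assumption made here; the exact aggregation used by the Ensemble-FS operator may differ, and the feature sets are invented for the example.

```python
from itertools import combinations
from typing import Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def robustness(subsets: Sequence[Set[str]]) -> float:
    """Mean Jaccard index over all pairs of selected feature sets."""
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical feature sets selected on three different data subsets.
runs = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "c"}]
score = robustness(runs)
print(round(score, 3))
```

Identical selections in every run would give a value of 1, while completely disjoint selections would give 0.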
The features selected by the Ensemble-FS operator are then used to train a random forest model inside the Cross Validation operator. Thus, for every group of selected features, we evaluate the corresponding performance of the random forest model. The process yields two results: the robustness of the feature selection as a function of the number of selected features (see Figure 3), and the performance as a function of the number of selected features (see Figure 4):
Figure 3 shows that the robustness improves when more features are selected. For 30 or more selected features, the robustness reaches a relatively constant level; hence, in this example, the selection can be considered stable for 30 or more features. The performance also increases with the number of selected features, but only when selecting fewer than 10; it levels off when selecting 10 or more features. Hence, in this example analysis of the Sonar data set, the number of selected features can be set to 30 to obtain a stable feature selection with the highest possible performance.
RapidMiner also makes it possible to automatically analyze the feature weights produced by the different weighting methods it provides. Figure 5 shows an example process for doing this (N.B. this process uses the Operator Toolbox Extension).
The first Loop operator loops over the different weighting methods. For each method a second Loop performs 30 iterations of the same weighting. The results are appended and the 30 iterations are averaged by the Aggregate operator. Thus the final ExampleSet contains the average and the standard deviation of the weights for every feature and every method.
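The repeat-and-aggregate step for a single weighting method can be sketched in Python as follows. This is an illustration under stated assumptions, not an export of the RapidMiner process: each iteration draws a bootstrap sample (rows sampled with replacement), the weighting function `mean_gap_weight` is a simple hypothetical stand-in for a real weighting operator, and the data is synthetic.

```python
import random
from statistics import mean, stdev
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, float]


def mean_gap_weight(rows: Sequence[Row], feature: str, label: str = "label") -> float:
    """Stand-in weighting: absolute gap between the two class means."""
    pos = [r[feature] for r in rows if r[label] == 1]
    neg = [r[feature] for r in rows if r[label] == 0]
    if not pos or not neg:
        return 0.0  # a bootstrap sample may miss one class entirely
    return abs(mean(pos) - mean(neg))


def bootstrap_weights(
    rows: Sequence[Row],
    features: Sequence[str],
    weight_fn: Callable[[Sequence[Row], str], float],
    iterations: int = 30,
    seed: int = 0,
) -> Dict[str, Tuple[float, float]]:
    """Return (average, standard deviation) of each feature's weight."""
    rng = random.Random(seed)
    runs: Dict[str, List[float]] = {f: [] for f in features}
    for _ in range(iterations):
        # Bootstrap: draw as many rows as the input has, with replacement.
        sample = rng.choices(rows, k=len(rows))
        for f in features:
            runs[f].append(weight_fn(sample, f))
    return {f: (mean(w), stdev(w)) for f, w in runs.items()}


# Synthetic data: "signal" follows the label, "noise" is unrelated to it.
data_rng = random.Random(42)
rows = [
    {"label": i % 2, "signal": i % 2 + data_rng.gauss(0, 0.1), "noise": data_rng.random()}
    for i in range(40)
]

summary = bootstrap_weights(rows, ["signal", "noise"], mean_gap_weight)
for name, (avg, sd) in summary.items():
    print(f"{name}: {avg:.3f} +/- {sd:.3f}")
```

Looping this over several weighting functions would mirror the outer loop of the process; the standard deviation gives a feel for how much each weight depends on the particular sample.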
Inside the second Loop operator, a Sample (Bootstrapping) operator is used to sample a bootstrapped subset of the input data. The Select Subprocess operator is used to select the current weighting method for this iteration. The results for the feature weights for the MRMR feature selection algorithm are shown in Figure 6:
This process shows another trick for investigating feature selection: using the Generate Attributes operator to generate a purely random feature in the input data. This random feature is also included in the weighting algorithm. All other features that receive a weight less than or equal to the random one are considered to provide equal or less information than a purely random feature. This can be used to define a threshold on the feature weights.
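The thresholding step itself is simple and might look like this in Python. The function name, the feature names and the weight values are hypothetical and do not come from the Sonar results.

```python
from typing import Dict, List


def above_random_probe(weights: Dict[str, float], probe: str = "random") -> List[str]:
    """Keep features whose weight is strictly greater than the probe's weight."""
    threshold = weights[probe]
    kept = [name for name, w in weights.items() if name != probe and w > threshold]
    # Strongest features first.
    return sorted(kept, key=lambda name: weights[name], reverse=True)


# Illustrative average weights, including the purely random feature.
weights = {"f1": 0.42, "f2": 0.05, "f3": 0.31, "f4": 0.08, "random": 0.08}
kept = above_random_probe(weights)
print(kept)
```

Here f4 is dropped because its weight only equals that of the random feature, matching the "less than or equal" rule described above.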
Feature selection is an important and advanced topic in a data science project. A good selection can largely improve the performance of your machine learning models and enable you to concentrate on the most relevant features. RapidMiner's Feature Selection Extension enables you to apply advanced feature selection methods and also dig deep into the stability and the performance of your feature selection processes. Feel free to test it and extract the best out of your data.
Author: @tftemme, January 2018