Feature Selection Part 2: Using the Feature Selection Extension
I often run into the situation where I use every possible feature (i.e. attribute) in my data analysis and am then stumped by what is sometimes called the "curse of dimensionality". Why? Simply put, a large number of features makes it difficult for machine learning models to extract the underlying pattern I am searching for, and hence the performance of the models suffers. The solution to this problem is feature selection: keeping the features which contain information about the problem you are working on and removing the ones which do not.
RapidMiner provides you with some great out-of-the-box tools for feature selection, for example weighting operators such as Weight by Correlation or Weight by Information Gain. More advanced feature selection operators such as Forward Selection and Backward Elimination are also available in RapidMiner. An extended overview of the possibilities of feature weighting in RapidMiner can be found in @mschmitz's excellent KB article: Feature Selection Part 1: Feature Weighting.
If you want to go into more detail and investigate the performance of feature selection itself, you may want to try out the dedicated Feature Selection Extension in the RapidMiner Marketplace. In this article I will showcase the usage of the extension and what you can do with it. The processes shown in this article are also attached, so feel free to test them on your own data set.
Note that "attributes" are often called "features" when discussing the topic of feature selection; they are the same thing. For this article I will use the word "feature" throughout. For more information (and more theoretical background) about feature selection and the Feature Selection Extension for RapidMiner, please have a look at: Sangkyun Lee, Benjamin Schow, and Viswanath Sivakumar, Feature Selection for High-Dimensional Data with RapidMiner, Technical Report SFB 876, TU Dortmund, 01/2011.
Why Feature Selection?
There are four main reasons to apply feature selection to a machine learning problem:
- Less complex models which handle a smaller number of features are easier to interpret and in general more explanatory. Concentrating on the most relevant features enables data scientists to explain the decision making of the models to engineers, managers and users.
- A data set with a high number of features also needs a high number of examples to describe its statistical properties. In fact, the number of examples needed to comprehensively describe the data set grows exponentially with the number of features.
- A larger number of features increases the runtime of both the training and the application of machine learning models. Removing irrelevant features reduces the runtime of the machine learning algorithms.
- A large number of features often comes with a high variance in several of the features. This variance can decrease the stability and performance of the machine learning models trained on this data set.
Feature Selection Methods - Filter and Wrapper Methods
RapidMiner has several operators which can weight the features of your data set. Figure 1 gives a short overview. The weights can be used by the Select by Weights operator to reduce the number of features and work only on the most relevant ones, according to the weighting method used.
Figure 1: Overview of the Feature Weights operators in RapidMiner.
Any method that selects features by weighting is called a "filter method". It has the disadvantage that interactions between features are not taken into account. So-called "wrapper methods", on the other hand, search through the space of feature subsets and select the best performing one they find. RapidMiner has several wrapper method operators for feature selection, such as Forward Selection and Backward Elimination. Forward Selection iteratively adds features to the subset and Backward Elimination iteratively removes them; in both cases a model is trained on each candidate subset, and the model performance is used to determine the best performing feature subset. Note that wrapper methods are typically computationally expensive and carry a risk of overfitting.
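To make the wrapper idea more concrete, here is a minimal Python sketch of a greedy forward selection. It is not the implementation of the Forward Selection operator. The function names and the toy_performance function are hypothetical; in a real wrapper the evaluate callback would train a model on the given subset and return its cross-validated performance.

```python
from typing import Callable, FrozenSet, List, Sequence


def forward_selection(
    candidates: Sequence[str],
    evaluate: Callable[[FrozenSet[str]], float],
    max_features: int,
) -> List[str]:
    """Greedy wrapper search: repeatedly add the feature that improves the score most."""
    selected: List[str] = []
    best_score = float("-inf")
    while len(selected) < max_features:
        remaining = [f for f in candidates if f not in selected]
        if not remaining:
            break
        # Evaluate every one-feature extension of the current subset.
        score, feature = max(
            (evaluate(frozenset(selected + [f])), f) for f in remaining
        )
        if score <= best_score:
            break  # no extension improves the score, so stop early
        selected.append(feature)
        best_score = score
    return selected


def toy_performance(subset: FrozenSet[str]) -> float:
    """Hypothetical stand-in for a cross-validated model performance."""
    score = 0.5
    score += 0.2 * ("a" in subset)
    score += 0.1 * ("b" in subset)
    score += 0.15 * ("a" in subset and "b" in subset)  # interaction of a and b
    return score - 0.01 * len(subset)  # small penalty per extra feature


chosen = forward_selection(["a", "b", "c", "d"], toy_performance, max_features=4)
print(chosen)  # ['a', 'b']
```

Because every step evaluates one model per remaining feature, the cost grows quickly with the number of features, which is why wrapper methods are considered expensive.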
MRMR - Minimum Redundancy Maximum Relevance
The Feature Selection Extension of RapidMiner offers additional algorithms which bridge the gap between the simple filter methods and the expensive wrapper methods. One example is the Minimum Redundancy Maximum Relevance (MRMR) feature selection algorithm. Its basic idea is to iteratively add the feature which is most relevant to the label and least redundant with the previously selected features. The algorithm is implemented by the Select by MRMR / CFS operator in the Feature Selection Extension.
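The greedy idea behind MRMR can be sketched in a few lines of Python. This is only an illustration under assumptions of my own, not the code of the Select by MRMR / CFS operator: relevance and redundancy are both measured by the absolute Pearson correlation (MRMR is often formulated with mutual information instead), the two are combined as a simple difference, and the feature names and values are made up.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if one of the inputs is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def mrmr(features: Dict[str, Sequence[float]], label: Sequence[float], k: int) -> List[str]:
    """Greedily pick k features maximizing relevance minus mean redundancy."""
    relevance = {name: abs(pearson(col, label)) for name, col in features.items()}
    selected: List[str] = []
    while len(selected) < min(k, len(features)):
        best_name, best_score = None, float("-inf")
        for name, col in features.items():
            if name in selected:
                continue
            # Redundancy: mean absolute correlation with already selected features.
            redundancy = (
                sum(abs(pearson(col, features[s])) for s in selected) / len(selected)
                if selected
                else 0.0
            )
            score = relevance[name] - redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Made-up data: f1_dup is nearly a copy of f1, f2 carries different information.
label = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f1": [0.1, 0.2, 0.1, 0.3, 0.9, 0.8, 1.0, 0.7],
    "f1_dup": [0.2, 0.1, 0.2, 0.3, 0.8, 0.9, 0.9, 0.6],
    "f2": [1, 0, 0, 0, 1, 1, 0, 1],
}
print(mrmr(features, label, k=2))  # ['f1', 'f2']
```

A pure relevance ranking would pick f1 and f1_dup here, since both correlate strongly with the label. The redundancy term makes the sketch prefer f2 as the second feature.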
How many features to select?
It is crucial to balance the number of selected features. Removing too many might result in a loss of information, but keeping too many might not result in the desired benefit. The right amount of reduction can be found by investigating the performance and the stability of the feature selection. Figure 2 shows a RapidMiner process in which the stability and the performance of an MRMR feature selection algorithm applied to the Sonar sample data set are evaluated:
Figure 2: RapidMiner process to evaluate the stability and performance of the MRMR feature selection algorithm as a function of the number of selected features
The Loop operator loops over the number of selected features. In each of the 10 iterations, the Ensemble-FS operator is used on a different subset of the input data sample to evaluate the stability (also called robustness) of the feature selection. The 10 selected feature sets are then compared by calculating the so-called Jaccard index (here called robustness). For two sets, this index is the size of their intersection divided by the size of their union, so it describes how similar the different subsets are. A robustness value close to 1 indicates that the subsets do not differ much and the feature selection can be considered to be stable.
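As a small illustration of the robustness measure, the following Python sketch computes the Jaccard index of two feature sets and averages it over all pairs of selected sets. The averaging over pairs is an assumption on my side; how the extension aggregates the individual comparisons is not described here. The feature names are made up.

```python
from itertools import combinations
from typing import Iterable, Sequence


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty selections are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def robustness(subsets: Sequence[Iterable[str]]) -> float:
    """Mean pairwise Jaccard index over all selected feature sets."""
    if len(subsets) < 2:
        raise ValueError("need at least two feature sets to compare")
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical selections from three different data subsets.
runs = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "c"}]
print(jaccard(runs[0], runs[1]))  # 0.5
print(robustness(runs))
```

Identical selections in every run give a robustness of 1, and selections without any common feature give 0.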
The features selected by the Ensemble-FS operator are then used to train a random forest model inside the Cross Validation operator. Thus for every group of selected features, we evaluate the corresponding performance of the random forest model. The results of this process are the robustness of the feature selection as a function of the number of selected features (see Figure 3), and the performance as a function of the number of selected features (see Figure 4):
Figure 3: Dependency of the robustness of the MRMR feature selection algorithm on the number of selected features. The robustness improves with the number of selected features. With 30 selected features it reaches a constant level.
Figure 4: Dependency of the performance of the MRMR feature selection algorithm on the number of selected features. The performance improves with the number of selected features and stays at a constant level for 10 or more selected features.
Figure 3 shows that the robustness improves when more features are selected. For 30 or more selected features, the robustness reaches a relatively constant level; hence the selection can be considered to be stable in the example for 30 or more features. The performance also increases with the number of selected features, but only when selecting fewer than 10; performance levels off when selecting 10 or more features. Hence in this example analysis of the Sonar data set, the number of selected features can be set to 30 to obtain a stable feature selection with the highest possible performance.
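The reasoning above can also be expressed as a simple rule: find the smallest number of features from which each curve stays close to its best value, and take the larger of the two. The Python sketch below is one possible way to do this and is not part of the attached processes. The function name, the tolerance and all numbers are hypothetical; the values are not read off Figures 3 and 4.

```python
from typing import Dict


def plateau_start(curve: Dict[int, float], tolerance: float = 0.02) -> int:
    """Smallest k from which all later values stay within `tolerance` of the best value."""
    ks = sorted(curve)
    best = max(curve.values())
    for i, k in enumerate(ks):
        if all(best - curve[j] <= tolerance for j in ks[i:]):
            return k
    return ks[-1]  # no plateau found: fall back to the largest k


# Hypothetical curves: number of selected features -> measured value.
robustness_curve = {5: 0.40, 10: 0.55, 20: 0.70, 30: 0.80, 40: 0.81, 50: 0.80}
performance_curve = {5: 0.70, 10: 0.80, 20: 0.81, 30: 0.80, 40: 0.81, 50: 0.80}

stable_from = plateau_start(robustness_curve)   # 30
good_from = plateau_start(performance_curve)    # 10
print(max(stable_from, good_from))              # 30
```

The tolerance decides what counts as "relatively constant", so it should be chosen with the noise of the curves in mind.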
Automatic Analysis of Feature Weights with RapidMiner
RapidMiner also offers the possibility to automatically analyze the feature weights for the different methods provided in RapidMiner. Figure 5 shows an example process for doing this (N.B. this process uses the Operator Toolbox Extension).
Figure 5: RapidMiner process to automatically analyze the feature weights for 8 different weighting methods
The first Loop operator loops over the different weighting methods. For each method a second Loop performs 30 iterations of the same weighting. The results are appended and the 30 iterations are averaged by the Aggregate operator. Thus the final ExampleSet contains the average and the standard deviation of the weights for every feature and every method.
Inside the second Loop operator, a Sample (Bootstrapping) operator is used to sample a bootstrapped subset of the input data. The Select Subprocess operator is used to select the current weighting method for this iteration. The results for the feature weights for the MRMR feature selection algorithm are shown in Figure 6:
Figure 6: Averaged feature weights calculated by the MRMR feature selection algorithm
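The bootstrapping and averaging logic of this process can be mimicked outside of RapidMiner. The following Python sketch is a rough analogue and makes several assumptions: the weighting method is the absolute Pearson correlation, the function names are hypothetical, and the data is synthetic. It repeats the weighting on 30 bootstrapped samples and reports the average and standard deviation per feature.

```python
import random
from math import sqrt
from statistics import mean, stdev
from typing import Callable, Dict, Sequence, Tuple


def abs_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Absolute Pearson correlation; 0.0 if one of the inputs is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov / sqrt(var_x * var_y))


def bootstrap_weights(
    features: Dict[str, Sequence[float]],
    label: Sequence[float],
    weight_fn: Callable[[Sequence[float], Sequence[float]], float],
    iterations: int = 30,
    seed: int = 0,
) -> Dict[str, Tuple[float, float]]:
    """Return (average, standard deviation) of the weights over bootstrapped samples."""
    rng = random.Random(seed)
    n = len(label)
    collected = {name: [] for name in features}
    for _ in range(iterations):
        # Bootstrapping: draw n example indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        y = [label[i] for i in idx]
        for name, col in features.items():
            collected[name].append(weight_fn([col[i] for i in idx], y))
    return {name: (mean(w), stdev(w)) for name, w in collected.items()}


# Synthetic data: "strong" follows the label, "noise" just alternates.
label = [0] * 10 + [1] * 10
features = {
    "strong": [i / 20 for i in range(20)],
    "noise": [i % 2 for i in range(20)],
}
stats = bootstrap_weights(features, label, abs_correlation)
for name, (avg, sd) in stats.items():
    print(f"{name}: average={avg:.3f}, standard deviation={sd:.3f}")
```

A large standard deviation relative to the average indicates that the weight of a feature depends strongly on the sample it was calculated on.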
This process shows another trick for investigating feature selection: using the Generate Attributes operator to generate a purely random feature in the input data. This random feature is also included in the weighting algorithm. All other features that receive a weight less than or equal to the random one are considered to provide equal or less information than a purely random feature. This can be used to define a threshold on the feature weights.
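Applied to a table of averaged weights, the random-feature threshold boils down to a simple filter. Here is a minimal Python sketch; the function name, the feature names and the weights are made up for illustration.

```python
from typing import Dict, List


def above_random(weights: Dict[str, float], random_name: str = "random") -> List[str]:
    """Keep only features whose weight is strictly above that of the random feature."""
    threshold = weights[random_name]
    return [
        name
        for name, weight in weights.items()
        if name != random_name and weight > threshold
    ]


# Hypothetical averaged weights, including the generated random feature.
weights = {"f1": 0.80, "f2": 0.35, "f3": 0.05, "random": 0.10}
print(above_random(weights))  # ['f1', 'f2']
```

Since the weight of the random feature itself varies from sample to sample, it is safer to compare averaged weights, as in the process above, than weights from a single run.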
Feature selection is an important and advanced topic in a data science project. A good selection can largely improve the performance of your machine learning models and enable you to concentrate on the most relevant features. RapidMiner's Feature Selection Extension enables you to apply advanced methods of feature selection and also dig deep into the stability and the performance of your feature selection processes. Feel free to test it and extract the best out of your data.
Author: @tftemme, January 2018