Feature Selection Part 2: Using the Feature Selection Extension

tftemme · January 2018

Feature Selection - Part 2

I often find myself using every possible feature (i.e. attribute) in a data analysis and then being stumped by what is sometimes called the "curse of dimensionality". Why? Simply put, a large number of features makes it difficult for machine learning models to extract the underlying pattern I am searching for, and hence the performance of the models becomes worse. The solution to this problem is feature selection: choosing the features which contain information about the problem you are working on and removing the ones without any information.

RapidMiner provides you with some great out-of-the-box tools for feature selection, for example weighting operators such as Weight by Correlation or Weight by Information Gain. More advanced feature selection operators such as Forward Selection and Backward Elimination are also available. An extended overview of the possibilities of feature weighting in RapidMiner can be found in @mschmitz's excellent KB article: Feature Selection Part 1: Feature Weighting.

If you want to go into more detail and investigate the performance of feature selection itself, you may want to try out the dedicated Feature Selection Extension in the RapidMiner Marketplace. In this article I will showcase the usage of the extension and what you can do with it. The processes shown in this article are also attached, so feel free to test them on your own data set.

Note that "attributes" are often called "features" when discussing the topic of feature selection – they are the same thing. For this article I will use the word "feature" throughout. For more information (and more theoretical background) about feature selection and the Feature Selection Extension for RapidMiner, please have a look at Sangkyun Lee, Benjamin Schow, and Viswanath Sivakumar, Feature Selection for High-Dimensional Data with RapidMiner, Technical Report SFB 876, TU Dortmund, 01/2011.

Why Feature Selection?

There are four main reasons to apply feature selection on a machine learning problem:

Less complex models which handle a smaller number of features are easier to interpret and in general more explanatory. Concentrating on the most relevant features enables data scientists to explain the decision making of the models to engineers, managers and users.
A data set with a high number of features also needs a high number of examples to describe its statistical properties. In fact, the number of examples needed to comprehensively describe the data set grows exponentially with the number of features: covering each of d features with just 10 bins, for example, already yields 10^d cells to fill.
A larger number of features increases the runtime of both the training and the application of machine learning models. Removing irrelevant features reduces the runtime of the machine learning algorithms.
A large number of features often comes with a high variance in several of the features. This variance can decrease the stability and performance of the machine learning models trained on this data set.

Feature Selection Methods - Filter and Wrapper Methods

RapidMiner has several operators which can weight the features of your data set. Figure 1 gives a short overview. The weights can be used by the Select by Weights operator to reduce the number of features and only work on the most relevant ones, according to the used weighting method.

Figure 1: Overview of the Feature Weights operators in RapidMiner.

Any method that selects features by weighting is called a "filter method". Its disadvantage is that interactions between features are not taken into account. So-called "wrapper methods", on the other hand, search through candidate subsets of features and select the best performing one. RapidMiner has several wrapper method operators for feature selection, such as Forward Selection and Backward Elimination. These two operators iteratively add features to, or remove features from, the subset and train a model on it. Model performance is used to determine the best performing feature subset. Note that wrapper methods are typically computationally expensive and carry a risk of overfitting.
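The greedy search behind a wrapper method such as forward selection can be sketched in a few lines of Python. This is only an illustration and not RapidMiner's implementation: forward_selection, the score callback and the toy data are hypothetical names introduced here, and score stands in for whatever validated model performance you would compute for a feature subset.

```python
from typing import Callable, FrozenSet, List, Sequence


def forward_selection(
    features: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    max_features: int,
) -> List[str]:
    """Greedily add the feature that improves the score the most.

    `score` is a hypothetical callback: it should train and validate a model
    on the given feature subset and return its performance (higher is better).
    """
    selected: List[str] = []
    best_score = float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Evaluate every one-feature extension of the current subset.
        scored = [(score(frozenset(selected + [f])), f) for f in candidates]
        top_score, top_feature = max(scored)
        if top_score <= best_score:
            break  # no candidate improves the performance: stop early
        selected.append(top_feature)
        best_score = top_score
    return selected


# Toy setup (an assumption for illustration): each feature "covers" some
# pieces of information; "b" is redundant with "a", and "d" is useless.
COVERS = {"a": {1, 2}, "b": {2}, "c": {3}, "d": set()}


def toy_score(subset: FrozenSet[str]) -> float:
    covered = set().union(*(COVERS[f] for f in subset))
    # Small penalty per feature so that useless features never pay off.
    return len(covered) - 0.01 * len(subset)


chosen = forward_selection(["a", "b", "c", "d"], toy_score, max_features=4)
print(chosen)
```

Because every candidate subset is scored by a full model evaluation, the cost grows quickly with the number of features, which is the computational expense mentioned above.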

MRMR - Minimum Redundancy Maximum Relevance

The Feature Selection Extension of RapidMiner offers additional algorithms which bridge the gap between the simple filter and expensive wrapper methods. One example is the Minimum Redundancy Maximum Relevance feature selection algorithm. Its basic idea is that features are iteratively added which are most relevant to the label, and have the least redundancy to previously selected features. The algorithm is implemented by the Select by MRMR / CFS operator in the Feature Selection Extension.
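The iterative idea can be sketched as follows. This is a minimal sketch under stated assumptions, not the code of the Select by MRMR / CFS operator: relevance and redundancy are both measured with the absolute Pearson correlation, the two are combined by a simple difference, and the function names and toy data are made up for this example.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def mrmr(columns: Dict[str, Sequence[float]], label: Sequence[float], k: int) -> List[str]:
    """Greedy selection by relevance minus mean redundancy (assumed criterion)."""
    relevance = {name: abs(pearson(col, label)) for name, col in columns.items()}
    selected: List[str] = []

    def criterion(name: str) -> float:
        if not selected:
            return relevance[name]
        # Mean absolute correlation with the already selected features.
        redundancy = sum(
            abs(pearson(columns[name], columns[s])) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while len(selected) < min(k, len(columns)):
        candidates = [name for name in columns if name not in selected]
        selected.append(max(candidates, key=criterion))
    return selected


# Toy data: the label is 2 * x1 + x3, and x2 is an exact copy of x1.
x1 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
x3 = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
label = [2 * a + b for a, b in zip(x1, x3)]
columns = {"x1": x1, "x2": list(x1), "x3": x3}

top_two = mrmr(columns, label, k=2)
print(top_two)
```

In this toy data a pure relevance ranking would pick x1 and its copy x2, whereas the redundancy term makes the sketch prefer the less relevant but non-redundant x3 as the second feature.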

How many features to select?

It is crucial to balance the number of selected features. Removing too many might result in a loss of information, but keeping too many might not result in the desired benefit. The right amount of reduction can be found by investigating the performance and the stability of the feature selection. Figure 2 shows a RapidMiner process in which the stability and the performance of an MRMR feature selection algorithm applied to the Sonar Sample data set are evaluated:

Figure 2: RapidMiner process to evaluate the stability and performance of the MRMR feature selection algorithm as a function of the number of selected features

The Loop operator loops over the number of selected features. In each of the 10 iterations, the Ensemble-FS operator is used on a different subset of the input data sample to evaluate the stability (also called robustness) of the feature selection. The 10 selected feature sets are then compared by calculating the so-called Jaccard index (here called robustness), i.e. the size of the intersection of two sets divided by the size of their union. This index describes how similar the different subsets are. A robustness value close to 1 indicates that the subsets do not differ much and the feature selection can be considered to be stable.
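To make the robustness measure concrete, here is a small Python sketch of the Jaccard index for feature sets. Averaging the index over all pairs of runs is an assumption about how the individual comparisons are combined, and the feature names and the three runs are invented for illustration.

```python
from itertools import combinations
from typing import List, Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Intersection size divided by union size; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def robustness(subsets: Sequence[Set[str]]) -> float:
    """Average Jaccard index over all pairs of selected feature sets."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two feature sets to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical result of three selection runs on different data subsets.
runs: List[Set[str]] = [
    {"f1", "f2", "f3"},
    {"f1", "f2", "f4"},
    {"f1", "f2", "f3"},
]
stability = robustness(runs)
print(round(stability, 3))
```

Two of the three pairs share 2 of 4 distinct features (index 0.5) and one pair is identical (index 1.0), so the average is 2/3.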

The features selected by the Ensemble-FS operator are then used to train a random forest model inside the Cross Validation operator. Thus, for every group of selected features, we evaluate the corresponding performance of the random forest model. The results of this process are the robustness of the feature selection as a function of the number of selected features (see Figure 3) and the performance as a function of the number of selected features (see Figure 4):

Figure 3: Dependency of the robustness of the MRMR feature selection algorithm on the number of selected features. The robustness improves with the number of selected features. With 30 selected features it reaches a constant level.

Figure 4: Dependency of the performance of the MRMR feature selection algorithm on the number of selected features. The performance improves with the number of selected features and stays on a constant level for 10 or more selected features.

Figure 3 shows that the robustness improves when more features are selected. For 30 or more selected features, the robustness reaches a relatively constant level; hence the selection can be considered to be stable in this example for 30 or more features. The performance also increases with the number of selected features, but only when selecting fewer than 10; it levels off when selecting 10 or more features. Hence in this example analysis of the Sonar data set, the number of selected features can be set to 30 to obtain a stable feature selection with the highest possible performance.

Automatic Analysis of Feature Weights with RapidMiner

RapidMiner also offers the possibility to automatically analyze the feature weights for the different methods provided in RapidMiner. Figure 5 shows an example process for doing this (N.B. this process uses the Operator Toolbox Extension).

Figure 5: RapidMiner process to automatically analyze the feature weights for 8 different weighting methods

The first Loop operator loops over the different weighting methods. For each method a second Loop performs 30 iterations of the same weighting. The results are appended and the 30 iterations are averaged by the Aggregate operator. Thus the final ExampleSet contains the average and the standard deviation of the weights for every feature and every method.

Inside the second Loop operator, a Sample (Bootstrapping) operator is used to sample a bootstrapped subset of the input data. The Select Subprocess operator is used to select the current weighting method for this iteration. The results for the feature weights for the MRMR feature selection algorithm are shown in Figure 6:

Figure 6: Averaged feature weights calculated by the MRMR feature selection algorithm
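Outside of RapidMiner, the inner loop (bootstrap a sample, compute the weights, then aggregate to average and standard deviation) could look roughly like the Python sketch below. The weighting function mean_difference is a deliberately simple stand-in chosen for this example and is not one of the RapidMiner weighting operators; all names and the toy data are assumptions.

```python
import random
import statistics
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]


def bootstrap_weights(
    rows: Sequence[Row],
    weigh: Callable[[Sequence[Row]], Dict[str, float]],
    iterations: int = 30,
    seed: int = 0,
) -> Dict[str, Dict[str, float]]:
    """Repeat a weighting on bootstrapped samples (iterations >= 2) and
    report the average and standard deviation of the weight per feature."""
    rng = random.Random(seed)
    collected: Dict[str, List[float]] = {}
    for _ in range(iterations):
        # Bootstrapping: draw len(rows) examples with replacement.
        sample = [rng.choice(rows) for _ in rows]
        for feature, weight in weigh(sample).items():
            collected.setdefault(feature, []).append(weight)
    return {
        feature: {"average": statistics.mean(ws), "std": statistics.stdev(ws)}
        for feature, ws in collected.items()
    }


def mean_difference(sample: Sequence[Row]) -> Dict[str, float]:
    """Toy weighting (an assumption): absolute difference of the class means."""
    weights: Dict[str, float] = {}
    for feature in ("x1", "x2"):
        pos = [r[feature] for r in sample if r["label"] == 1.0]
        neg = [r[feature] for r in sample if r["label"] == 0.0]
        if pos and neg:
            weights[feature] = abs(statistics.mean(pos) - statistics.mean(neg))
        else:
            weights[feature] = 0.0  # a class is missing in this sample
    return weights


# Toy data: x1 follows the label, x2 is pure noise.
data_rng = random.Random(42)
rows = [
    {"x1": label + data_rng.gauss(0, 0.3), "x2": data_rng.gauss(0, 0.3), "label": float(label)}
    for label in (0, 1)
    for _ in range(25)
]
summary = bootstrap_weights(rows, mean_difference)
for feature, stats in summary.items():
    print(feature, round(stats["average"], 3), round(stats["std"], 3))
```

The standard deviation across the bootstrap iterations gives a rough idea of how much a single weight can be trusted.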

This process shows another trick for investigating feature selection: using the Generate Attributes operator to generate a purely random feature in the input data. This random feature is also included in the weighting algorithm. All other features that receive a weight less than or equal to the random one are considered to provide equal or less information than a purely random feature. This can be used to define a threshold on the feature weights.
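The thresholding step itself is simple enough to sketch in Python. The weights below are invented numbers and above_random is a hypothetical helper, so treat this as an illustration of the idea rather than of any RapidMiner operator.

```python
from typing import Dict, List


def above_random(weights: Dict[str, float], probe: str = "random_feature") -> List[str]:
    """Keep features whose weight is strictly greater than the random probe's,
    sorted from highest to lowest weight."""
    threshold = weights[probe]
    kept = [name for name, w in weights.items() if name != probe and w > threshold]
    return sorted(kept, key=lambda name: weights[name], reverse=True)


# Hypothetical averaged weights, including a purely random probe feature.
averaged = {"f1": 0.42, "f2": 0.03, "f3": 0.17, "random_feature": 0.05, "f4": 0.05}
kept = above_random(averaged)
print(kept)
```

Here f2 and f4 are dropped because their weights are less than or equal to the weight of the random feature.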

Conclusion

Feature selection is an important and advanced topic in a data science project. A good selection can greatly improve the performance of your machine learning models and enable you to concentrate on the most relevant features. RapidMiner's Feature Selection Extension enables you to apply advanced feature selection methods and also dig deep into the stability and the performance of your feature selection processes. Feel free to test it and extract the best out of your data.

Author: @tftemme, January 2018

vishruth_muthya · October 2018

Hi @tftemme, this is a very nice technique and I am trying to implement it on my dataset. Descriptive statistics: 1600+ columns (15 nominal, 1600 numerical). In the "Automatic Analysis of Feature Weights with RapidMiner" process, once we get the weights for the multiple methods, how can I convert this data into the attribute weight vector expected as input by the Select by Weights operator?

vishruth_muthya · October 2018

Hi @tftemme, this is a very nice technique and I am trying to implement it on my dataset. Descriptive statistics: 1600+ columns (15 nominal, 1600 numerical). Should I determine the optimum number of attributes for each individual data set every time? How about using 30 variables for all datasets, would that be a right approach?

tftemme · October 2018

Hi @vishruth_muthya,

Concerning your first question, you can use the ExampleSet to Weights operator from the Converters extension, to create an attribute weight vector which is used by the Select by Weights operator.

Concerning your second question: the optimum number of attributes depends on your data, hence yes, you would need to determine this for every data set. (By the way, the feature selection itself should also be validated, at least with a hold-out set.) Nevertheless, you can of course start by reducing to 30 variables (maybe due to time constraints). It is not wrong, it's just maybe not the optimal number.

Best regards
Fabian

Uche · November 2021

tftemme,
Thanks for the very detailed write up!
I would like to know if these methods are implemented to execute in a parallelized manner.
Especially because some of these methods do not scale well with increasing data sizes.

Thank you,

Uche

tftemme · November 2021

Hi @Uche,

As far as I know, the specific feature selection algorithms are not implemented in a parallelized manner (I am using the Feature Selection Extension from the RM Marketplace here; I linked resources about it in the opening paragraphs, if you want to check it out in more detail). But the Loop operators in the shared processes operate in parallel, which effectively parallelizes the processing of the proposed investigations.

If you are still running into runtime problems, it may be worthwhile to perform a univariate feature selection first (for example with the Weighting operators), with a rather loose threshold, to get rid of the more clearly non-helpful features. For example, you could use it to reduce data sets to a few hundred features. You can then apply the more advanced feature selection methods mentioned in this post.

Best regards,
Fabian
