Feature Selection

npapan69 · May 2019

Hi everyone,
It is more than clear that feature selection should take place within the cross-validation operator, in order to avoid leaking the labels if placed outside and prior to the CV operator. My question is in regard to the fact that for each CV fold the selected features from mRMR, for example, may differ: which model is the one that I get on the output?
Thanks in advance

varunm1 · May 2019

Hello @npapan69

Putting feature selection inside the Cross Validation operator gives a less biased performance estimate, so the results generalize better. Yes, as you mentioned, with feature selection inside cross-validation there might be 5 different models (in the case of 5 folds) built on 5 different feature sets. The "mod" output of Cross Validation in RapidMiner gives you a model trained on the whole input dataset. This means the model you get might be different from all 5 models created during cross-validation.
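For readers who think in code rather than operators, here is a rough scikit-learn sketch of the same idea. It is not RapidMiner and it does not use mRMR: `SelectKBest` with `f_classif` is only a stand-in selector, the data is synthetic, and the final `fit` on all rows is my analogue of the "mod" output, so treat every name below as an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline

# Synthetic data (illustrative only): 200 rows, 30 features, 5 informative.
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           n_redundant=0, random_state=0)


def make_pipeline():
    # SelectKBest is a stand-in for mRMR; it is refit on each training fold.
    return Pipeline([
        ("select", SelectKBest(f_classif, k=5)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


def selected(pipeline):
    # Column indices kept by the fitted selection step.
    mask = pipeline.named_steps["select"].get_support()
    return sorted(np.flatnonzero(mask).tolist())


cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
result = cross_validate(make_pipeline(), X, y, cv=cv, return_estimator=True)

# One fitted pipeline per fold, each with its own selected feature set.
fold_features = [selected(est) for est in result["estimator"]]
for i, feats in enumerate(fold_features):
    print(f"fold {i}: features {feats}, accuracy {result['test_score'][i]:.3f}")

# Analogue of the "mod" output: one more fit, this time on all the data.
final_model = make_pipeline().fit(X, y)
final_features = selected(final_model)
print("final model features:", final_features)
```

The per-fold feature lists may or may not agree with each other or with the final list; the fold models exist only to estimate performance, and the model you keep is the one fitted on everything.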

rfuentealba · May 2019

Hello, again!

Stop! Stop! Stop! Don't make that answer the right one! (My pride says "delete your answer", but my OCD says "leave it there").

TIL that it is more common to put feature selection inside the cross validation process, because otherwise it leads to biased results. Thanks to @varunm1 for the several links he sent me. I actually got confused (too many hours programming stuff, you know), but this article cleared it up for me: https://rapidminer.com/blog/learn-right-way-validate-models-part-4-accidental-contamination/.
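A small experiment makes the bias visible. This is a sketch in scikit-learn (not RapidMiner), under my own assumptions: the features are pure noise, the labels are random, and `SelectKBest` with `f_classif` stands in for whatever selector you actually use. Since there is nothing to learn, an honest estimate should hover around chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))   # pure noise features
y = rng.integers(0, 2, size=100)   # labels unrelated to X

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Selection OUTSIDE the CV: the selector sees every label before the split.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=cv).mean()

# Selection INSIDE the CV: the selector is refit on each training fold only.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest = cross_val_score(pipe, X, y, cv=cv).mean()

print(f"selection outside CV: {leaky:.3f}")
print(f"selection inside CV:  {honest:.3f}")
```

The outside-CV score should come out noticeably higher even though the data carries no signal, which is the accidental contamination the article describes.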

Despite my lapse (and now that I understand the question), I can focus on this:

My question is in regard to the fact that for each CV fold the selected features from mRMR, for example, may differ: which model is the one that I get on the output?

Let's see:

On each cross validation fold, the selected features may differ.
In the RapidMiner documentation for the Cross Validation operator, it says:

Also the number of iterations that will take place is the same as the number of folds. If the model output port is connected, the Training subprocess is repeated one more time with all Examples to build the final model.

So, the correct answer is: you get the model trained with all the data (not a specific fold), but only if the mod port is connected. Because the whole Training subprocess is repeated on all Examples, feature selection included, that final model uses the features selected from the full dataset, not the set from any single fold.

All the best,

Rodrigo.

rfuentealba · May 2019

Hello @npapan69,

Answers below:

It is more than clear that feature selection should take place within the cross-validation operator,

I don't know what you are referring to. Feature selection can be put elsewhere, even in a different process. It all depends on what you are trying to achieve, in reality. Am I missing something?

In order to avoid leaking the labels if placed outside and prior to the CV operator.

Would you mind sharing your XML so we can see what is happening?

My question is in regard to the fact that for each CV fold the selected features from mRMR, for example, may differ: which model is the one that I get on the output?

Now I get it.

No, feature selection should be done before the cross validation process, not inside the cross validation process. What you are trying to accomplish will lead to certain example subsets having different columns, and a model that is both unpredictable and poorly trained.

Again, would you mind sharing your XML so we can see what is happening?

All the best,

Rodrigo.

jacobcybulski · June 2019

I know this has been sorted out before, but let me dig the confusion back out...

I think we have two very different problems here:

1. evaluating a process to arrive at the best model for data;
2. evaluating the model to be later deployed.

I think the selected solution is looking at #1, which aims to evaluate the process capable of generating a deployable model. We believe that the resulting model will perform according to the cross-validation, and so, quite correctly, feature engineering should be inside the cross-validation loop. In fact, in the process of experimenting we are likely to improve the model or the selection of features - we are doing so in response to our validation results, so we are actively using our knowledge of validation results to make improvements (oh no!).

However, at some point we will create a model using "all" our data and feature engineering will be conducted on "all" data. So what is the performance of that specific model, which uses those specific features, especially given that we interfered with the features and the model in this process?!?

Usually, we reserve yet another data partition for doing just that and call this "honest testing", which is no longer in the optimization / improvement loop. So it means that "all" is a relative term, excluding that "honest testing" data partition. Also, if we are dealing with millions of data points, I question the sanity of using all data to train a model. If we instead select only a good representative sample for model development, we are left with a very large data partition for multiple-sample testing, which gives a better estimate of performance for this particular model with those specific features.
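To make the two problems concrete, here is a minimal scikit-learn sketch (again not RapidMiner). The 70/30 split, the synthetic data, and `SelectKBest` as the feature selector are all my assumptions for illustration: cross-validation on the development partition evaluates the process, and a single score on the untouched partition is the "honest test" of the one model that would be deployed.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data (illustrative only).
X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)

# Set the honest-test partition aside before any experimentation starts.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),   # stand-in feature selection
    ("clf", LogisticRegression(max_iter=1000)),
])

# Problem 1: evaluate the process, using the development data only.
cv_scores = cross_val_score(pipe, X_dev, y_dev, cv=5)
print(f"CV estimate of the process: {cv_scores.mean():.3f}")

# Problem 2: build the model on "all" development data, then score it
# once on the partition that never entered the improvement loop.
final_model = pipe.fit(X_dev, y_dev)
holdout_accuracy = final_model.score(X_test, y_test)
print(f"honest test of this model:  {holdout_accuracy:.3f}")
```

The holdout score only stays honest if it is looked at once; tuning against it pulls that partition back into the optimization loop.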

Confusing? -- Jacob


Altair RapidMiner Community
