Feature Selection

npapan69 Member Posts: 12 Contributor II
Hi everyone,
It is more than clear that feature selection should take place within the cross-validation operator, in order to avoid the label leakage that occurs when it is placed outside and prior to the CV operator. My question is this: since the features selected by mRMR, for example, may differ from one CV fold to the next, which model is the one that I get on the output?
Thanks in advance

Answers

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 394   Unicorn
    Hello @npapan69,

    Answers below:
    "It is more than clear that feature selection should take place within the cross-validation operator,"
    I don't know what you are referring to. Feature selection can be put elsewhere, even in a different process. It all depends on what you are trying to achieve. Am I missing something?
    "in order to avoid leaking the labels if placed outside and prior to the CV operator."
    Would you mind sharing your XML so we can see what is happening?
    "My question is this: since the features selected by mRMR, for example, may differ from one CV fold to the next, which model is the one that I get on the output?"
    Now I get it.

    No, feature selection should be done before the cross validation process, not inside the cross validation process. What you are trying to accomplish will lead to certain example subsets having different columns, and a model that is both unpredictable and poorly trained.

    Again, would you mind sharing your XML so we can see what is happening?

    All the best,

    Rodrigo.
  • jacobcybulski Member, University Professor Posts: 83   Unicorn
    I know this has been sorted out before, so let me dig the confusion out...
    I think we have two very different problems here:
    1. evaluating a process to arrive at the best model for data;
    2. evaluating the model to be later deployed.
    I think the selected solution is looking at #1 which aims to evaluate the process capable of generating a deployable model. We believe that the resulting model will perform according to the cross-validation, and so quite correctly feature engineering should be inside the cross-validation loop. In fact, in the process of experimenting we are likely to improve the model or the selection of features - we are doing so in response to our validation results, so we are actively using our knowledge of validation results to make improvements (oh no!).
    However, at some point we will create a model using "all" our data, and feature engineering will be conducted on "all" data. So what is the performance of that specific model, which uses those specific features, especially given that we interfered with the features and the model in this process?!?
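    To make the two problems concrete, here is a minimal sketch in plain Python, not a RapidMiner process. Everything in it is an assumption for illustration: the synthetic data, the helper names, the class-mean gap used as a stand-in for mRMR, and the nearest-centroid model. It redoes feature selection on the training part of every fold (problem #1), then repeats selection and training on "all" data to get the one model that would be deployed (problem #2).

```python
import random

def make_data(n_rows=60, n_features=20, seed=0):
    """Synthetic data (hypothetical): only feature 0 carries signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_rows):
        label = i % 2                      # balanced binary label
        row = [rng.gauss(0, 1) for _ in range(n_features)]
        row[0] += 2.0 * label              # the single informative feature
        X.append(row)
        y.append(label)
    return X, y

def mean(values):
    return sum(values) / len(values)

def select_top_k(X, y, k):
    """Rank features by the gap between class means (a stand-in for mRMR)."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, lab in zip(X, y) if lab == 1]
        neg = [row[j] for row, lab in zip(X, y) if lab == 0]
        scores.append((abs(mean(pos) - mean(neg)), j))
    return sorted(j for _, j in sorted(scores, reverse=True)[:k])

def fit_centroids(X, y, features):
    """Nearest-centroid model restricted to the selected features."""
    centroids = {}
    for lab in (0, 1):
        rows = [row for row, l in zip(X, y) if l == lab]
        centroids[lab] = [mean([r[j] for r in rows]) for j in features]
    return centroids

def predict(centroids, features, row):
    def dist(lab):
        return sum((row[j] - c) ** 2 for j, c in zip(features, centroids[lab]))
    return min(centroids, key=dist)

def cross_validate(X, y, k_features=3, n_folds=5):
    """Selection happens inside the loop, on the training part only."""
    fold_features, accuracies = [], []
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != fold]
        test = [i for i in range(len(X)) if i % n_folds == fold]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        feats = select_top_k(X_tr, y_tr, k_features)
        model = fit_centroids(X_tr, y_tr, feats)
        hits = sum(predict(model, feats, X[i]) == y[i] for i in test)
        fold_features.append(feats)
        accuracies.append(hits / len(test))
    return fold_features, mean(accuracies)

X, y = make_data()
fold_features, cv_accuracy = cross_validate(X, y)

# The model that gets deployed: selection and training redone on "all" data.
final_features = select_top_k(X, y, 3)
final_model = fit_centroids(X, y, final_features)

for fold, feats in enumerate(fold_features):
    print(f"fold {fold}: selected features {feats}")
print(f"cross-validated accuracy: {cv_accuracy:.2f}")
print(f"features of the final model: {final_features}")
```

    In this sketch the fold-wise feature sets may differ from each other and from the final one. The cross-validated accuracy describes the whole procedure (select, then train); it is not a direct measurement of the final model.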
    Usually, we reserve yet another data partition for doing just that and call this "honest testing", which is no longer in the optimization / improvement loop. So "all" is a relative term that excludes the "honest testing" data partition. Also, if we are dealing with millions of data points, I question the sanity of using all the data to train a model. If we instead select only a good representative sample for model development, we are left with a very large data partition for multiple-sample testing, which gives a better estimate of performance for this particular model with those specific features.
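    A minimal sketch of that "honest testing" idea, again in plain Python and under stated assumptions: the synthetic data, the partition sizes (200 rows for development, the rest for testing), and all helper names are hypothetical. The model and its features are fixed on the development sample alone, and the remaining partition is cut into several test samples to see both the average performance and its spread.

```python
import random
import statistics

def make_data(n_rows, n_features=10, seed=1):
    """Synthetic data (hypothetical): only feature 0 carries signal."""
    rng = random.Random(seed)
    data = []
    for i in range(n_rows):
        label = i % 2
        row = [rng.gauss(0, 1) for _ in range(n_features)]
        row[0] += 2.0 * label
        data.append((row, label))
    rng.shuffle(data)
    return data

def class_means(data, j):
    """Mean of feature j for class 0 and for class 1."""
    pos = [row[j] for row, lab in data if lab == 1]
    neg = [row[j] for row, lab in data if lab == 0]
    return statistics.mean(neg), statistics.mean(pos)

def develop_model(dev, k=2):
    """Select k features and fit class centroids using the development data only."""
    n_features = len(dev[0][0])
    means = {j: class_means(dev, j) for j in range(n_features)}
    ranked = sorted(means, key=lambda j: abs(means[j][1] - means[j][0]),
                    reverse=True)
    return {j: means[j] for j in ranked[:k]}   # feature -> (mean0, mean1)

def predict(model, row):
    d0 = sum((row[j] - m0) ** 2 for j, (m0, m1) in model.items())
    d1 = sum((row[j] - m1) ** 2 for j, (m0, m1) in model.items())
    return 1 if d1 < d0 else 0

def accuracy(model, sample):
    return sum(predict(model, row) == lab for row, lab in sample) / len(sample)

data = make_data(2000)

# Split once, before any selection or tuning touches the data.
dev, holdout = data[:200], data[200:]

# All development (selection, training, any tuning) sees only `dev`.
model = develop_model(dev)

# Multiple-sample testing: score the fixed model on disjoint holdout samples.
chunks = [holdout[i:i + 200] for i in range(0, len(holdout), 200)]
scores = [accuracy(model, chunk) for chunk in chunks]

print(f"selected features: {sorted(model)}")
print(f"honest accuracy: mean {statistics.mean(scores):.3f}, "
      f"stdev {statistics.stdev(scores):.3f} over {len(scores)} samples")
```

    Because the holdout rows never influence selection or training, these scores describe this particular model with these particular features, which the cross-validation estimate alone does not.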
    Confusing? -- Jacob