Feature Selection within CV: Which features are finally selected?

npapan69 Member Posts: 17 Maven
edited July 2019 in Help
Dear All,
Coming back to a topic that was raised in the past but, as far as I'm concerned, never clearly answered. Let's say we have 20 features A1, A2, A3, ... A20 and we perform LASSO (optimizing lambda, with alpha = 1) with a LogReg model, following the suggested best practices to avoid accidental label leakage, i.e. inside a K-fold CV operator. This runs K+1 times: K times, once per fold, plus one final time on the total data set (with no train/test split in that last run). Now let's assume that the features with non-zero coefficients differ each time (A1, A3 and A5 for fold 1; A2, A3 and A20 for fold 2; ...; A5, A12 and A15 for the whole data set). Does the final model use the features that were selected on the total data set? If so, that model's performance does not correspond to the output of the CV operator, which averages performance across all folds. Is that correct?
Many thanks in advance,
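(The setup described above can be sketched outside RapidMiner. Below is a minimal scikit-learn illustration, with made-up data, of L1-penalized logistic regression fitted inside each CV fold; the non-zero features typically differ from fold to fold, and the model fitted on the full data set can select yet another subset. The fixed `C=0.1` stands in for a tuned lambda, since `C` is 1/lambda here.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the 20-feature data set in the question
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

for k, (train_idx, _) in enumerate(StratifiedKFold(n_splits=5).split(X, y), 1):
    # penalty="l1" is the LASSO case (alpha=1); C = 1/lambda would normally
    # be optimized rather than fixed
    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    model.fit(X[train_idx], y[train_idx])
    selected = np.flatnonzero(model.coef_[0])  # non-zero-coefficient features
    print(f"fold {k}: features {selected}")

# The final model is fitted once more on ALL rows; its selected
# feature set can differ from every individual fold's.
final = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("final model:", np.flatnonzero(final.coef_[0]))
```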

Best Answers

  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    edited July 2019 Solution Accepted
    Hello @npapan69

    Yes, the final model is trained on the whole data set, and the feature selection is also done on the whole data. CV is there to check the model's performance across different subsets of the data points. If you really want to test the performance of the final, fully trained model, you can set aside a hold-out data set and apply the cross-validation's model output to it to check how the model performs on data it has never seen.

    Hope this helps

    Be Safe. Follow precautions and Maintain Social Distancing
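(The hold-out idea in the answer above can be sketched as follows, again in scikit-learn with synthetic data as a stand-in for the RapidMiner process: keep a test partition aside, cross-validate on the remainder to get a performance estimate, then score the final model, trained on all non-held-out rows, against the hold-out set.)

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Set aside 20% as a hold-out partition, untouched by CV
X_dev, X_hold, y_dev, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)

# Average performance across folds: an *estimate* for the final model
cv_estimate = cross_val_score(clf, X_dev, y_dev, cv=10).mean()

# The deliverable model, trained on all development rows, checked on hold-out
holdout_acc = clf.fit(X_dev, y_dev).score(X_hold, y_hold)
print(f"CV estimate: {cv_estimate:.3f}, hold-out accuracy: {holdout_acc:.3f}")
```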



  • npapan69 Member Posts: 17 Maven
    Thank you varunm1 for your fast response. Regarding lambda optimization: when you have an Optimize operator with your CV operator inside it, is the optimization done again for the final model, or for each fold separately? And if the latter, which lambda is the optimum, since each fold might yield a different one?
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    The model is rebuilt completely on the whole data set.  The point of cross-validation is NOT to create models / shortcuts / optimizations etc. but only to estimate how well a model built on the data will perform on unseen data points.  Please check the last paragraph in this article for a bit of discussion on this: https://community.rapidminer.com/discussion/55112/cross-validation-and-its-outputs-in-rm-studio
    Hope this helps,
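(The nested setup asked about above can be sketched like this in scikit-learn, with `C = 1/lambda` as the tuned parameter and synthetic data standing in for the real process: the optimizer sits inside the cross-validation, so each outer fold may pick a different lambda; the outer CV only estimates how well the *whole search procedure* works, and the deliverable model comes from running the search one more time on the full data set.)

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Inner loop: optimize C (= 1/lambda) for an L1-penalized LogReg
search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1, 10]},  # candidate 1/lambda values
    cv=5)

# Outer loop: estimates the performance of the search-then-fit procedure;
# each outer fold is free to settle on a different C
estimate = cross_val_score(search, X, y, cv=5).mean()

# Final model: the optimization is re-run on ALL the data
final = search.fit(X, y).best_estimator_
print(f"estimated accuracy: {estimate:.3f}, final C: {search.best_params_['C']}")
```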
  • npapan69 Member Posts: 17 Maven
    Dear IngoRM,
    Now I'm confused. Can you elaborate on the concept of using all the data to build a model that includes feature selection (for example, LASSO)? What I mean is: doing the feature selection prior to, or outside, the CV operator leads to accidental label leakage and therefore over-optimistic performance. But what about doing the feature selection inside the CV if the final model is built using all the data? Doesn't using all the data, without separating training and testing as you do for each fold, leak the labels into the feature selection?
  • npapan69 Member Posts: 17 Maven
    Fantastic, now I get it. A million thanks, Ingo.