[SOLVED] Feature selection WITHIN x-validation...

kasper2304 Member Posts: 28 Contributor II
edited November 2018 in Help
I have been advised that whatever feature selection step one wants to perform must be carried out WITHIN the X-Validation node. I am currently using SVM weights to filter out the less important features, in order to optimize my model and make the algorithm faster. Until now I have simply put the "Weight by SVM" node together with a "Select by Weights" node outside the X-Validation node but inside the "Optimize Parameters" node, and then optimized over the number of features and the C parameter.

If I understand the advice correctly, I need to put these nodes inside the cross-validation node, which I do not really understand...?

The case is that I have a highly unbalanced dataset, so I have to keep as many features as possible in order not to sort out any that might be relevant to the small number of positive cases I have (I am building a text mining classification model).

Any reflections would be appreciated...

Kasper

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    To get correct estimates, it is true that you should put all preprocessing steps that rely on the data inside the X-Validation. Otherwise you would be using information from the complete data set to do the feature selection, whereas in reality you should only use information from the training set.

    Before going into the details, which part do you not understand? The actual need to do the feature selection in the cross validation, or how to do it with RapidMiner?

    Best regards,
    Marius
  • kasper2304 Member Posts: 28 Contributor II
    Hi Marius.

    Once again thanks for your help.

    I think you already answered my question. The thing I did not realize was that when you keep the feature selection part outside the X-Validation node, you use weights/features based on the entire dataset. When you put it inside your X-Validation node, you use weights from your training set only and iterate over that. That makes sense to me now.

    However, it does make the process very slow (optimizing over both C for the SVM and some feature selection technique), so doing a "rough" feature selection outside the X-Validation would be necessary. I guess it makes sense to use Weight by SVM when one is using an SVM...?

    Just for your information, as you must know me and my project by now: I had a go with SVD and latent semantic indexing. As I understand it, these techniques are closely related, and SVD gave reasonable results. The Latent Semantic Indexing node I could not get to work. It gave me an error message saying something about "Attributes cannot be same names...". There are not many threads about latent semantic indexing in here, so there is not much help to get, but I will raise that in another post.

    Best
    Kasper
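
The point made in this thread, that weights computed on the complete data set leak information into the validation folds, can be illustrated with a small sketch. It is not a RapidMiner process: it is plain Python, and everything in it is an assumption chosen for illustration. The data is pure noise, the feature weight is a simple absolute difference of class means (a stand-in for "Weight by SVM"), and the learner is a nearest-centroid classifier (a stand-in for the SVM). All function names are hypothetical.

```python
import random


def make_noise_data(n_samples=40, n_features=500, seed=0):
    """Pure-noise data: no feature is truly related to the label."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]  # balanced labels, unrelated to X
    return X, y


def class_means(X, y, rows, cols):
    """Per-class mean of each column in `cols`, using only `rows`."""
    means = {}
    for label in (0, 1):
        members = [r for r in rows if y[r] == label]
        means[label] = [sum(X[r][c] for r in members) / len(members)
                        for c in cols]
    return means


def select_top_k(X, y, rows, k):
    """Weight each feature by |difference of class means| on `rows`
    (illustrative stand-in for SVM weights) and keep the k largest."""
    all_cols = list(range(len(X[0])))
    m = class_means(X, y, rows, all_cols)
    weights = [abs(m[1][c] - m[0][c]) for c in all_cols]
    return sorted(all_cols, key=lambda c: weights[c], reverse=True)[:k]


def predict(X, row, cols, means):
    """Nearest-centroid prediction (illustrative stand-in for the SVM)."""
    def dist(label):
        return sum((X[row][c] - mu) ** 2
                   for c, mu in zip(cols, means[label]))
    return 0 if dist(0) <= dist(1) else 1


def cross_validate(X, y, k, n_folds=5, select_inside=True):
    """Accuracy from n-fold CV, selecting features inside or outside the folds."""
    rows = list(range(len(X)))
    global_cols = select_top_k(X, y, rows, k)  # sees ALL rows, test rows included
    correct = 0
    for fold in range(n_folds):
        test = [r for r in rows if r % n_folds == fold]
        train = [r for r in rows if r % n_folds != fold]
        # Inside: weights come from the training rows of this fold only.
        cols = select_top_k(X, y, train, k) if select_inside else global_cols
        means = class_means(X, y, train, cols)
        correct += sum(predict(X, r, cols, means) == y[r] for r in test)
    return correct / len(rows)


if __name__ == "__main__":
    X, y = make_noise_data()
    print("selection outside CV:", cross_validate(X, y, k=10, select_inside=False))
    print("selection inside CV: ", cross_validate(X, y, k=10, select_inside=True))
```

Because the labels here are unrelated to the features, an honest accuracy estimate should sit near chance level. Selecting the features once on all rows lets the held-out rows influence which columns are kept, so that variant would be expected to report a clearly more optimistic number than the variant that selects inside each fold.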