FeatureSelect/RemoveUselessFeatures and Holdout Sets
stereotaxon
Hi,
I have a large number of attributes and I want to run them through various feature selection routines.
My experiment is set up to do cross-validation on a training set and then I read in my holdout set and apply the model to those items.
This doesn't work when I run the feature selection algorithms though. It appears that I have to have the same dataset structure in the holdout dataset as in the training dataset. Since the variables are being deleted automatically, I don't know how to get my datasets to match.
Am I doing this wrong?
Is there a way to read in a train UNION holdout dataset, filter the holdout cases, do my model fitting, then filter the training cases and apply my model?
Thanks for your help.
Mike
Answers
Unfortunately, I did not quite understand exactly what you want to do or in which order you want to perform the steps you mentioned. Do you want to do a cross-validation on the training set and then a validation on the holdout set? If so, why do you want to do the cross-validation at all? Or do you want to run a cross-validation inside the feature selection to determine the best features, learn a model, and then test that model on the holdout set? Maybe you can clear up my confusion by posting a sample process XML?
Anyway, regarding the different data structures of the training set and the holdout set when a feature selection runs during training: every feature selection scheme outputs an AttributeWeights object that records which attributes were selected and which were deselected. You can store this AttributeWeights object, load it again later, and use the AttributeWeightsSelection operator to select the features of the holdout set according to the specification in the AttributeWeights object.
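The idea of reusing a stored selection on the holdout set can be sketched outside RapidMiner. The Python snippet below is only an illustration and not RapidMiner's API: the `weights` dictionary is a hypothetical stand-in for an AttributeWeights object (assuming a weight of 0 means "deselected"), and the attribute names and values are made up.

```python
# Hypothetical stand-in for a stored AttributeWeights object:
# attribute name -> weight, where 0.0 is assumed to mean "deselected".
weights = {"Var1": 1.0, "Var2": 0.0, "Var3": 1.0, "Var4": 0.0}


def select_by_weights(rows, weights):
    """Keep only the attributes with a non-zero weight, matching by name."""
    selected = [name for name, weight in weights.items() if weight != 0.0]
    return [{name: row[name] for name in selected} for row in rows]


# Made-up holdout rows that still contain every original attribute.
holdout = [
    {"Var1": 0.319, "Var2": 0.406, "Var3": 19.104, "Var4": 2.5},
    {"Var1": 0.101, "Var2": 0.777, "Var3": 12.000, "Var4": 3.1},
]

# The holdout set now has the same structure as the reduced training set.
reduced_holdout = select_by_weights(holdout, weights)
print(reduced_holdout[0])
```

Because the selection is keyed by attribute name, it does not matter in which order the holdout columns arrive.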
Hope that helps. If not, please try to explain your procedure in a little more detail or post your process XML.
Regards,
Tobias
Thanks for your help. My goal in all of this is that I (now) just want to use feature selection, fit a model to the reduced dataset, apply that model to a holdout set, and write the predictions to a file. It's not working, though. The problem I'm having is that RapidMiner is using the wrong variables when applying a model after FeatureSelection. I suspect it's applying the model by attribute order as opposed to by name?
For example, after feature selection, I have three variables. If I apply a linear regression model that uses all of the variables, the model applier works.
        Var1      Var2      Var3      Intercept
Value   0.319     0.406     19.104
Coeff   0.868     -0.824    0.722
V*C     0.277     -0.335    13.787    -9.924     = 3.805  <-- prediction
However, when I use a learner such as W-SimpleLinearRegression that will produce a model with only 1 variable, my predictions are incorrect.
For example, the input values are the same, but coeff[3] and the intercept have changed, so I should get the prediction of 6.517.
        Var1      Var2      Var3      Intercept
Value   0.319     0.406     19.104
Coeff                       0.930
V*C                         17.767    -11.250    = 6.517  <-- correct prediction
But that's not what I'm getting. It seems that RM is using Var1's value of 0.319 instead of Var3's value of 19.104 when applying the model.
        Var1      Var2      Var3      Intercept
Value   0.319     0.406     19.104
Coeff                       0.930
V*C     0.297                         -11.250    = -10.953  <-- what I'm getting
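The two results can be reproduced with a small sketch. The Python below is only an illustration of the suspected by-order versus by-name lookup, using the values from the tables above; the function names are hypothetical, and it does not confirm what RapidMiner does internally.

```python
# Values, coefficient, and intercept copied from the tables above.
row = {"Var1": 0.319, "Var2": 0.406, "Var3": 19.104}
coefficient = 0.930   # the single coefficient, which belongs to Var3
intercept = -11.250
attribute_order = ["Var1", "Var2", "Var3"]


def predict_by_name(row, name):
    """Look up the model's attribute by its name."""
    return row[name] * coefficient + intercept


def predict_by_position(row, index):
    """Look up the model's attribute by its position in the example set."""
    return row[attribute_order[index]] * coefficient + intercept


# A one-attribute model keyed by name reads Var3.
by_name = predict_by_name(row, "Var3")
# The same model keyed by position 0 reads Var1 instead.
by_position = predict_by_position(row, 0)

print(round(by_name, 3))      # 6.517
print(round(by_position, 3))  # -10.953
```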
So, to summarize: FeatureSelection seems to be confusing the model applier, and me. The model applier doesn't seem to be using the right variable when making predictions.
What am I doing wrong?
Thanks,
Mike
Cheers,
Ingo