Should I put OptimizeParameters inside XVal?

Axel Member Posts: 19 Maven
edited November 2018 in Help
Hi everybody,

I'm doing an SVM classification (inside an XValidation loop) and optimizing my kernel parameters with "Optimize Parameters".
I run the SVM classification inside XValidation to avoid overfitting my SVM model, but the Optimize Parameters operator (which sits on top of it) simply iterates over all parameter combinations and returns the best one.
Doesn't this lead to overfitting of the kernel parameters? If so, should I put Optimize Parameters inside another XValidation?

I'm asking because the results I get with RapidMiner are always slightly better than the results from the DTREG software. DTREG runs the parameter optimization inside a separate cross-validation loop, so I wonder whether I should do the same in RapidMiner.

Many thanks,

Axel

Answers

  • dragoljub Member Posts: 241 Contributor II
    The point of cross-validation is to evaluate how well the model you are building generalizes. A model is tied to the specific training data and the parameters you select prior to training. You should not optimize parameters within X-Validation, because you would only be finding good parameters for that specific subset of the data.

    Instead, perform the parameter search to get the best accuracy on ALL your data, then perform X-Validation (say, 10-fold, depending on how much data you have) to see how your model generalizes to unseen data.  ;D

    -Gagi
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    in fact there are two sides to take into account. As Gagi said, if you want to find the best parameters, you have to use the complete data, which is the setup you have right now.
    But, and this is the second side, you have to keep in mind that you might have overfitted the parameters to your training data, so the resulting performance might be too optimistic. To check this, you should put Optimize Parameters into another XValidation. This will give you rather pessimistic results, because the optimization no longer sees all the data. The difference between the two performances gives you an impression of how reliable the performance of the optimized parameters is.
    After all this, you can train the model on the complete data set using the best parameters found. This is then the best possible model.

    Greetings,
      Sebastian
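
The two setups discussed above can be sketched outside RapidMiner as follows. This is only a minimal illustration in plain Python, not RapidMiner's or DTREG's implementation. To stay dependency-free it swaps the SVM for a simple k-nearest-neighbour classifier and tunes the neighbour count instead of kernel parameters; the toy data and all function names (`make_data`, `optimize_parameters`, `nested_cv`, and so on) are assumptions made up for this example.

```python
import random

def make_data(n=120, seed=0):
    """Toy two-class data (an assumption): label 1 if x + y + noise > 0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        label = 1 if x + y + rng.gauss(0, 0.4) > 0 else 0
        data.append(((x, y), label))
    return data

def k_folds(n, n_folds):
    """Split indices 0..n-1 into n_folds disjoint (train, test) index lists."""
    splits = []
    for i in range(n_folds):
        test = list(range(i, n, n_folds))
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        splits.append((train, test))
    return splits

def knn_predict(train, point, k):
    """Majority vote among the k training rows closest to point."""
    def sq_dist(row):
        return (row[0][0] - point[0]) ** 2 + (row[0][1] - point[1]) ** 2
    nearest = sorted(train, key=sq_dist)[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def accuracy(train, test, k):
    """Fraction of test rows classified correctly by k-NN fitted on train."""
    hits = sum(knn_predict(train, p, k) == label for p, label in test)
    return hits / len(test)

def cv_accuracy(data, k, n_folds=5):
    """Plain cross-validation of one fixed parameter value."""
    scores = []
    for train_idx, test_idx in k_folds(len(data), n_folds):
        train = [data[i] for i in train_idx]
        test = [data[i] for i in test_idx]
        scores.append(accuracy(train, test, k))
    return sum(scores) / len(scores)

def optimize_parameters(data, grid, n_folds=5):
    """Try every grid value with cross-validation and return the best.
    Ties are broken in favour of the larger parameter value."""
    best_score, best_k = max((cv_accuracy(data, k, n_folds), k) for k in grid)
    return best_k, best_score

def nested_cv(data, grid, outer_folds=5, inner_folds=5):
    """Outer loop estimates performance; the parameter search only ever
    sees the outer training part, never the outer test part."""
    scores = []
    for train_idx, test_idx in k_folds(len(data), outer_folds):
        train = [data[i] for i in train_idx]
        test = [data[i] for i in test_idx]
        best_k, _ = optimize_parameters(train, grid, inner_folds)
        scores.append(accuracy(train, test, best_k))
    return sum(scores) / len(scores)

data = make_data()
grid = [1, 3, 5, 9, 15]

# Setup 1: optimization on top of a single cross-validation (all data).
best_k, optimistic = optimize_parameters(data, grid)
# Setup 2: the same optimization wrapped in another cross-validation.
honest = nested_cv(data, grid)

print("best parameter on all data:", best_k)
print("score reported by the optimization:", round(optimistic, 3))
print("score from the outer cross-validation:", round(honest, 3))
```

The first score is the one the optimization itself reports, and the second comes from the extra outer loop. Comparing the two gives the kind of reliability check described above; the final model would still be trained on all the data with `best_k`.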