How do I use Optimize Parameters to find a seed where the minimum class precision is maximized?

Zaitune Member Posts: 1 Newbie
I was using the Cross-Validation operator to create a Gradient Boosted Trees model on a small dataset (419 examples). I had 5 different classes and I wanted to find a seed where the class precision percentage was best distributed.
For example, in the image above I had a good average accuracy, but the class precision percentage was not evenly distributed: 90.29% for Class 2 but 78.57% for Class 3, an 11.72 percentage-point difference. When changing the seeds I found better distributions, so I decided to use the Optimize Parameters operator to find a seed where the class with the minimum precision had the highest percentage when compared to other seeds.
 
However, I can't really figure out how to make the operator look for this specific optimization criterion. Is there even a way to do so? Or is there a better method to find a good class precision distribution?
 
I'm very new to everything related to machine learning and data mining, but I need to develop a model for a project on a very tight schedule. This may not be the most effective way to do what I want, so I'm open to any new ideas.

Best Answers

  • rjones13 Member Posts: 124 Unicorn
    Solution Accepted
    Hi @Zaitune,

    Have I understood correctly that you're trying to optimize the value of "local random seed" in the Cross Validation operator? If so, we wouldn't recommend this as good practice. All this parameter does is randomize the splitting in the cross validation, and optimizing that value for results might give artificially good results.

    If you want some help on how to set up an optimization, there should be an example bundled with the install. If you look at the help panel for Optimize Parameters (Grid), there's a link to an example that optimizes an SVM model.


    In general, there are two main approaches we can take to improving our model performance.
    1. Improve the model. You could do this by trying different model types and optimizing the parameters of the model.
    2. Improve the data. This is a topic called feature engineering, where we can select the optimal subset of variables, or generate derived variables with better predictive power from the original set.
    Here are a few suggested starting videos from the RapidMiner Academy:
    I hope this helps. Any questions, please do post again.

    Best,

    Roland
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Solution Accepted
    Hi!

    Changing the random seed for distributing the examples between the test and training sets only changes the validation result; it has no relation to the real-world performance. This is just a fancy way of lying to oneself.
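To make the point about seeds concrete, here is a small simulation sketch in plain Python (not a RapidMiner process). It assumes a fixed model whose true precision for one class is 0.85 and pretends each seed just yields a different random validation draw of 80 predictions. These numbers and the names `TRUE_PRECISION`, `N_PREDICTIONS`, `N_SEEDS` and `measured_precision` are made up for illustration, and a real cross-validation also changes the training folds, which this ignores.

```python
import random
from statistics import mean

TRUE_PRECISION = 0.85  # assumed real-world precision of the (fixed) model
N_PREDICTIONS = 80     # assumed number of validation predictions for the class
N_SEEDS = 50           # how many seeds we "optimize" over


def measured_precision(seed):
    """Simulate the precision measured on one random validation draw."""
    rng = random.Random(seed)
    hits = sum(rng.random() < TRUE_PRECISION for _ in range(N_PREDICTIONS))
    return hits / N_PREDICTIONS


# The model never changes; only the seed (and so the measurement) does.
scores = {seed: measured_precision(seed) for seed in range(N_SEEDS)}
best_seed = max(scores, key=scores.get)

print(f"average over seeds: {mean(scores.values()):.3f}")
print(f"best seed ({best_seed}): {scores[best_seed]:.3f}")
print(f"true precision: {TRUE_PRECISION:.3f}")
```

The best seed's score can never be below the average over seeds, yet the model behind every score is the same one. Picking the seed with the nicest number only selects a lucky measurement.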

    As explained by Roland, a better model is robust in delivering good performance. With your multiclass classification problem and more or less balanced dataset you can optimize for accuracy - a more accurate model will also maximize the precision of most classes.
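If you want to check how accuracy and the per-class precisions relate on your own predictions, the following is a minimal sketch in plain Python. The function names `accuracy` and `class_precisions` and the tiny label lists are made up for illustration; in RapidMiner the Performance operator reports these values for you.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_precisions(y_true, y_pred):
    """Return {class: precision} for every class predicted at least once.

    Precision of a class = correct predictions of that class
                           / all predictions of that class.
    """
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    return {cls: correct[cls] / count for cls, count in predicted.items()}


# Made-up labels, purely for illustration.
y_true = ["a", "a", "a", "b", "b", "c", "c", "c"]
y_pred = ["a", "a", "b", "b", "b", "c", "c", "a"]

precisions = class_precisions(y_true, y_pred)
print("accuracy:", accuracy(y_true, y_pred))
print("per-class precision:", precisions)
print("minimum class precision:", min(precisions.values()))
```

The minimum class precision is the quantity the original question wanted to maximize. It is worth tracking as a diagnostic, but compare it across models or parameter settings rather than across seeds.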

    If you have an important class or one with a very low precision compared to others, you could add a "weight" column to your data and assign a higher weight to that class. (Then use Set Role to give the role "weight" to that column.) This will make some models try harder to correctly predict that class, but it might make the precision worse for other classes.
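As a rough sketch of what such a weight column contains, here it is in plain Python rather than as a RapidMiner process. The helper `add_weight_column`, the `label` field, and the boost factor of 2.0 are all assumptions for illustration; the right weight for your data has to be found by experiment.

```python
def add_weight_column(rows, label_key, boosted_class, boost=2.0):
    """Return copies of the rows with an extra 'weight' field.

    Examples of boosted_class get weight `boost`; all others get 1.0.
    """
    return [
        {**row, "weight": boost if row[label_key] == boosted_class else 1.0}
        for row in rows
    ]


# Made-up examples; "Class 3" stands in for the class with low precision.
rows = [
    {"id": 1, "label": "Class 2"},
    {"id": 2, "label": "Class 3"},
    {"id": 3, "label": "Class 1"},
    {"id": 4, "label": "Class 3"},
]

weighted = add_weight_column(rows, "label", "Class 3", boost=2.0)
for row in weighted:
    print(row)
```

After adding a weight like this, re-run the validation and check every class, since the gain for the boosted class may come at the expense of the others.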

    Regards,

    Balázs

