Options

How to get best out of n models?

kaymankayman Member Posts: 662 Unicorn
edited November 2018 in Help

I've been training text data for classification and have 3 different models for now (SVM / LDA and Bayes) and while all 3 off them give me the same results in average there are some noticeable differences in areas where the model 'doubts' the right label to predict.

 

So I'd like to combine the actual output of all 3 of them (or even more in future) to come with a kind of 'best out of 3 solutions'

 

So if all of my models predict label_x for a given record this is an obvious winner

if 2 out of 3 predict label_x this should be the final one

If all 3 predict a different label it needs more attention / be skipped

 

Are there operators that can do this thing? For now I have a relative complex setup that does this for me also, but if there is something more structured and out of the box available it would be handy.

Best Answer

  • Options
    Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Solution Accepted

    Yes, I see the difference.  I don't think there is anything out of the box to do what you want with already trained models.  Although it should be easy enough to replicate those original models with the same parameters inside the ensemble model operators, assuming you built them originally in RapidMiner.  

    If you are committed to using your earlier models, than a manual approach with Apply Model and Generate Attributes (and maybe some looping) is probably your best bet---and it sounds like you might already be doing that.

     

     

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts

Answers

  • Options
    Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Of course, this is a textbook case for ensemble modeling.  There a multiple operators available in RapidMiner to handle this type of situation. Based on your description, your two main options here are voting and stacking. 

    Voting will quite simply use all the separate models independently and make the final determination based on majority votes (so it is helpful to have an odd number of models).  This is not a weighted vote, though, so there is some loss of precision involved.  You could separately use Generate Attributes to create a weighted vote if you want.

    The other option you have is Stacking, which actually uses one top-level ML algorithm to decide which individual models to use in different examples based on overall performance.  This is often done with a decision tree learner, although more complex schemes are also feasible.

    Both of these are available in the base operators for RapidMiner and have tutorials if you want to see the setup.  I'd encourage you to try them both out and see which one works better for your use case.

     

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Options
    earmijoearmijo Member Posts: 270 Unicorn

    And to deal with this part:

     

    "If all 3 predict a different label it needs more attention / be skipped"

     

    You could use the operator "Drop Uncertain Predictions". This operator will issue a "?" for confidences that fall below a user-supplied threshold. 

  • Options
    kaymankayman Member Posts: 662 Unicorn

    Thanks Brian,

     

    I looked at these earlier, but unless I am missing something the out of the box solutions (both vote and stacking) are meant to be used during the training process, while I have already 3 dedicated and trained models available. So in the end using these I would have one model using some mix based on my training data. It remains an option of course, but at this stage I'd prefer to keep my seperate models as it is a bit easier to tune the preparation by model.

    So is there a way to do something similar, but with already pretrained models ? So my unlabeled data would use the 3 saved models, and then see which label is provided the most? I have build a process doing exactly this, but if there are operators doing this more streamlined that would be nice. 

  • Options
    kaymankayman Member Posts: 662 Unicorn

    True, but it doesn't work for all models, as for instance SVM will provide either 1 or 0 for a given predicted label. And as I do not want to loose records in the process (since I join now 3 different model predictions in one set) I would have to assign a new label (like undefined or so) for these. It's something I could add in a later stage but first I'd like to undrstand if there are operators making it easier so I can simplify my current logic.

Sign In or Register to comment.