Bug report: Calibrate (Rescale Confidences (Logistic)) operator

lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
edited November 2019 in Product Feedback - Resolved
Dear all,

I would like to report a bug that occurs under certain conditions when AutoModel is executing:



You can reproduce this error by:

 - executing AutoModel with the data in the attached file,
 - setting the Classification attribute as the target variable,
 - leaving all AutoModel options at their defaults.
 

After opening the process and investigating:

 - The bug is generated by the Calibrate (Rescale Confidences (Logistic)) operator (inside the Train Model / Optimize subprocesses): when this operator is removed (along with the Split Data operator), the process works fine.
 - The bug is linked to the Train/Test split ratio (0.9/0.1). Indeed, if the ratio is set to 0.8/0.2, the process works fine.
 - The bug seems linked to the one-hot encoding of the Date attributes. Indeed, if Extract Date Information is disabled in AutoModel (so that AutoModel works with the original attributes), the process works fine.

If the bug is unavoidable under certain conditions, a possible solution may be to wrap the Calibrate operator in a Handle Exception operator.
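The Handle Exception idea amounts to a try-and-fall-back pattern. Below is a minimal Python sketch of that pattern, not RapidMiner code: `CalibrationError`, `strict_calibrator`, and `calibrate_or_passthrough` are hypothetical names introduced only for illustration, and the "calibration" itself is a placeholder.

```python
from typing import Callable, List, Sequence


class CalibrationError(Exception):
    """Hypothetical error standing in for the failure raised during calibration."""


def strict_calibrator(confidences: Sequence[float], labels: Sequence[str]) -> List[float]:
    """Toy calibrator (assumption): refuses to fit when only one class is present."""
    if len(set(labels)) < 2:
        raise CalibrationError("label column contains only one class")
    # Placeholder for a real calibration step: clip confidences to [0.05, 0.95].
    return [min(max(c, 0.05), 0.95) for c in confidences]


def calibrate_or_passthrough(
    confidences: Sequence[float],
    labels: Sequence[str],
    calibrate: Callable[[Sequence[float], Sequence[str]], List[float]],
) -> List[float]:
    """Try to calibrate; if calibration fails, return the raw confidences unchanged."""
    try:
        return calibrate(confidences, labels)
    except CalibrationError:
        return list(confidences)
```

With this wrapper, a failed calibration degrades to uncalibrated confidences instead of stopping the whole process, which is the trade-off the Handle Exception approach would accept.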

Thank you for your attention,

Regards,

Lionel



Fixed and Released · Last Updated

Known issue. Thank you for reporting. IC-1650

Comments

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,

    Could it be that either all of your predictions or all of your labels belong to one class? That is what usually causes such an error.

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi Martin,

    Thank you for your answer!
    No, that is not the case for me:
    Here are the distributions of the label values for both the training set and the test set entering the Calibrate (Rescale Confidences (Logistic)) operator.





    On the other hand, these 2 example sets have no "predictions" column...


    Regards,

    Lionel


  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    The problem is that one of the classes (in this case "Loss Goal Scored") has hardly any examples.  The logistic regression for the calibration performs a one-vs-all approach, and after validation splits etc. there are probably no examples left for that class.  That leaves only one class, "NOT Loss Goal Scored", which leads to the error.
    This error is indeed annoying, mainly because it is soooooo unnecessary.  If there is only one class?  Fine, then always predict that one with confidence 1.  Done.  This is what the Log Reg SHOULD be doing.  Instead, it tries to be "smart" and throws an error instead.
    "Why don't we fix it simply then?" you ask?  Good question.  Because we do not actually throw it.  The error is coming from the H2O library we are using to build the log reg model.  And while we in general love the lib, this "smartness" drives me nuts...  There is no good way around it.  We cannot change the library.  And all ways to capture it on our side are pretty much a hack as well...
    Anyway, if this issue becomes bigger / more frequent, we probably need to go down the hack route.  For now, the only advice I can give is to filter out those rare classes if possible.
    Sorry for the inconvenience,
    Ingo
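The one-vs-all situation described above, and the "just predict the single class with confidence 1" behaviour, can be sketched in a few lines of Python. This is an illustration only, not the H2O or RapidMiner implementation: the function names are hypothetical, and the class names "Other A" and "Other B" are invented placeholders for whatever classes remain after splitting.

```python
from typing import List, Optional, Sequence


def one_vs_all_targets(labels: Sequence[str], positive_class: str) -> List[int]:
    """Binary targets for one class: 1 for the class itself, 0 for 'NOT class'."""
    return [1 if label == positive_class else 0 for label in labels]


def degenerate_confidence(targets: Sequence[int]) -> Optional[float]:
    """Confidence for the positive class when only one class is present.

    Returns 1.0 if every target is positive, 0.0 if every target is negative
    (i.e. 'NOT class' is predicted with confidence 1), and None when both
    classes are present and a real model can be fitted.
    """
    present = set(targets)
    if present == {1}:
        return 1.0
    if present == {0}:
        return 0.0
    return None


# Hypothetical subset reaching the calibration step: the rare class is gone.
calibration_labels = ["Other A"] * 6 + ["Other B"] * 4
targets = one_vs_all_targets(calibration_labels, "Loss Goal Scored")
fallback = degenerate_confidence(targets)
```

Here `targets` contains only zeros, so `fallback` is `0.0` for "Loss Goal Scored", which is the same as always predicting "NOT Loss Goal Scored" with confidence 1 rather than raising an error.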
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi Ingo,

    Thanks for your answer.
    OK, I now understand the problem and your position.

    "For now, the only advice I can give is to filter out those rare classes if possible."

    What do you think about this alternative strategy for handling "rare classes":
    Use the Replace Rare Values operator to "group" the "rare classes" into a bigger class. This avoids "losing" the information contained in the rare values:

    Here is a (fictitious) example of such a strategy:



    Regards,

    Lionel
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Yes, you could do that before you start modeling (it may be a good idea anyway to help get a better model).  We cannot really do this automatically though (like grouping too-small groups together until we hit a reasonable minimum number), since this might not be the best option from a use case perspective.
    Hope this helps,
    Ingo
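As a rough illustration of grouping rare classes before modeling, here is a minimal Python sketch. It does not reproduce the Replace Rare Values operator or its parameters; `group_rare_values`, the `min_count` threshold, and the replacement name "Other" are assumptions made for the example.

```python
from collections import Counter
from typing import List, Sequence


def group_rare_values(values: Sequence[str], min_count: int, other: str = "Other") -> List[str]:
    """Replace every value that occurs fewer than min_count times with `other`."""
    counts = Counter(values)
    return [value if counts[value] >= min_count else other for value in values]


# Hypothetical label column: "B" and "C" are too rare to survive the splits.
labels = ["A"] * 5 + ["B"] * 2 + ["C"] * 1
grouped = group_rare_values(labels, min_count=3)
```

After grouping, the three rare examples share one class, so none of them is dropped and the merged class is less likely to vanish after the train/test and validation splits. Choosing the threshold remains a use-case decision, as noted above.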
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    edited November 2019
    Dear all,

    I just wanted to report this bug with another dataset. In this case, however, the label is binary and balanced (there are no rare values in the label):



    You can also notice that, in this dataset, the polynominal regular attributes are imbalanced but NOT highly imbalanced...

    The error occurs with the Naive Bayes model and you have to enable FEATURE SELECTION and FEATURE GENERATION in AutoModel.

    Regards,

    Lionel

    EDIT : I forgot to attach the data...

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Thanks, that indeed looks suspicious then.  We will have a look into this asap.
  • tkenez Employee, RapidMiner Certified Expert, Member Posts: 22 RM Product Management
    Hey all,

    I want to provide an update here: with the soon-to-be released 9.7 version of the product, this issue will be fixed.
    Some background: the H2O model used in the AutoModel process couldn't handle classification problems with only one class in the data. In newer versions of H2O, there's a parameter which we can use to override this behavior and make it work nicer with how AutoModel is built.
    As we updated H2O to the latest stable version, we could now set this properly.

    @lionelderkrikor I encourage you to give this a go with 9.7 and tell us how it's going.

    Regards,
    Tamas
