Auto Model and variable quality

kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn
edited June 2019 in Help
Hi there,

What is the logic behind assigning yellow / green status to variables in Auto Model? 
I just came across a situation where variables with higher stability and ID-ness are considered green, while those with lower stability / ID-ness are yellow. I would expect it to be the other way around.
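
For reference, this is roughly how I picture these two metrics myself -- a quick Python sketch of my own understanding, not necessarily how Auto Model computes them internally:

    import pandas as pd

    def column_quality(series: pd.Series) -> dict:
        """Rough approximations of the two column metrics.

        stability: share of rows taking the single most frequent value
                   (close to 1.0 -> the column is almost constant)
        id_ness:   share of distinct values among all rows
                   (close to 1.0 -> the column behaves like a row ID)
        """
        n = len(series)
        stability = series.value_counts(dropna=False).iloc[0] / n
        id_ness = series.nunique(dropna=False) / n
        return {"stability": stability, "id_ness": id_ness}

    # Example: an almost-constant column vs. an ID-like column
    df = pd.DataFrame({
        "almost_constant": ["A"] * 98 + ["B", "C"],
        "row_id": range(100),
    })
    print(column_quality(df["almost_constant"]))  # high stability, low id_ness
    print(column_quality(df["row_id"]))           # low stability, id_ness = 1.0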



Best Answer

Answers

  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn
    Thanks @IngoRM, taking correlation into account makes sense here. However, isn't a 0.01% threshold way too low? I mean, if we compare 0.01% correlation (which falls into yellow status) and 0.03%, which falls into green status, isn't that difference too subtle to count on?
  • DocMusher Member Posts: 333 Unicorn

    @kypexin @IngoRM, I think this is a good consideration. I ran into a similar "why this color?" question myself, since I considered some columns to be important prior to any modeling. I propose using some standard datasets, combined with domain expertise, to demonstrate the impact of following the full logic for all datasets. In other words, it would be nice to find examples where some pitfalls could be illustrated.

    This is a question that comes up from the audience when the steps of Auto Model are demonstrated.

    Cheers

    Sven


  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi folks,
    Sure, if you have more data sets to show what works and what does not, we would love to improve the thresholds. 
    > However, isn't a 0.01% threshold way too low? I mean, if we compare 0.01% correlation (which falls into yellow status) and 0.03%, which falls into green status, isn't that difference too subtle to count on?
    To be honest, both are unlikely to be great predictors, and users can always override this.  Also keep in mind that yellow columns are still turned on by default.  Yellow is really more of a warning sign / hint to look into the column, while for the green ones there is not much to worry about; keeping them in and letting the ML method deal with them is generally better (see the toy sketch of such a rule at the end of this post).
    > This is a question that comes up from the audience when the steps of Auto Model are demonstrated.
    Makes sense.  I would turn it around, though, and make the point that it is a strength of this approach: we make a recommendation here and keep the user in the loop to make the final decision.
    Also, I want to make clear that I am not arguing here.  I just wanted to make the point that the traffic lights are guidance, nothing more.  Users should always think about those suggestions and take their domain knowledge into account to make the final call.  This is actually why I like this overview table so much: out of a hundred columns, I can quickly focus on the most important ones where human intervention is most justified.
    But again, if you guys have data sets where those recommendations utterly fail, please let us know or share them if possible.  Of course we try to use thresholds which work well for the vast majority of data sets (and we have already looked into the values for a couple dozen data sets), but the more data sets we consider the better.
    Cheers,
    Ingo
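    P.S. To make the "guidance, not rules" point a bit more concrete, here is a minimal toy sketch of what such a traffic-light rule could look like. All thresholds in it are placeholders for illustration only, not the exact values Auto Model uses internally:

        def traffic_light(correlation: float, id_ness: float, stability: float) -> str:
            """Toy traffic-light rule for a single column.

            The thresholds below are illustrative placeholders, not the
            exact values used inside Auto Model.  A "yellow" column is
            still switched on by default -- it is just a hint that the
            user should take a closer look.
            """
            # Near-zero correlation with the label: probably a weak predictor.
            if correlation < 0.0002:   # e.g. below 0.02%
                return "yellow"
            # Almost every row has its own value: the column behaves like an ID.
            if id_ness > 0.9:
                return "yellow"
            # Almost every row has the same value: the column is nearly constant.
            if stability > 0.9:
                return "yellow"
            return "green"

        print(traffic_light(correlation=0.0001, id_ness=0.3, stability=0.5))  # yellow
        print(traffic_light(correlation=0.0003, id_ness=0.3, stability=0.5))  # green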
  • DocMusher Member Posts: 333 Unicorn
    Dear RM friends,
    Constructive and realistic feedback. Balancing between Auto Model and #noblackboxes is essential, and "the traffic lights are guidance" is the answer.
    Cheers
    Sven