Binning by entropy -- inner logic

kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn
edited December 2019 in Help
Hi miners, 

I need to understand the inner logic of the 'Binning by Entropy' operator (I do understand the standalone algorithm itself). It seems to me that in many cases it tries to minimize the final number of bins, which results in a maximum of 2 bins for most variables in certain datasets. This may often be appropriate, but just as often it is not granular enough.

Think of customer age in credit risk applications. Traditionally, the relationship is that the younger the customer, the riskier they are, with a slight upward trend in the oldest age group as well. Technically, 2 bins may be the minimum that works here, but such binning does not capture how risk is distributed across more granular age groups. With weight of evidence binning, in many cases we see distributions like this (here the blue trend goes steadily down through the age groups, so it could easily be represented by a minimum of 2 bins):

Do I understand correctly that this is how the operator actually works, trying to minimise the number of bins? Could future versions offer more control over parameters, such as specifying a desired minimum number of bins?

Also, a side question: has anyone heard of an implementation of weight of evidence / information value algorithms and binning for RM?

Many thanks.


  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    While I haven't inspected the operator code directly, binning by entropy would typically use an underlying algorithm that adds a penalty for each additional bin to prevent over-specification. So it is not directly minimizing the number of bins but rather avoiding an excessive number of bins when the additional reduction in entropy (the information gain) is not worth it. The help topic for this operator is unfortunately not more explicit about the function used, although it references two academic papers that might have more detail.
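
To make the "penalty per additional bin" idea concrete, here is a minimal Python sketch of one common choice, the Fayyad-Irani MDL stopping criterion. This is an assumption for illustration only: nothing in this thread confirms that the operator uses this exact rule, and the function names (`entropy`, `best_split`, `mdl_accepts`, `entropy_cut_points`) are hypothetical, not part of RapidMiner.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (index, weighted_entropy) of the lowest-entropy boundary.

    `values` must be sorted; boundaries are only tried between distinct values.
    Returns None when no boundary exists.
    """
    n = len(values)
    best = None
    for i in range(1, n):
        if values[i] == values[i - 1]:
            continue
        left, right = labels[:i], labels[i:]
        e = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if best is None or e < best[1]:
            best = (i, e)
    return best


def mdl_accepts(labels, i, split_entropy):
    """Fayyad-Irani test: keep the split only if the gain exceeds the MDL cost."""
    n = len(labels)
    left, right = labels[:i], labels[i:]
    gain = entropy(labels) - split_entropy
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right)
    )
    return gain > (math.log2(n - 1) + delta) / n


def entropy_cut_points(values, labels):
    """Recursively split while the MDL criterion accepts; return sorted cut points."""
    pairs = sorted(zip(values, labels))
    vals = [v for v, _ in pairs]
    labs = [l for _, l in pairs]
    cuts = []

    def recurse(lo, hi):
        sub_v, sub_l = vals[lo:hi], labs[lo:hi]
        found = best_split(sub_v, sub_l)
        if found is None:
            return
        i, e = found
        if not mdl_accepts(sub_l, i, e):
            return  # the extra bin does not pay for itself
        cuts.append((sub_v[i - 1] + sub_v[i]) / 2)
        recurse(lo, lo + i)
        recurse(lo + i, hi)

    recurse(0, len(vals))
    return sorted(cuts)
```

Under this rule, a variable whose relationship to the label is weak or only gradually trending can end up with very few cut points, or none at all, because each candidate split has to beat a threshold that grows with the cost of describing the extra bin. That would be consistent with the behaviour described in the question, though again this is only a sketch of the general technique.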

    By the way, I completely second the idea of getting an operator to calculate WoE or IV and return that explicitly! That would be quite helpful. There is an operator that I typically use as a proxy because its output correlates highly with IV. Although it doesn't output the information value directly, you can use the Weight by Information Gain operator to find the relative magnitudes with pretty good reliability, I think.
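
For reference, the standard WoE and IV calculation is short enough to sketch outside RapidMiner. The Python below is a hedged illustration, not an existing operator: the function name `woe_iv` and the input layout (one tuple of bin name, good count, bad count per bin) are assumptions, and it uses the ln(%good / %bad) sign convention, which some references reverse. It also assumes every bin contains at least one good and one bad case; otherwise some form of smoothing would be needed.

```python
import math


def woe_iv(bins):
    """Compute weight of evidence per bin and the total information value.

    `bins` is a list of (name, n_good, n_bad) tuples (hypothetical layout).
    Returns (rows, iv) where rows is a list of (name, woe, iv_contribution).
    """
    total_good = sum(good for _, good, _ in bins)
    total_bad = sum(bad for _, _, bad in bins)
    rows = []
    iv = 0.0
    for name, good, bad in bins:
        if good == 0 or bad == 0:
            # WoE is undefined for empty cells; the caller must merge or smooth.
            raise ValueError(f"bin {name!r} has a zero count")
        pct_good = good / total_good
        pct_bad = bad / total_bad
        woe = math.log(pct_good / pct_bad)
        contribution = (pct_good - pct_bad) * woe
        iv += contribution
        rows.append((name, woe, contribution))
    return rows, iv


# Made-up counts, purely to show the call shape.
rows, iv = woe_iv([("young", 100, 50), ("old", 300, 50)])
```

With these made-up counts the "young" bin gets a negative WoE (it holds a larger share of the bads than of the goods) and the "old" bin a positive one, which matches the age example in the question.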

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts