Information Gain Vs Gain Ratio

oraware Member Posts: 2 Contributor I
edited November 2018 in Help
Hi Guys,

I am working through a classification decision tree model in RapidMiner and am stuck on an experiment that calculates information gain and gain ratio, after reading the following descriptions.

Information gain : It works fine for most cases, unless you have a few variables that have a large number of values (or classes).
Information gain is biased towards choosing attributes with a large number of values as root nodes.

Gain ratio : This is a modification of information gain that reduces its bias and is usually the best option. Gain ratio overcomes
the problem with information gain by taking into account the number of branches that would result before making the split.
It corrects information gain by taking the intrinsic information of a split into account.
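For reference, the descriptions above correspond to the textbook (C4.5-style) definitions sketched below. This assumes the standard formulas apply; the symbols S, A, S_v and p_c are introduced here only for illustration and are not taken from the RapidMiner documentation.

```latex
\begin{align}
H(S) &= -\sum_{c} p_c \log_2 p_c \\
\mathrm{Gain}(S, A) &= H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v) \\
\mathrm{SplitInfo}(S, A) &= -\sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} \\
\mathrm{GainRatio}(S, A) &= \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}
\end{align}
```

Here S is the set of rows, S_v is the subset of rows where attribute A takes the value v, and p_c is the fraction of rows belonging to class c. SplitInfo is what the description calls the intrinsic information of a split; in this formulation it depends on the branch sizes of the attribute being evaluated.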

When I use the RapidMiner operator "Weight by Information gain ratio" on the following sample data, the gain ratio it calculates for Outlook is quite different from my manual calculation, as shown below.
Sno   Outlook    Play
----  --------   ----------
A1    overcast   Don't Play
B2    overcast   Play
C3    rain       Play
D4    rain       Play
E5    rain       Play

The following are my calculations for the gain ratio.

Entropy for Outlook

H (Outlook = overcast) = -1/2 log2(1/2) - 1/2 log2(1/2)
                       = -0.5 (-1) - 0.5 (-1)
                       = 1

H (Outlook = rain) = -3/3 log2(3/3)
                   = -1 (0)
                   = 0
-----------------------------------------------------------------------
Information (weighted entropy) after splitting on Outlook

I (Outlook) = 2/5 * (1) + 3/5 * (0)
            = 0.4
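The 0.4 figure can be cross-checked with a few lines of Python. This is only a sketch: the helper names `entropy` and `entropy_after_split` are hypothetical, and the rows are the five sample rows from the table above.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

# Sample rows from the post: (Outlook, Play)
rows = [
    ("overcast", "Don't Play"),
    ("overcast", "Play"),
    ("rain", "Play"),
    ("rain", "Play"),
    ("rain", "Play"),
]

def entropy_after_split(rows):
    """Weighted average of the class entropy inside each branch."""
    branches = {}
    for value, label in rows:
        branches.setdefault(value, []).append(label)
    return sum(len(b) / len(rows) * entropy(b) for b in branches.values())

print(round(entropy_after_split(rows), 4))  # 0.4
```

The overcast branch contributes 2/5 * 1 and the rain branch contributes 3/5 * 0, which agrees with the hand calculation.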
-----------------------------------------------------------------------
Entropy for Sno attribute

H (A1) = -1/5 log2(1/5) = 0.464
H (B2) = -1/5 log2(1/5) = 0.464
H (C3) = -1/5 log2(1/5) = 0.464
H (D4) = -1/5 log2(1/5) = 0.464
H (E5) = -1/5 log2(1/5) = 0.464
------------------------------------------------------------------------------
Information (weighted entropy) after splitting on Sno

I (Sno) = 5 * 1/5 * (-1/1 * log2(1/1))
        = 0
------------------------------------------------------------------------------
I (Outlook, no partition)

I (Outlook, no partition) = -1/5 log2(1/5) - 4/5 log2(4/5)
                          = -0.2 * (-2.32192809) - 0.8 * (-0.321928095)
                          = 0.464385618 + 0.257542476
                          = 0.72
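As a quick sanity check of this step, the entropy of the Play column before any split can be computed directly from the class counts. A minimal sketch, with the counts (1 "Don't Play", 4 "Play") read off the sample table; the variable names are illustrative only.

```python
from math import log2

# Class distribution in the sample data: 1 "Don't Play" row and 4 "Play" rows.
counts = {"Don't Play": 1, "Play": 4}
total = sum(counts.values())

# Shannon entropy (base 2) of the class column before splitting.
entropy_before = -sum((n / total) * log2(n / total) for n in counts.values())
print(round(entropy_before, 4))  # 0.7219
```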
-----------------------------------------------------------------------------
Entropy before - Entropy after for Outlook

I (Outlook, no partition) - I (Outlook) = 0.72 - 0.4
                                        = 0.32

Entropy before - Entropy after for Sno

I (Outlook, no partition) - I (Sno) = 0.72 - 0
                                    = 0.72

------------------------------------------------------------------------------

Gain Ratio:

Intrinsic information = 5 * (-1/5 * log2(1/5))
                      = 5 * (-0.2 * (-2.32))
                      = 5 * (0.464)
                      = 2.32

Gain Ratio (Outlook) = (Entropy before - Entropy after for Outlook) / Intrinsic information
                     = 0.32 / 2.32
                     = 0.14

Gain Ratio (Sno) = (Entropy before - Entropy after for Sno) / Intrinsic information
                 = 0.72 / 2.32
                 = 0.31

The manually calculated Gain Ratio (Sno) of 0.31 matches the RapidMiner value (0.310917507 ~ 0.31) shown below, but the manually calculated Gain Ratio (Outlook) of 0.14 does not match the RapidMiner value (0.331559707 ~ 0.33).

RapidMiner gain ratio calculation

Sno 0.310917507 ~ 0.31
Outlook 0.331559707 ~ 0.33
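One possible cross-check for these two figures is sketched below. It assumes the textbook C4.5 definition, in which the intrinsic (split) information is computed separately for each attribute from that attribute's own branch sizes, rather than reusing one value for every attribute. Whether the RapidMiner operator is implemented exactly this way is an assumption; the function name `gain_ratio` is hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (base 2) of a list of values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())

def gain_ratio(attribute, labels):
    """Information gain divided by the split information of `attribute`."""
    total = len(labels)
    branches = {}
    for value, label in zip(attribute, labels):
        branches.setdefault(value, []).append(label)
    after = sum(len(b) / total * entropy(b) for b in branches.values())
    gain = entropy(labels) - after
    split_info = entropy(attribute)  # entropy of this attribute's branch sizes
    return gain / split_info

# Sample data from the post
sno = ["A1", "B2", "C3", "D4", "E5"]
outlook = ["overcast", "overcast", "rain", "rain", "rain"]
play = ["Don't Play", "Play", "Play", "Play", "Play"]

print(round(gain_ratio(sno, play), 4))      # 0.3109
print(round(gain_ratio(outlook, play), 4))  # 0.3316
```

Under this assumption the split information is log2(5) = 2.32 for Sno (five branches of one row each) but about 0.971 for Outlook (branches of two and three rows), and both results agree with the values reported above.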
Why is this so? I am using the "Weight by Information gain ratio" operator in RapidMiner.

Thanks
Sid