Information Gain Vs Gain Ratio
Hi Guys,
I am working through a classification decision tree model in RapidMiner and am stuck on an experiment with the information gain and gain ratio calculations, after reading the following descriptions.
Information gain: It works fine for most cases, unless you have a few variables that have a large number of values (or classes).
Information gain is biased towards choosing attributes with a large number of values as root nodes.
Gain ratio: This is a modification of information gain that reduces its bias and is usually the best option. Gain ratio overcomes the problem with information gain by taking into account the number of branches that would result before making the split.
It corrects information gain by taking the intrinsic information of a split into account.
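For reference, the usual textbook (C4.5-style) definitions behind these descriptions can be sketched as follows. Whether the RapidMiner operator implements exactly these formulas is an assumption, and the symbols S (the example set), A (an attribute), S_v (the examples where A takes value v) and p_c (the fraction of examples in class c) are introduced here only for illustration.

```latex
\begin{align}
H(S) &= -\sum_{c} p_c \log_2 p_c \\
\mathrm{Gain}(S, A) &= H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v) \\
\mathrm{SplitInfo}(S, A) &= -\sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} \\
\mathrm{GainRatio}(S, A) &= \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}
\end{align}
```

In this formulation the split (intrinsic) information depends on the attribute A being evaluated, so it generally differs from one attribute to the next.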
When I use the RapidMiner operator "Weight by Information gain ratio" on the following sample data, the gain ratio it calculates for Outlook is quite different from my manual calculation, as shown below.
Sno   Outlook    Play
----  --------   ----------
A1    overcast   Dont Play
B2    overcast   Play
C3    rain       Play
D4    rain       Play
E5    rain       Play

My manual Gain Ratio (Sno) of 0.31 matches RapidMiner's Gain Ratio (Sno) of 0.310917507 ~ 0.31, but my manual Gain Ratio (Outlook) of 0.13 does not match RapidMiner's Gain Ratio (Outlook) of 0.331559707 ~ 0.33.
The following are my calculations for gain ratio.
Entropy for Outlook
H (Outlook = overcast)
= -1/2 log2 (1/2) - 1/2 log2 (1/2)
= -0.5 (-1) - 0.5 (-1)
= 1
H (Outlook = rain)
= -3/3 log2 (3/3)
= -1 (0)
= 0
-----------------------------------------------------------------------
Entropy after splitting on Outlook
I (Outlook) = 2/5*(1) + 3/5*(0)
=0.4
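The weighted entropy above can be cross-checked with a short script. This is only a minimal sketch: the helper names `entropy` and `entropy_after_split` are hypothetical, and the rows are the five examples from the sample table.

```python
from collections import Counter
from math import log2

# Sample data from the post: (Sno, Outlook, Play)
rows = [
    ("A1", "overcast", "Dont Play"),
    ("B2", "overcast", "Play"),
    ("C3", "rain", "Play"),
    ("D4", "rain", "Play"),
    ("E5", "rain", "Play"),
]

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def entropy_after_split(rows, attr_index, label_index=2):
    """Weighted average of the label entropy inside each branch of the split."""
    branches = {}
    for row in rows:
        branches.setdefault(row[attr_index], []).append(row[label_index])
    return sum(len(b) / len(rows) * entropy(b) for b in branches.values())

# Outlook is column 1: 2/5 * H(overcast) + 3/5 * H(rain)
outlook_after = entropy_after_split(rows, attr_index=1)
print(round(outlook_after, 2))
```

The script prints 0.4, which agrees with the manual value for I (Outlook).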
-----------------------------------------------------------------------
Entropy terms for the Sno attribute
H (A1) = -1/5 log2 (1/5) = 0.464
H (B2) = -1/5 log2 (1/5) = 0.464
H (C3) = -1/5 log2 (1/5) = 0.464
H (D4) = -1/5 log2 (1/5) = 0.464
Hence
H (E5) = 0.464
------------------------------------------------------------------------------
Entropy after splitting on Sno
I (Sno)
= 1/5*(-1/1*log2(1/1)) + 1/5*(-1/1*log2(1/1)) + 1/5*(-1/1*log2(1/1)) + 1/5*(-1/1*log2(1/1)) + 1/5*(-1/1*log2(1/1))
=0
------------------------------------------------------------------------------
Entropy of Play before any split, I (Outlook, no partition)
I (Outlook, no partition) = -1/5 log2 (1/5) - 4/5 log2 (4/5)
= -0.2*(-2.32192809) - 0.8*(-0.321928095)
= 0.464385618 + 0.257542476
=0.72
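The entropy before any split depends only on the Play column (one "Dont Play" and four "Play"). A minimal sketch to confirm the figure, with a hypothetical helper name `entropy`:

```python
from collections import Counter
from math import log2

# Class labels from the sample table, in row order A1..E5
play = ["Dont Play", "Play", "Play", "Play", "Play"]

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

entropy_before = entropy(play)
print(round(entropy_before, 2))
```

This prints 0.72, the same value as the manual calculation.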
-----------------------------------------------------------------------------
Information gain (entropy before - entropy after) for Outlook
I (Outlook, no partition) - I (Outlook) = 0.72 - 0.4
= 0.32
Information gain (entropy before - entropy after) for Sno
I (Outlook, no partition) - I (Sno) = 0.72 - 0
= 0.72
------------------------------------------------------------------------------
Gain Ratio:
Intrinsic information = 5*(-1/5*log2(1/5))
= 5*(-0.2*(-2.32))
= 5*(0.464)
= 2.32
Gain Ratio (Outlook) = Information gain (Outlook) / Intrinsic information
= 0.32/2.32
= 0.13
Gain Ratio (Sno) = Information gain (Sno) / Intrinsic information
= 0.72/2.32
= 0.31
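One way to cross-check both ratios is to compute the intrinsic (split) information separately for each attribute, as the C4.5-style definition of gain ratio does. The sketch below assumes that definition; it is not confirmed to be what the "Weight by Information gain ratio" operator does internally, and the helper names `entropy` and `gain_ratio` are hypothetical.

```python
from collections import Counter
from math import log2

# Sample data from the post: (Sno, Outlook, Play)
rows = [
    ("A1", "overcast", "Dont Play"),
    ("B2", "overcast", "Play"),
    ("C3", "rain", "Play"),
    ("D4", "rain", "Play"),
    ("E5", "rain", "Play"),
]

def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())

def gain_ratio(rows, attr_index, label_index=2):
    """Information gain divided by the split information of the same attribute."""
    labels = [r[label_index] for r in rows]
    attr_values = [r[attr_index] for r in rows]
    branches = {}
    for value, label in zip(attr_values, labels):
        branches.setdefault(value, []).append(label)
    after = sum(len(b) / len(rows) * entropy(b) for b in branches.values())
    gain = entropy(labels) - after
    # Assumption: intrinsic information is the entropy of the attribute's own
    # value distribution, so it is about 2.32 for Sno but about 0.97 for Outlook.
    split_info = entropy(attr_values)
    return gain / split_info

ratio_sno = gain_ratio(rows, attr_index=0)
ratio_outlook = gain_ratio(rows, attr_index=1)
print(f"Sno     {ratio_sno:.4f}")
print(f"Outlook {ratio_outlook:.4f}")
```

Under these assumptions the sketch gives about 0.3109 for Sno and 0.3316 for Outlook. The manual calculation above divides both gains by 2.32, which is the intrinsic information of Sno (five branches of one example each); Outlook has two branches of sizes 2 and 3, so its intrinsic information under this definition is -2/5 log2 (2/5) - 3/5 log2 (3/5), roughly 0.97, and 0.32/0.97 is roughly 0.33.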
RapidMiner gain ratio calculation (using the "Weight by Information gain ratio" operator)
Sno 0.310917507 ~ 0.31
Outlook 0.331559707 ~ 0.33
Why is the Outlook value so different from my manual result?
Thanks
Sid