
Gain Ratio vs. Gini

ollestrat Member Posts: 9 Contributor II
Hello all,

In the algorithm description "Weight by Tree Importance" I found a remark concerning the split criterion of trees:
This algorithm is implemented following the idea from "A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data" by Menze, Bjoern H., et al. (2009). It has been extended by additional criteria for computing the benefit created from a certain split. The original paper only mentioned the Gini Index; this operator additionally supports the more reliable criteria Information Gain and Information Gain Ratio.
Is there any theoretical (or maybe empirical) background as to what extent Gain Ratio could be superior to the Gini Index? This would be very interesting, because I did indeed see better performance with Gain Ratio compared to a setup with the Gini Index, but I have no idea why, or whether it is statistically significant. Maybe whoever wrote this algorithm description/wiki entry can point me to a reference paper or the basic theory behind this assumption?

Ole

Answers

  • steffen Member Posts: 347 Maven
    Hello ollestrat

    This is indeed a very interesting question. Here are my two cents:

    information gain vs gini index
    Given how both values are calculated (see e.g. here), the difference should be unimportant. This paper indeed states in its conclusions (I have not read the whole text) that the two measures disagree only about 2% of the time.
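
    As a small numeric illustration (a minimal plain-Python sketch over frequency counts, not RapidMiner's actual implementation), here is how the benefit of a split comes out under entropy (information gain) and under Gini; on most splits the two criteria rank the candidates the same way:

        import math

        def entropy(counts):
            """Shannon entropy (in bits) of a class-count distribution."""
            total = sum(counts)
            return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

        def gini(counts):
            """Gini impurity of a class-count distribution."""
            total = sum(counts)
            return 1.0 - sum((c / total) ** 2 for c in counts)

        def split_benefit(parent, children, impurity):
            """Impurity of the parent minus the size-weighted impurity of the children."""
            n = sum(parent)
            return impurity(parent) - sum(sum(ch) / n * impurity(ch) for ch in children)

        parent = [10, 10]               # 10 positives, 10 negatives
        split_a = [[9, 2], [1, 8]]      # a fairly pure two-way split
        split_b = [[6, 4], [4, 6]]      # a weak two-way split

        for name, split in [("A", split_a), ("B", split_b)]:
            print(name,
                  round(split_benefit(parent, split, entropy), 3),  # information gain
                  round(split_benefit(parent, split, gini), 3))     # Gini decrease

    Both criteria clearly prefer split A over split B here; the disagreements the paper counts are the rare borderline cases.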

    information gain ratio vs gini index
    Information Gain Ratio (for the formula see e.g. here: http://en.wikipedia.org/wiki/Information_gain_ratio) adds another factor to penalize attributes with too many different values (in the case of discrete attributes). Too many values mean that the sample behind every entropy calculation is rather small (on average). Which do you trust more: 1/2 or 5/10?
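
    To see the penalty at work, here is a minimal sketch (again plain Python with made-up counts, not the operator's code) of the gain ratio: information gain divided by the "split information", i.e. the entropy of the subset sizes. An ID-like attribute that shatters the data into many tiny subsets gets a perfect gain but a large split information, and therefore only a modest ratio:

        import math

        def entropy(counts):
            total = sum(counts)
            return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

        def gain_ratio(parent, children):
            """Information gain divided by the entropy of the partition sizes."""
            n = sum(parent)
            gain = entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)
            split_info = entropy([sum(ch) for ch in children])
            return gain / split_info if split_info > 0 else 0.0

        parent = [10, 10]
        two_way = [[9, 2], [1, 8]]          # 2 values, large subsets
        many_way = [[1, 0], [0, 1]] * 10    # 20 values, one example each (ID-like)

        print(gain_ratio(parent, two_way))   # decent gain, barely penalized
        print(gain_ratio(parent, many_way))  # perfect gain, heavily penalized

    The ID-like attribute separates the classes perfectly, yet ends up with the lower gain ratio.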

    So the remaining interesting question is whether a Gini index ratio would perform as well as the information gain ratio. I am pretty sure that a paper about that has already been written and forgotten ;).

    regards,

    steffen

     
  • rakirk Member Posts: 29 Contributor II
    I was going to post a question about this, but I figured this thread was very similar and both of you may have a good answer.

    You talk about gain ratio. I'm interested in somehow measuring information compression. Have you found a way to do this in RM?
  • steffen Member Posts: 347 Maven
    Hello rakirk

    "Information compression"? Information Gain (Ratio) is such a measure.
    See for example: http://www.autonlab.org/tutorials/infogain11.pdf

    if this is not what you are looking for, please be more specific ;)

    regards,

    steffen
  • rakirk Member Posts: 29 Contributor II
    Hi Steffen,

    Thanks, and yes, you are right that IG is a compression measure, and it is useful for classification.

    I guess I need to be more specific: I am interested in calculating the level of compression of a classifier on data (i.e. even though information gain as a learning approach would lead to higher compression, I'd like to find a way to quantify the difference between, say, NBN and IG on a model in terms of compression). Maybe this isn't done a lot, but I have been reading papers about how compression is a good measure of real-world efficiency when compared to accuracy. Is there a way to do this with the performance operator somehow?

  • steffen Member Posts: 347 Maven
    Hi rakirk

    What is NBN? :)

    Additionally, I do not quite understand how you define "compression". Here are some related concepts: in the sense of the second concept, one model is better than another if its complexity is smaller (so that the danger of overfitting is reduced) while its performance is comparable. I have always used this concept as a rule of thumb, but never actually calculated it. It has never played a role for me, but this could be a sign that I am mining data in the wrong parts of the world, or that I am simply doing it wrong ;).

    Regarding your second question: as far as I know, this is not possible within RapidMiner. You can imagine that calculating such a measure is not trivial (operators could implement an interface where they specify their complexity, but what do you do when different optimization techniques come into play?).

    hope this was helpful,

    steffen
  • wessel Member Posts: 537 Maven
    Hey,

    In WEKA there is something like Serialized_Train_Set_Size and Serialized_Model_Size; it might be possible to use these as a description-length measure.

    The latest version of the WEKA C4.5 decision tree uses MDL to select the best tree, which trades off model complexity and accuracy.
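
    Just to make that trade-off concrete: such an MDL score is essentially a two-part description length, bits to encode the tree plus bits to encode the training examples it gets wrong. A rough sketch (not WEKA's actual coding scheme; the bit counts are made up for illustration):

        import math

        def two_part_mdl(model_bits, n_examples, n_errors):
            """Two-part description length: model cost plus the cost of naming
            which of the n_examples the model misclassifies."""
            error_bits = math.log2(math.comb(n_examples, n_errors)) if n_errors else 0.0
            return model_bits + error_bits

        # A small tree with a few errors vs. a large tree that fits the data perfectly:
        print(two_part_mdl(model_bits=50, n_examples=1000, n_errors=30))
        print(two_part_mdl(model_bits=800, n_examples=1000, n_errors=0))

    Under this score the smaller tree wins even though it makes mistakes, which is exactly the complexity/accuracy trade-off.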

    Best regards,

    Wessel
  • rakirk Member Posts: 29 Contributor II
    Hi wessel, thank you for the suggestion; I actually plan on using this for a decision tree problem I have been having. I will let you know how well it works.

    @steffen: sorry for the confusion in my previous question. I have found the answer to the question I had in my head, though I did not express it well in words. I was looking for the technique for measuring compression: KL divergence.
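
    For the record, KL divergence D(p || q) can be read as the average number of extra bits paid when data distributed as p is encoded with a code built for q, which is why it shows up as a compression-style measure. A minimal sketch with made-up distributions (it assumes q is non-zero wherever p is):

        import math

        def kl_divergence(p, q):
            """D(p || q) in bits for two discrete distributions over the same outcomes."""
            return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

        true_dist = [0.7, 0.2, 0.1]    # empirical class distribution
        model_dist = [0.5, 0.3, 0.2]   # distribution predicted by some classifier
        print(kl_divergence(true_dist, model_dist))  # >= 0, and 0 only when the two match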

    regards,
    rk