How does RM Decision Tree algorithm choose between two attributes that produce equally good splits?
It seems possible that a DT algorithm will find cases that are split equally well into two different groups (branches) by say two different case attributes. How does it choose one attribute to use, in these circumstances? And what can we say about the likely structure of the tree that would have been made had the neglected attribute been used? I have not seen anything like this discussed in the literature on DTs
Best Answer
-
IngoRM Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
Well you could do this. However, I just do not see a chance for a massive improvement here to be honest. See my answer from before - I don't think that there really is a huge risk given that the whole tree is built based on choices which might be sub-optimal...
0
Answers
Hi,
looking at the code, I can say that in such a case, we simply use whichever attribute comes first in the data.
Regards,
Marco
...which is what some others have told me.
In which case, how do we know that the other attribute, if it was used instead, does not lead on to a better performing DT structure?
You could re-order the attributes and test it?
Interesting topic indeed. We do not know if the other attribute would not lead to a better DT. But having said this, this is actually true for all of the splits. Keep in mind that picking the one with the biggest information gain is a heuristic as well. There is no guarantee that a pick is leading to an optimal DT (optimal = best predictive power). So it could easily be that making a different pick at each split, let's say the one with the 2nd or 3rd best information gain, might lead to a more accurate tree. But the only way to find this out is to go bruteforce through all options which is not computationally feasible. And even then you might overfit to your test set you use to find out which split is the best.
So to summarize: using information gain (and the other approaches) is a heuristic itself and does not guarantee an optimal tree to begin with. Making an arbitrary split between two good attributes does not make this heuristic much more, well, "heuristic-y" in my opinion ;-)
Cheers,
Ingo
I could do this, where I know that two attributes have been equal contenders
But how do I find these in the first instance? Or otherwise be confident they dont exist?
Perhaps for binary data, some sort of Hamming distance measure of similarity between all attributes could point to where there is such a risk?
Would some sort of "prior to DT analysis" feature selection process that maximised diversity among the attributes be a best bet against such a risk?
If this is the case then BigML should signal to the users when two or more attributes perform equally well, so they can explore what happens when one of these is deleted from the data set