Due to recent updates, all users are required to create an Altair One account to login to the RapidMiner community. Click the

**Register**button to create your account using the same email that you have previously used to login to the RapidMiner community. This will ensure that any previously created content will be synced to your Altair One account. Once you login, you will be asked to provide a username that identifies you to other Community users. Email us at Community with questions.# "Decision Tree Algorithm for Imbalanced Data Sets"

spitfire_ch
Member Posts:

**38**Maven
Hi board,

a problem I face in nearly every analysis I perform (field of direct marketing) is a huge imbalance of data. Let's say, a marketing campaign was performed. Out of 10'000 customers that were addressed, only 300 reacted (if we were lucky). My job now is to find a pattern, which would help to decide which customers to address in future campaigns, so that - let's say - only 5'000 customers have to be addressed and we would still get 300 responders.

In an approach using a decision tree algorithm, one would use something like "Response: True/False" as a label. Apparently, 300 out of 10'000 is a huge imbalance of data, and so far I didn't manage to get any conclusive results from a decision tree. Mostly there is just one node that says: customer does not react One would have to use heavy over- or undersampling.

I just stumbled over a recent paper, where they have developed a new decision tree algorithm, that addresses exactly this problem. According to them, it performs better than under-/oversampling. I'll post the abstract below. The entire document can be downloaded freely from: http://www.nd.edu/~dial/papers/SDM10.pdf

It also contains the algorithms in pseudo-code. Maybe this would be an interesting addition to Rapidminder, alongside with over- and undersampling.

We propose a new decision tree algorithm, Class Confidence Proportion Decision Tree (CCPDT), which is robust and insensitive to size of classes and generates rules which are statistically significant. In order to make decision trees robust, we begin by expressing Information Gain, the metric used in C4.5, in terms of confidence of a rule. This allows us to immediately explain why Information Gain, like confidence, results in rules which are biased towards the majority class. To overcome this bias, we introduce a new measure, Class Confidence Proportion (CCP), which forms the basis of CCPDT. To generate rules which are statistically significant we design a novel and efficient top-down and bottom-up approach which uses Fisher’s exact test to prune branches of the tree which are not statistically significant. Together these two changes yield a classifier that performs statistically better than not only traditional decision trees but also trees learned from data that has been balanced by well known sampling techniques. Our claims are confirmed through extensive experiments and comparisons against C4.5, CART, HDDT and SPARCCC.

Thank you very much for considering this and best regards

Hanspeter

a problem I face in nearly every analysis I perform (field of direct marketing) is a huge imbalance of data. Let's say, a marketing campaign was performed. Out of 10'000 customers that were addressed, only 300 reacted (if we were lucky). My job now is to find a pattern, which would help to decide which customers to address in future campaigns, so that - let's say - only 5'000 customers have to be addressed and we would still get 300 responders.

In an approach using a decision tree algorithm, one would use something like "Response: True/False" as a label. Apparently, 300 out of 10'000 is a huge imbalance of data, and so far I didn't manage to get any conclusive results from a decision tree. Mostly there is just one node that says: customer does not react One would have to use heavy over- or undersampling.

I just stumbled over a recent paper, where they have developed a new decision tree algorithm, that addresses exactly this problem. According to them, it performs better than under-/oversampling. I'll post the abstract below. The entire document can be downloaded freely from: http://www.nd.edu/~dial/papers/SDM10.pdf

It also contains the algorithms in pseudo-code. Maybe this would be an interesting addition to Rapidminder, alongside with over- and undersampling.

We propose a new decision tree algorithm, Class Confidence Proportion Decision Tree (CCPDT), which is robust and insensitive to size of classes and generates rules which are statistically significant. In order to make decision trees robust, we begin by expressing Information Gain, the metric used in C4.5, in terms of confidence of a rule. This allows us to immediately explain why Information Gain, like confidence, results in rules which are biased towards the majority class. To overcome this bias, we introduce a new measure, Class Confidence Proportion (CCP), which forms the basis of CCPDT. To generate rules which are statistically significant we design a novel and efficient top-down and bottom-up approach which uses Fisher’s exact test to prune branches of the tree which are not statistically significant. Together these two changes yield a classifier that performs statistically better than not only traditional decision trees but also trees learned from data that has been balanced by well known sampling techniques. Our claims are confirmed through extensive experiments and comparisons against C4.5, CART, HDDT and SPARCCC.

Thank you very much for considering this and best regards

Hanspeter

Tagged:

0

## Answers

38Mavenhttp://videolectures.net/ecmlpkdd08_cieslak_ldtf/

Best regards

Hanspeter

849MavenHere are some more..

http://www.google.fr/search?q=imbalanced+decision+tree

In general, when I see that a question raises this many different answers, I start to feel it may be unanswerable. ;D But luckily in this case you have an alternative - presumably you get paid for this work, so why not invest in some RapidMiner consultancy http://rapid-i.com/content/view/3/76/lang,en/? I know they know.

38MavenImbalance just seems to be a common problem - which is also supported by your google query. That's why I was wondering whether adding one or two algorithms to RapidMiner, that can potentially deal with this problem in a better way than already implemented algorithms, wouldn't benefit RapidMiner.

Common problems ask for some kind of solution

I've already had some success by combining the "Generate Weight (Stratification)" operator with the Weka ADTree algorithm. Weighted examples clearly improve the ability of the algorithm to deal with imbalance. But I am quite sure that some algorithms, specifically designed to deal with this problem, would do even better.

This would broaden the application spectrum of RapidMiner. So, this is more of a suggestion ...

Best regards

Hanspeter

849MavenBefore Rapido gets even more operators, perhaps we should check out other avenues that are already covered. For example you could try tuning the perceived costs and benefits of the entire classification matrix, with the MetaCost operator, as well as pre-processing into clusters ... The possibilities are many, and time is short..!

38MavenBest regards

Hanspeter