discretize by variance?
Hi. I have a database in which each row represents a person, and one of the columns is income. I tried applying k-means to group the data set. I normalized the income column and also applied a log transform, but either way the results are not logical, because people with very dissimilar incomes end up in the same group. Income is not the only variable, but it is an important one. Because income has a very large coefficient of variation (1000%), I thought I could construct bins that each have a much smaller coefficient of variation, e.g., up to 30%. After discretizing, I would transform the bins to numerical values so that they can be used by the k-means operator.
Can this be done in RapidMiner? Any ideas would help.
Best Answer

omoratto
Brian, thank you so much for your feedback. I tried your suggested approach with normalization rather than the z-transformation; unfortunately, it came up with two groups. What I decided to do was apply an outlier detection model before clustering. That way, I split the data set into two sections (outlier, non-outlier) and applied k-means to each section. It worked pretty well.
Thank you.
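For anyone who wants to try the same idea outside RapidMiner, here is a minimal Python sketch of the split-then-cluster approach described above. It makes several assumptions that are not in the post: the outlier step is Tukey's IQR rule on log10(income) (the post does not say which outlier model was used), the clustering is a tiny one-dimensional k-means on income alone (the real data set has more variables), and all incomes are made up for illustration.

```python
import math
import statistics

def split_outliers(incomes, k=1.5):
    """Split incomes into (regular, outliers) with Tukey's IQR fences on log10(income).

    Assumption: the IQR rule stands in for whatever outlier model was really used.
    """
    logs = [math.log10(x) for x in incomes]
    q1, _, q3 = statistics.quantiles(logs, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    regular = [x for x, lg in zip(incomes, logs) if low <= lg <= high]
    outliers = [x for x, lg in zip(incomes, logs) if not low <= lg <= high]
    return regular, outliers

def kmeans_1d(values, k, iterations=50):
    """Plain Lloyd's k-means on one numeric column (needs len(values) >= k).

    Returns the centroids in ascending order.
    """
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids evenly over the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            groups[nearest].append(v)
        # Keep the old centroid if a group ends up empty.
        updated = [statistics.fmean(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)

# Hypothetical incomes, for illustration only.
incomes = [28_000, 35_000, 41_000, 47_000, 52_000, 58_000,
           60_000, 66_000, 73_000, 85_000, 4_500_000, 6_000_000]

regular, outliers = split_outliers(incomes)
regular_centroids = kmeans_1d(regular, k=2)
outlier_centroids = kmeans_1d(outliers, k=1)
print("regular section centroids:", regular_centroids)
print("outlier section centroids:", outlier_centroids)
```

Clustering each section separately keeps a handful of very large incomes from pulling the centroids of the main population, which seems to be why the split helped here.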
Answers
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Thanks for your answer. The issue is that I want to discretize the income directly, because with the log transform k-means is grouping individuals with very large incomes (e.g., 6 million) with low-income individuals (e.g., 60k). I want to bin income in such a way that each bin has a low coefficient of variation (CV), i.e., < 30%, and I want to do it directly in RapidMiner.
Or is there another way to accomplish this?
Thanks.
Hi,
I am not sure how this should work with variance. I mean, the variance of higher values is naturally bigger, isn't it? Usually you take other measures into account. Did you have a look at Discretize by Entropy?
~Martin
Dortmund, Germany
Next, if you do a log (base 10) transformation of your incomes, you should absolutely be able to specify equal-width bins that will accommodate your desire to have a maximum proportional income range within each bin. If we are talking about annualized numbers, income is typically going to range from the tens of thousands up to perhaps the millions, which is only about 3 orders of magnitude, so your log values will mostly be between 4 and 7. If you selected a bin size of 0.2 (on the log scale), the top of each bin would be about 1.58 times its bottom (10^0.2 ≈ 1.58), which would ensure that within any given bin the coefficient of variation (sigma/mean) was not more than approximately 30%. (Check it out on a spreadsheet, it's just math!)
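If a spreadsheet is not handy, the same check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the incomes are invented, the helper names `log_bin` and `cv` are hypothetical, and the bin width of 0.2 is expressed as 5 bins per decade. The worst-case figure uses the fact that, for values confined to a range whose top is r times its bottom, sigma/mean cannot exceed (r - 1) / (2 * sqrt(r)).

```python
import math
import statistics
from collections import defaultdict

BINS_PER_DECADE = 5  # equivalent to a bin width of 0.2 on the log10 scale

def log_bin(income):
    """Index of the equal-width log10 bin that contains this income."""
    return math.floor(math.log10(income) * BINS_PER_DECADE)

def cv(values):
    """Coefficient of variation: population sigma divided by the mean."""
    return statistics.pstdev(values) / statistics.fmean(values)

# Hypothetical incomes, for illustration only.
incomes = [12_000, 15_000, 18_500, 42_000, 60_000, 61_000, 75_000,
           110_000, 250_000, 900_000, 1_200_000, 6_000_000]

bins = defaultdict(list)
for x in incomes:
    bins[log_bin(x)].append(x)

for idx in sorted(bins):
    low = 10 ** (idx / BINS_PER_DECADE)
    high = 10 ** ((idx + 1) / BINS_PER_DECADE)
    members = bins[idx]
    print(f"bin {idx}: [{low:,.0f}, {high:,.0f})  n={len(members)}  CV={cv(members):.1%}")

# Upper bound on the CV inside any one bin, whatever the data look like.
r = 10 ** (1 / BINS_PER_DECADE)
worst_case_cv = (r - 1) / (2 * math.sqrt(r))
print(f"worst-case CV per bin: {worst_case_cv:.1%}")
```

With this width, 60k and 6 million land in bins 23 and 33, so they can never share a bin, and the within-bin CV stays below the 30% target by construction.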
And as I mentioned, if you want your bins not to be equal in width for whatever reason, you can still use the Discretize by User Specification to simply create whatever bins you think are most appropriate for the actual distribution that you have.
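As a rough Python analogue of user-specified bins (not the RapidMiner operator itself), the sketch below maps each income to an ordinal bin number that a clustering step could then consume as a numeric value. The cut points in `EDGES` and the function name `income_bin` are hypothetical; you would pick edges that suit your actual distribution.

```python
from bisect import bisect_right

# Hypothetical hand-picked cut points; replace with edges that fit your data.
EDGES = [25_000, 50_000, 100_000, 250_000, 1_000_000]

def income_bin(income, edges=EDGES):
    """Ordinal bin index: 0 below the first edge, len(edges) at or above the last."""
    return bisect_right(edges, income)

# Hypothetical incomes, for illustration only.
for income in (10_000, 25_000, 60_000, 6_000_000):
    print(income, "-> bin", income_bin(income))
```

Each bin is closed on the left and open on the right, so an income exactly equal to an edge falls into the higher bin.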
Finally, is there any reason why you believe that 30% is a critical number when it comes to income variation? It seems like a fairly arbitrary threshold that you have defined. Perhaps you should look at a more data-driven mechanism to try to determine how income should affect the final model?