Normalize variables
Hi,
I'm trying to create a predictive model for churn. Some of the variables I'm using are the percentage change in sales from month to month. To control the outliers on the positive side (e.g. a 200% increase in sales from month-5 to month-4) I set a cap at 3 (300%). On the negative side (i.e. a drop in sales) the most a customer can drop is -1 (-100%), but I have many of these cases. Apart from these customers my distribution is pretty normal, but the spike at -1 makes it bimodal.
Is there any calculation I can do with this variable to normalize the distribution including the -1 (-100%) instances? Or if there is no way to do this, any other suggestions would be great.
Thanks in advance for your help.
Keith
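For readers following along, the variable described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the sales figures are hypothetical, and previous-month sales are assumed to be greater than zero. Only the cap of 3 (300%) and the floor of -1 (-100%) come from the post.

```python
def pct_change(prev, curr):
    """Fractional change from prev to curr; -1.0 means a 100% drop.

    Assumes prev > 0 (a customer with no prior sales has no defined change).
    """
    return (curr - prev) / prev


def capped_change(prev, curr, cap=3.0):
    """Percentage change with the positive side capped at `cap` (3 = 300%)."""
    return min(pct_change(prev, curr), cap)


# Hypothetical (previous month, current month) sales for five customers.
sales = [(100, 110), (200, 0), (50, 400), (80, 72), (120, 0)]

changes = [capped_change(prev, curr) for prev, curr in sales]

# Customers sitting exactly at the -100% floor: the source of the second mode.
at_floor = sum(1 for x in changes if x == -1.0)

print(changes)
print(at_floor)
```

Counting the cases at exactly -1 is a quick way to see how large the second mode is relative to the rest of the distribution.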
Answers
Cluster 1: Outliers negative
Cluster 2: Normal decrease
Cluster 3: Stable
...
...
...
Cluster n: Extreme growth
This will work as long as I make the interval for the "Outliers negative" bin narrower than the other bins. In other words, to avoid putting too many instances in that bin, its range would have to run from -100% to, let's say, -90%, while the other bins would cover a much wider range (e.g. -89% to -40%).
Statistically speaking, is it OK to have bins with different ranges like that?
Keith
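A minimal sketch of the unequal-width binning described above, in plain Python. Only the first two ranges (-100% to -90%, and roughly -90% to -40%) come from the post. The remaining cut points, the "Normal growth" label, and the sample values are assumptions made up for illustration. Each bin here includes its lower edge and excludes its upper edge.

```python
from bisect import bisect_right
from collections import Counter

# Cut points between bins, as fractions (-0.90 = -90%).
# The first two follow the post; 0.40 and 1.00 are hypothetical.
EDGES = [-0.90, -0.40, 0.40, 1.00]

# One label per bin; "Normal growth" is a placeholder for the elided clusters.
LABELS = [
    "Outliers negative",  # -100% up to (not including) -90%
    "Normal decrease",    # -90% up to -40%
    "Stable",             # -40% up to +40%
    "Normal growth",      # +40% up to +100%
    "Extreme growth",     # +100% up to the 300% cap
]


def assign_bin(change):
    """Return the label of the bin that contains `change`."""
    # bisect_right counts how many edges are <= change, which is the bin index.
    return LABELS[bisect_right(EDGES, change)]


# Hypothetical capped percentage changes for a handful of customers.
changes = [-1.0, -1.0, -0.95, -0.5, 0.1, -0.1, 0.6, 3.0]
counts = Counter(assign_bin(x) for x in changes)

for label in LABELS:
    print(label, counts[label])
```

Because the bin widths differ, raw counts per bin are not directly comparable as a histogram; dividing each count by its bin width gives a density that is.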