Bad Performance of ChurnPrediction

Tomatenmark Member Posts: 4 Contributor I
Hey there,

I created a process for ChurnPrediction. My label in the data set is Churn.
1 is for yes and 0 is for no.

I used a Decision Tree and the Cross Validation operator, as you can see in my process.
But my model never predicts that a customer will move/churn.
All customers are predicted to stay, therefore my class recall of true 1 is 0%.
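
A minimal Python sketch of why this can happen without the accuracy looking bad. The labels are made up (a 90/10 split is assumed purely for illustration and is not taken from the attached data), and the helper names `class_recall` and `accuracy` are hypothetical:

```python
def class_recall(actual, predicted, positive=1):
    """Share of actual `positive` cases that were also predicted as `positive`."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    actual_pos = sum(1 for a in actual if a == positive)
    return true_pos / actual_pos if actual_pos else 0.0


def accuracy(actual, predicted):
    """Share of rows where the prediction matches the label."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)


# Assumed toy data: 90 customers stay (0), 10 churn (1).
actual = [0] * 90 + [1] * 10
# A model that predicts "stay" for every customer.
predicted = [0] * 100

print("accuracy:", accuracy(actual, predicted))               # 0.9
print("recall of class 1:", class_recall(actual, predicted))  # 0.0
```

With an imbalanced label, predicting only the majority class already gives a high accuracy, so the recall of class 1 is the number to watch.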

I can not find the problem why my predictions are so bad.
Please find attached the data file, my process and a screenshot of the performance vector.

Thanks for your support :blush:


    BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    The default settings of Decision Tree are often good, but not in every case. They are meant to limit overfitting, but this could be inappropriate for your data.

    Try disabling pruning and prepruning first. Check the resulting model. Chances are that it will be a very complex tree (likely overfitted), but it will predict both categories, even if the cross validation shows bad results. If this works, you can enable pruning and prepruning again and adjust the parameters until you find the optimum.
    The best way to do this is by using Optimize Parameters. There's a readily usable building block in the Community Samples repository:
    Community Building Blocks/Optimize Decision Tree. 

    Here's an Academy video on parameter optimization:

    Lastly, maybe Decision Tree is not the best learner for your data. You could try Gradient Boosted Trees, Random Forests, Naive Bayes, Logistic Regression, Deep Learning, Support Vector Machines etc.
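
For readers who want to see the idea behind a grid-style parameter optimization outside of RapidMiner, here is a rough Python sketch. It is not the Optimize Parameters operator itself. The function `grid_search`, the grid values, and the scoring formula in `toy_evaluate` are all assumptions made up for illustration; in a real setup `evaluate` would train the tree with the given parameters, cross-validate it, and return a measure such as the recall of class 1.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every parameter combination and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; the names only mirror typical decision tree settings.
param_grid = {
    "maximal_depth": [2, 5, 10],
    "minimal_leaf_size": [1, 5, 20],
    "apply_pruning": [True, False],
}


def toy_evaluate(params):
    # Stand-in for "train, cross-validate, return recall of class 1".
    # The formula is invented so that the example runs without any data.
    score = 0.5 - abs(params["maximal_depth"] - 5) * 0.05
    score -= abs(params["minimal_leaf_size"] - 5) * 0.01
    if params["apply_pruning"]:
        score -= 0.1
    return score


best_params, best_score = grid_search(param_grid, toy_evaluate)
print(best_params, best_score)
```

The important design choice is the score being optimized: if it is plain accuracy on an imbalanced label, the search can happily settle on a tree that predicts only the majority class.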

    rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn

    I saw your data, and I think the following:
    • You have some variables that identify the customer uniquely; you should remove these from the model.
    • Most of your variables are numeric. Therefore
    Also, I would try to check if there are correlated variables. Here it's 2:20 am and I couldn't keep playing with it.

    As @BalazsBarany already mentioned, it would be good to use "Optimize Parameters" to figure out which parameters work better. I would also play a little with the statistics of the data to see how stable your data is, how many records are repeated, etc. (e.g. textness, uniqueness, correlation to the target variable).
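
As a rough illustration of two of these checks (uniqueness and correlation to the target), here is a small Python sketch using only the standard library. The column values are invented, `MonthlyCharge` and `Tenure` are hypothetical column names, and the helpers `uniqueness` and `pearson` are not RapidMiner functions:

```python
from math import sqrt


def uniqueness(values):
    """Fraction of distinct values; 1.0 means every row is unique (ID-like)."""
    return len(set(values)) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Assumed toy data, not taken from the attached file.
columns = {
    "CustomerID": [104, 101, 106, 102, 105, 103],
    "MonthlyCharge": [20, 25, 30, 70, 80, 90],
    "Tenure": [40, 35, 30, 5, 3, 2],
}
churn = [0, 0, 0, 1, 1, 1]

for name, values in columns.items():
    print(name, round(uniqueness(values), 2), round(pearson(values, churn), 2))
```

A column whose uniqueness is 1.0 behaves like an identifier and carries no generalizable signal, while columns with a strong positive or negative correlation to the label are candidates worth a closer look.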

    Hope this helps,

    Tomatenmark Member Posts: 4 Contributor I
    edited February 2020

    Thanks for your answer. As you can see, I used optimize_parameters_grid in my process. I tried many different parameter combinations for the decision tree, but it is still not working.

    To find correlated variables I saw that I can use a correlation matrix, I will try it out.

    Is it fine that most of my variables are numeric, or is that a problem?
    Why should I remove the CustomerID? In my Set Role operator I told RapidMiner that it is an ID column.

    Hope my explanations are understandable to you :smile:
    Thanks in advance,
