"seemingly inconsistent result in prediction with Decision Tree MetaCost"

dan_agape Member Posts: 106  Guru
edited June 2019 in Help
I've built a process intended to predict which customers are likely to churn (i.e. leave the service provided by a company). I used two decision tree (DT) implementations based on C4.5: RM's own and Weka's J48. DTs are particularly useful here for profiling potential churners, since they let you learn about their characteristics. I added meta-learning via the MetaCost operator to encourage the two algorithms to detect more possible churners. While playing around with parameter tuning, I came across an anomaly:

RM's C4.5 implementation generated a tree consisting of the root only, with the decision churn=No (expected, because most customers do not churn). This is not a problem in itself and can easily be changed by retuning the parameters, in particular the minimal gain. What is a problem is that, although this tree should predict No for all instances, a few instances are predicted Yes.
Since the tree has just one node, the confidence for No is the same for all instances, namely 0.726. Under these circumstances I can hardly see why a few instances, with the same confidence of 0.726 for No as every other instance, are predicted Yes.
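
For reference only, and not as a diagnosis of what RM does here: cost-sensitive schemes of the MetaCost kind choose the class with the lowest expected cost rather than the class with the highest confidence. The minimal Python sketch below illustrates that rule. The cost matrix and the function names are hypothetical, and the Yes confidence is simply assumed to be 1 - 0.726. It shows how a Yes prediction can coexist with a 0.726 confidence for No, but it does not account for only a few of the identically scored instances being flipped.

```python
# Hypothetical cost matrix, COST[actual][predicted]; the values are illustrative
# assumptions and are not taken from the process discussed in this thread.
COST = {
    "No": {"No": 0.0, "Yes": 1.0},   # cost of flagging a loyal customer
    "Yes": {"No": 5.0, "Yes": 0.0},  # cost of missing a churner
}


def expected_cost(confidences, predicted):
    """Expected cost of predicting `predicted`, given per-class confidences."""
    return sum(p * COST[actual][predicted] for actual, p in confidences.items())


def min_cost_class(confidences):
    """Return the label whose prediction has the lowest expected cost."""
    return min(COST, key=lambda label: expected_cost(confidences, label))


# Confidence for No as reported above; Yes is assumed to be the complement.
confidences = {"No": 0.726, "Yes": 0.274}
prediction = min_cost_class(confidences)
print(prediction)
```

With these assumed costs, predicting No has an expected cost of 0.274 * 5 = 1.37, while predicting Yes costs 0.726 * 1 = 0.726, so the rule picks Yes even though No has the higher confidence.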

Another inconsistency: the confidences for the classes No and Yes for one instance do not add up to 1.

The dataset is not publicly available, so the process cannot be tested by others. If anyone wants to check this apparent inconsistency, I could provide image files of the instances' scores, the confusion matrix and the tree, which sufficiently illustrate what I said above (however, I'm not sure whether the insert image button works for posting).

Cheers,
Dan

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,529   Unicorn
    Hi Dan,
    without the process and the data, it's a little difficult for me to check whether there is a bug in the code or what other reason there might be for this behavior.
    Did you try to build a process reproducing this behavior with only data generators?

    PS: Have you thought about becoming an enterprise customer? We could sign an NDA or set up a WebEx session to make a reliable diagnosis and solve the problem.

    Greetings,
    Sebastian
  • dan_agape Member Posts: 106  Guru
    Hi Sebastian,

    Thanks, I am not yet at the stage of becoming an RM enterprise customer. However, like anybody here, I am happy in the meantime to make my small contribution to improving RM by reporting whatever inconsistency or bug I find to the wonderful RM team. I could email the image files, and also the process, if this would be of any help, but unfortunately not the dataset which, as I said, is not publicly available. Please let me know if you want me to do that. For my work, that's fine, as I have alternative DM software at my disposal. For now I am just exploring RM and other DM suites with a view to possible future critical use.

    Best wishes,
    Dan
  • dan_agape Member Posts: 106  Guru
    BTW, the dataset does not belong to me so personally I couldn't sign an NDA anyway.

    I'll try once, and if by chance (the probability of which is very small anyway) the same problem occurs with generated/artificial data, I will let you know.
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,529   Unicorn
    Hi Dan,
    please email me the process and the pictures. I might find some time to look at it this (comparatively relaxed) week.
    I will pm you my email address.

    Greetings,
    Sebastian