RapidMiner

Understand GBT Model Output

Learner II shraddha_neema

Hello,

 

Any help in this matter would be really appreciated. 

I am using the GBT operator to train a model on a customer churn example set, and I get approx. 80% accuracy with the GBT model. My problem is how to relate this GBT model output to business processes.

How should I communicate the GBT results to business folks so that they understand why a specific customer churned and which variables contributed to a Terminated status rather than an Active status?

 

Another question I have in mind: how do I calculate the threshold values of the variables at which customers change their mind? That way we can keep watch on certain metrics to prevent churn.

 

Here is the result from the GBT model:

 

Model Metrics Type: Binomial

 Description: N/A

 model id: rm-h2o-model-gradient_boosted_trees-422159

 frame id: rm-h2o-frame-gradient_boosted_trees-324798

 MSE: 0.10739042

 R^2: 0.5584855

 AUC: 0.9389837

 logloss: 0.35373378

 CM: Confusion Matrix (vertical: actual; across: predicted):

             Active  Terminated   Error  Rate
Active          590         139  0.1907  = 139 / 729
Terminated       53         470  0.1013  = 53 / 523
Totals          643         609  0.1534  = 192 / 1,252

Gains/Lift Table (Avg response rate: 41.77 %):

 

Group  Cumulative Data Fraction  Lower Threshold      Lift  Cumulative Lift  Response Rate  Cumulative Response Rate  Capture Rate  Cumulative Capture Rate         Gain  Cumulative Gain

      1                0.01038339         0.926587  2.393881         2.393881       1.000000                  1.000000      0.024857                 0.024857   139.388145       139.388145

      2                0.02076677         0.926248  2.393881         2.393881       1.000000                  1.000000      0.024857                 0.049713   139.388145       139.388145

      3                0.03035144         0.926021  2.393881         2.393881       1.000000                  1.000000      0.022945                 0.072658   139.388145       139.388145

      4                0.04073482         0.925124  2.393881         2.393881       1.000000                  1.000000      0.024857                 0.097514   139.388145       139.388145

      5                0.05111821         0.924748  2.393881         2.393881       1.000000                  1.000000      0.024857                 0.122371   139.388145       139.388145

      6                0.10063898         0.913532  2.393881         2.393881       1.000000                  1.000000      0.118547                 0.240918   139.388145       139.388145

      7                0.15015974         0.872454  2.393881         2.393881       1.000000                  1.000000      0.118547                 0.359465   139.388145       139.388145

      8                0.20047923         0.754298  2.355883         2.384344       0.984127                  0.996016      0.118547                 0.478011   135.588333       138.434408

      9                0.30031949         0.570023  1.953407         2.241081       0.816000                  0.936170      0.195029                 0.673040    95.340727       124.108051

     10                0.40015974         0.429297  1.378876         2.025960       0.576000                  0.846307      0.137667                 0.810707    37.887572       102.595955

     11                0.50000000         0.326709  0.957553         1.812620       0.400000                  0.757188      0.095602                 0.906310    -4.244742        81.261950

     12                0.59984026         0.267012  0.459625         1.587421       0.192000                  0.663116      0.045889                 0.952199   -54.037476        58.742072

     13                0.69968051         0.227460  0.344719         1.410095       0.144000                  0.589041      0.034417                 0.986616   -65.528107        41.009455

     14                0.80031949         0.103437  0.132993         1.249501       0.055556                  0.521956      0.013384                 1.000000   -86.700659        24.950100

     15                0.90095847         0.068919  0.000000         1.109929       0.000000                  0.463652      0.000000                 1.000000  -100.000000        10.992908

     16                1.00000000         0.057902  0.000000         1.000000       0.000000                  0.417732      0.000000                 1.000000  -100.000000         0.000000

 

  

Variable  Relative Importance  Scaled Importance  Percentage
Field1             445.525879           1           0.49061
Field2             158.352005           0.355427    0.174376
Field3              93.245522           0.209293    0.102681
Field4              51.406567           0.115384    0.056609
Field5              34.961025           0.078471    0.038499
Field6              26.576853           0.059653    0.029266
Field7              19.5725             0.043931    0.021553
Field8              19.506002           0.043782    0.02148
Field9              19.407133           0.04356     0.021371
Field10             13.182694           0.029589    0.014517
Field11             11.111937           0.024941    0.012236
Field12              4.461669           0.010014    0.004913
Field13              3.955152           0.008877    0.004355
Field14              3.564302           0.008       0.003925
Field15              3.276087           0.007353    0.003608
Field16              0                  0           0

 

 

Thank You

 

 

13 REPLIES
Guru

Re: Understand GBT Model Output

Short answer: You can't gain any intuition from a GBT directly. A GBT is an ensemble of trees (sometimes hundreds of trees), so it is really difficult to interpret.

 

I've seen in other software (I can't remember which one) that you hold k-1 variables constant, vary the remaining one, and plot the GBT's prediction. Then you can visualize what type of relation exists between the label and that attribute.
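
This idea is commonly known as a partial dependence plot. Below is a minimal Python sketch of a close variant: it forces one attribute to each value on a grid, leaves the other attributes at their observed values, and averages the predictions. The `toy_predict` function and the two example rows are invented stand-ins for the trained GBT and the churn data, so only the shape of the procedure carries over.

```python
from statistics import mean

def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`."""
    curve = []
    for value in grid:
        # Overwrite one attribute in every row; keep the others as observed.
        modified = [{**row, feature: value} for row in rows]
        curve.append((value, mean(predict(r) for r in modified)))
    return curve

# Hypothetical stand-in for the trained model: returns a churn probability.
def toy_predict(row):
    score = 0.1 + 0.2 * row["Field1"] - 0.05 * row["Field2"]
    return min(1.0, max(0.0, score))

# Hypothetical example rows; real use would pass the scored example set.
rows = [{"Field1": 1, "Field2": 2}, {"Field1": 3, "Field2": 0}]

curve = partial_dependence(toy_predict, rows, "Field1", [0, 1, 2, 3, 4])
for value, avg in curve:
    print(f"Field1={value}: average predicted churn probability {avg:.3f}")
```

Plotting the resulting curve for each important attribute gives business users a picture of how the predicted churn probability moves as that attribute changes.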

 

With respect to your second question: Before you can find an optimal threshold, you have to specify the costs of making mistakes in your classification. Once you know those costs, you can use operators like "Find Threshold" to solve for the optimal threshold T.

RM Certified Expert

Re: Understand GBT Model Output

I just came back from giving RapidMiner training, and a similar question was raised there. How do you explain a complex algorithm like a Neural Net or GBT in layman's terms to a business group? It's hard, especially if the algorithm can handle high-dimensional data or is simply complex in its workings.

 

In your case, explaining it might be a bit easier than, say, a Neural Net. Everyone understands a decision tree, so you can say that GBT is like a decision tree but better, because it generates many more trees (like a Random Forest) and has some special characteristics that help convert your 'weak' hypotheses into 'stronger' ones. There's a great high-level overview of GBT here: http://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/

RM Certified Expert

Re: Understand GBT Model Output

The model output also provides some insight into variable importance, in the section below. It won't tell you why a specific case has the prediction that it does, but it at least gives you an overall sense of which attributes matter most in the predictions from that GBT model, and how strong they are relative to each other:

 

Variable  Relative Importance  Scaled Importance  Percentage
Field1             445.525879           1           0.49061
Field2             158.352005           0.355427    0.174376
Field3              93.245522           0.209293    0.102681
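
For reading the three columns: the numbers in the original post are consistent with Scaled Importance being the relative importance divided by the largest value, and Percentage being the relative importance divided by the sum over all fields. That relationship is inferred from the posted figures, not quoted from the operator documentation. A short Python check using the sixteen values from the first post:

```python
# Relative importances for Field1..Field16, copied from the original post.
relative = [
    445.525879, 158.352005, 93.245522, 51.406567, 34.961025, 26.576853,
    19.5725, 19.506002, 19.407133, 13.182694, 11.111937, 4.461669,
    3.955152, 3.564302, 3.276087, 0.0,
]
names = [f"Field{i}" for i in range(1, len(relative) + 1)]

top = max(relative)
total = sum(relative)

# Assumed definitions, inferred from the posted table.
scaled = [r / top for r in relative]
percentage = [r / total for r in relative]

cumulative = 0.0
for name, s, p in zip(names[:3], scaled, percentage):
    cumulative += p
    print(f"{name}: scaled {s:.6f}, share {p:.2%}, cumulative {cumulative:.2%}")
```

In business terms, Field1 alone accounts for roughly half of the total importance, and the top three fields together for roughly three quarters.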

 

 

 

 

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts
RM Staff

Re: Understand GBT Model Output

Hey,

 

This sounds pretty much like a use case for my Get Local Interpretation operator, which is available in the Operator Toolbox. Have a look at it.

 

If this fits your needs, I am happy to have a look at this personally.

 

Best,

Martin

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
Learner II shraddha_neema

Re: Understand GBT Model Output

Hello, 

 

Thank you for your response. It makes sense. But how do you define "the costs of making mistakes in classification"? It would be very helpful if you could share a little more insight on this topic. I will explore this and see if it helps me identify the threshold value of any specific variable with respect to the label variable.

 

Thanks!
Shraddha  

Guru

Re: Understand GBT Model Output

I'll give you an example: the classical example of mailing an offer (a "catalog") to a customer.

 

You send the catalog to 1000 people at random, and now you want to develop a model to decide who you should send it to in the general population. If you send the catalog and the customer buys from it, you gain $10 (this is net of all costs, including the catalog). If she does not buy anything, you lose the cost of the catalog (say $1).

 

How would you decide what cut-off probability to use to decide to whom you should send a catalog?

 

DECISION 1: Mail the catalog

 

With probability p you make $10 and with probability (1-p) you lose $1.    Expected Value = 10*p - 1*(1-p) = 11*p - 1

 

DECISION 2: Don't mail it.

 

Then with certainty you will make $0.   Expected Value = 0

 

You should mail when expected value of Decision 1 is greater than EV of Decision 2. 

When :   11*p - 1 > 0 

Or when : p > 1/11

 

That's your optimal cut-off point. It does not maximize "accuracy", but you don't care about "accuracy"; you care about profit.

 

You can construct different examples of the same type and find in each case that the optimal p is different from the default p=0.5. Of course, if the costs are symmetric, then p*=0.5.
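
The same derivation works for any gain and loss: mailing has positive expected value when gain*p - loss*(1-p) > 0, i.e. when p > loss / (gain + loss). A small Python sketch of that formula follows; the function names are made up for illustration, and the $5/$5 case is an invented example of symmetric costs.

```python
def expected_value(p, gain, loss):
    """Expected profit of acting when success has probability p."""
    return gain * p - loss * (1 - p)

def break_even_probability(gain, loss):
    """Probability above which acting beats doing nothing (EV = 0)."""
    return loss / (gain + loss)

# Catalog example from above: gain $10, loss $1.
catalog_cutoff = break_even_probability(10, 1)
# Hypothetical symmetric case: gain $5, loss $5.
symmetric_cutoff = break_even_probability(5, 5)

print(f"catalog cut-off:   {catalog_cutoff:.4f}")
print(f"symmetric cut-off: {symmetric_cutoff:.4f}")
print(f"EV of mailing at p=0.20: {expected_value(0.20, 10, 1):+.2f}")
print(f"EV of mailing at p=0.05: {expected_value(0.05, 10, 1):+.2f}")
```

A customer with a 20% purchase probability is worth mailing even though the model would label her a non-buyer at the default 0.5 cut-off.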

 

The other example I was going to give you is the problem of classifying a transaction as fraudulent or not. I have a dataset with 300,000 transactions in a day. Only 500 are fraudulent. Think about the asymmetric costs in this example.
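
To see why accuracy is a poor guide here, consider a trivial "model" that labels every transaction as legitimate. The arithmetic, using only the two counts given above, in Python:

```python
total_transactions = 300_000
fraudulent = 500

# Predicting "not fraud" for everything gets all legitimate cases right
# and every fraudulent case wrong.
accuracy = (total_transactions - fraudulent) / total_transactions
fraud_caught = 0 / fraudulent

print(f"accuracy: {accuracy:.2%}, fraud caught: {fraud_caught:.0%}")
```

The accuracy is above 99.8% while not a single fraud is detected, which is why the cut-off has to come from the costs of each kind of mistake.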

 

 

 

 

RM Certified Expert

Re: Understand GBT Model Output

And there is a Performance (Costs) operator that allows you to enter these types of asymmetric costs in RapidMiner and optimize your model directly on those costs. Check it out!
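
As a rough illustration of what cost-based evaluation computes (this is plain Python, not the operator's interface), the sketch below weights the confusion matrix from the original post with a cost per cell. The cost values are purely hypothetical: 1 for a retention effort wasted on an Active customer and 10 for a missed churner.

```python
# (actual, predicted) -> count, taken from the confusion matrix in the first post.
confusion = {
    ("Active", "Active"): 590,
    ("Active", "Terminated"): 139,
    ("Terminated", "Active"): 53,
    ("Terminated", "Terminated"): 470,
}

# Hypothetical misclassification costs; replace with real business figures.
cost = {
    ("Active", "Active"): 0,
    ("Active", "Terminated"): 1,    # false alarm
    ("Terminated", "Active"): 10,   # missed churner
    ("Terminated", "Terminated"): 0,
}

n = sum(confusion.values())
total_cost = sum(confusion[cell] * cost[cell] for cell in confusion)
errors = sum(count for (actual, predicted), count in confusion.items()
             if actual != predicted)

print(f"error rate: {errors / n:.4f}")
print(f"total cost: {total_cost}, average cost per customer: {total_cost / n:.3f}")
```

With these assumed costs, the 53 missed churners cost far more than the 139 false alarms, so lowering the cut-off to catch more churners could reduce total cost even if the error rate goes up.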

 

 

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts
Learner II shraddha_neema

Re: Understand GBT Model Output

Thank you for the explanation. 

So do I need to identify the cost for every data point used to build the model? Or are you saying we need to define the cost only on the label variable?

 

My impression, from the explanation and after trying the operator, is that you mean defining the cost on the label variable. If that is true, I am more interested in identifying the threshold value for each important variable in the model, so that I can explain to the business users that when a variable reaches a particular band, it affects the client's decision. Let me know if this makes sense.

 

Thanks again for all your help and taking time to look into my question. 

 

Regards,

Shraddha 

 

Learner II shraddha_neema

Re: Understand GBT Model Output

Thank you Martin, I think this operator may provide a little more insight for understanding the model result with respect to the business problem.

I will explore this operator more. Thanks for pointing it out to me. 

 

Regards,

Shraddha 

 

 
