# New guy ... help interpreting data

I need some help interpreting the output of Auto Model. I have a true/false label split roughly 600/11,000 across approximately 12,000 examples. At first glance, the Random Forest is the most accurate, but the AUC is much higher for Gradient Boosted Trees, and precision favors the Decision Tree and Random Forest. I am not an expert in statistics, and I would much appreciate it if someone could break this down for me and tell me whether any of the predictions are statistically meaningful, and how I go about determining that.

Thank you!

| Model | Accuracy (%) | Classification Error (%) | AUC | Precision (%) | Recall (%) | F-Measure (%) | Sensitivity (%) | Specificity (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Naive Bayes | 93.1 | 6.9 | 0.859 | 26.7 | 23.0 | 24.7 | 23.0 | 96.7 |
| Generalized Linear Model | 94.8 | 5.2 | 0.855 | 40.0 | 13.1 | 19.8 | 13.1 | 99.0 |
| Logistic Regression | 94.7 | 5.3 | 0.848 | 37.2 | 13.1 | 19.4 | 13.1 | 98.9 |
| Deep Learning | 93.5 | 6.5 | 0.867 | 31.9 | 29.5 | 30.6 | 29.5 | 96.7 |
| Decision Tree | 95.2 | 4.8 | 0.500 | 100.0 | 1.6 | 3.2 | 1.6 | 100.0 |
| Random Forest | 95.3 | 4.7 | 0.739 | 100.0 | 3.3 | 6.3 | 3.3 | 100.0 |
| Gradient Boosted Trees | 94.6 | 5.4 | 0.915 | 40.6 | 21.3 | 28.0 | 21.3 | 98.4 |

## Answers

AUC is a much better measure of model performance when you have an imbalanced class distribution, so by that measure the GBT is indeed the best-performing model. It is noteworthy that the very simple Naive Bayes is also performing quite well here; that might be a good starting place or baseline model.

The question of statistical significance is one that is laden with theoretical baggage. The short answer to your question is that all of these models other than the decision tree are giving you some kind of discriminatory tool to use regardless of your theoretical perspective on p-value interpretations (frequentist or Bayesian). Modern machine learning does not heavily emphasize the calculation or role of p-values, unlike the classic statistical approach; instead, it relies on cross-validation performance (you did use cross-validation, didn't you?) to understand model usefulness.
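The comparison-by-cross-validation idea above can be sketched outside of RapidMiner as well. Below is a minimal illustration using scikit-learn on a synthetic dataset with a similar class imbalance (the dataset, models, and parameters are stand-ins, not the poster's actual workflow): stratified k-fold cross-validation gives a mean and spread of AUC per model, which is far more informative than a single hold-out number.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for an imbalanced (~5% positive) dataset
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)

# Stratified folds keep the 5:95 class ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for model in (LogisticRegression(max_iter=1000), GradientBoostingClassifier()):
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{type(model).__name__}: AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```

If one model's mean AUC exceeds another's by more than the fold-to-fold spread, that is a reasonable practical signal that the difference is not just noise.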

Lindon Ventures

Data Science Consulting from Certified RapidMiner Experts

The AUC is not meant to describe the rate of survival. It is a performance measure for comparing models: it gives you an idea of the quality of your models.

Imbalanced means that the two values of your label are not split 50:50 but more like 5:95. A simple model (maybe the decision tree) just tells you "Hey, everyone survives" and is 95% right (because 95% actually survive), but it's not a good model. You see this in the low AUC value.

A good model actually works on distinguishing died and survived and will have a higher AUC. (Take a look at the different ROC curves.)
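The accuracy-vs-AUC point above can be shown with a tiny pure-Python example (the 95:5 toy data here is illustrative, not the poster's dataset): a model that always predicts the majority class scores 95% accuracy yet has an AUC of exactly 0.5, i.e. no discriminative power at all.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Rank-based AUC: probability that a random positive example
    receives a higher score than a random negative one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y = [1] * 5 + [0] * 95          # 5:95 imbalanced label
trivial_scores = [0.0] * 100    # "everyone survives": same score for all
trivial_preds = [0] * 100

print(accuracy(y, trivial_preds))  # 0.95 -- looks impressive
print(auc(y, trivial_scores))      # 0.5  -- no discrimination at all
```

This is exactly why the Decision Tree in the table can have 95.2% accuracy and an AUC of 0.500 at the same time.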

You might get an even better model by, e.g., downsampling the survived class. The most correct way is to do this inside the cross-validation's training phase.

It depends on what your goal is. If death is the most important class, and you'd like to have a higher recall of death cases from your model, you could weight these even higher. Then you'll get a model that makes more mistakes on survivors (predicting them as dead) but will catch most of the deaths. If your use case is analyzing possible reasons for deaths and avoiding them, this might be your way to achieve it.
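The weighting idea above can be sketched with scikit-learn's `class_weight` parameter (used here as an assumed stand-in for weighting in RapidMiner; the data and the 10x weight are arbitrary illustrative choices): up-weighting the rare class pushes the model to catch more of it, trading some false alarms for higher recall.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 95:5 dataset, split with stratification to keep the ratio
X, y = make_classification(n_samples=4000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

recalls = {}
for w in (None, {0: 1, 1: 10}):  # 10x weight on the rare class is arbitrary
    model = LogisticRegression(max_iter=1000, class_weight=w).fit(X_tr, y_tr)
    recalls[str(w)] = recall_score(y_te, model.predict(X_te))
print(recalls)  # recall on the rare class, with and without weighting
```

The weighted model will typically show a higher recall on the rare class and a lower precision, which is the trade-off described above.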

Regards,

Balázs


Hey guys,

I am dealing with the same problem at the moment and would like to use SMOTE to upsample. Regarding this, I have two questions.

1. I have not only one attribute that I want to predict as true or false, but 10 (different fault categories that should be predicted for each event). Some of them are 2:3 balanced, some are 5:95. Is it correct to upsample for every attribute separately? My dataset would then be around 7 times bigger than before. Or is there another way?

2. @BalazsBarany, you said that you would do the up-/downsampling inside the cross-validation's training phase. I thought that you would do it as part of the feature generation process and do the modelling after that.

If you need more information, or if this does not fit here, I will make a separate post and add more details.

Thank you for your help.

Regards,

Tobias

The reason why sampling is done inside cross-validation is to determine the impact that sampling (which involves some quasi-random processes) has on model performance, which can be significant. Ultimately you may want to do it on the entire dataset when building your final model, but for understanding performance, doing it inside the CV will give you the least biased estimate.
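The "resample only the training fold" rule can be made concrete with a small scikit-learn sketch (assumed as a stand-in for the equivalent RapidMiner operators; the dataset and model are illustrative): the majority class is downsampled inside each training fold, while the test fold is left untouched so the performance estimate reflects the real class distribution.

```python
import random

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)

def downsample(X_tr, y_tr, seed=0):
    """Keep all minority rows; sample an equal number of majority rows."""
    rng = random.Random(seed)
    pos = [i for i, t in enumerate(y_tr) if t == 1]
    neg = [i for i, t in enumerate(y_tr) if t == 0]
    keep = pos + rng.sample(neg, len(pos))
    return X_tr[keep], y_tr[keep]

aucs = []
for train, test in StratifiedKFold(5, shuffle=True, random_state=0).split(X, y):
    # Resample the TRAINING fold only ...
    X_bal, y_bal = downsample(X[train], y[train])
    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    # ... and evaluate on the untouched, still-imbalanced test fold
    aucs.append(roc_auc_score(y[test], model.predict_proba(X[test])[:, 1]))
print(sum(aucs) / len(aucs))
```

Resampling before the split would leak copies (or near-duplicates) of training information into the test folds and make the cross-validated scores optimistically biased.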

