
New guy ... help interpreting data

ad2045 Member Posts: 5 Newbie
edited December 3 in Help
Experts,

I need some help interpreting the output of Auto Model. I have a true/false label split about 600/11,000 across approximately 12,000 examples. At first glance, the Random Forest is the most accurate, but the AUC is much higher for the Gradient Boosted Trees, and the precision points at the Decision Tree and the Random Forest. I am not an expert in statistics, and I would much appreciate it if someone could break this down for me and tell me whether any of the predictions are statistically meaningful and how I go about determining that.

Thank you!
 
Model                     Accuracy (%)  Classification Error (%)  AUC    Precision (%)  Recall (%)  F-Measure (%)  Sensitivity (%)  Specificity (%)
Naive Bayes               93.1          6.9                       0.859  26.7           23.0        24.7           23.0             96.7
Generalized Linear Model  94.8          5.2                       0.855  40.0           13.1        19.8           13.1             99
Logistic Regression       94.7          5.3                       0.848  37.2           13.1        19.4           13.1             98.9
Deep Learning             93.5          6.5                       0.867  31.9           29.5        30.6           29.5             96.7
Decision Tree             95.2          4.8                       0.500  100.0          1.6         3.2            1.6              100
Random Forest             95.3          4.7                       0.739  100.0          3.3         6.3            3.3              100
Gradient Boosted Trees    94.6          5.4                       0.915  40.6           21.3        28.0           21.3             98.4

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 882 Unicorn
    You can't focus on accuracy here because your dataset is so imbalanced: it is easy to achieve high accuracy by simply predicting the majority class. In fact, you should probably consider either weighting or sampling to address your class imbalance, because it is almost certainly influencing your models.
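
    To make the majority-class point concrete, here is a minimal sketch that uses only the class counts quoted in the question (600 versus 11,000). The function name is illustrative and not part of RapidMiner.

```python
# Class counts taken from the question: about 600 examples of one class
# and about 11,000 of the other.
n_minority = 600
n_majority = 11_000


def majority_baseline_accuracy(n_a: int, n_b: int) -> float:
    """Accuracy of a 'model' that always predicts the larger class."""
    return max(n_a, n_b) / (n_a + n_b)


baseline = majority_baseline_accuracy(n_minority, n_majority)
print(f"Always predicting the majority class: {baseline:.1%} accuracy")
```

    Every accuracy in the table (93.1% to 95.3%) lies within about two points of this do-nothing baseline, which is why accuracy alone cannot separate the models.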
    AUC is a much better measure of model performance when you have an imbalanced class distribution, so by that measure the GBT is indeed the best performing model. It is noteworthy that the very simple Naive Bayes is also performing quite well here, and that might be a good starting place or baseline model.
    The question of statistical significance is one that is laden with theoretical baggage.  The short answer to your question is that all of these models other than the decision tree are giving you some kind of discriminatory tool to use regardless of your theoretical perspective on p-value interpretations (frequentist or Bayesian).  Modern machine learning does not heavily emphasize the calculation or role of p-values, unlike the classic statistical approach; instead, it relies on cross-validation performance (you did use cross-validation, didn't you?) to understand model usefulness.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • ad2045 Member Posts: 5 Newbie
    I am not sure I follow. How is the dataset imbalanced? Let's assume that 600 people died and 11,000 survived train crashes. I have approximately 50 data points that describe the train and the people. I have 12,000 crashes, and I am trying to predict the likelihood of a passenger dying based on the 50 data points. Are you saying that the AUC is best at describing the rate of survival? Why is that? I used Auto Model. The data is straight out of Auto Model.


  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 882 Unicorn
    As Balazs says, the rate of survival is not the key issue here because most people survive. You want to build a model that can help you find out which key factors are associated with death versus survival, that is, a way to separate the two classes. For that, AUC is probably the best performance measure for your model.
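
    One way to read AUC as a measure of class separation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force on made-up scores; the data and the function name are illustrative assumptions, not output from Auto Model.

```python
def auc_from_scores(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels: two positives among six examples (invented for illustration).
labels = [True, False, False, True, False, False]

# A score that is identical for everyone cannot rank anything.
constant_scores = [0.05] * 6
# A score that puts most positives above most negatives.
informative_scores = [0.9, 0.2, 0.4, 0.3, 0.1, 0.05]

auc_constant = auc_from_scores(constant_scores, labels)
auc_informative = auc_from_scores(informative_scores, labels)
print(auc_constant, auc_informative)
```

    An AUC of 0.5, the value the table reports for the Decision Tree, is what a score that does not rank the two classes at all would produce. Because AUC depends only on the ordering of the scores, it does not reward a model for simply favoring the majority class.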
    Brian T.
  • t_liebe Member Posts: 14 Contributor I

    Hey guys,

    I am dealing with the same problem at the moment and would like to use SMOTE to upsample. Regarding this, I have 2 questions.

    1. I do not have just one attribute that I want to predict as true or false, but 10 (different fault categories that should be predicted for each event). Some of them are 2/3 balanced, some are 5:95. Is it correct to upsample for every attribute separately? My data set would then be around 7 times bigger than before. Or is there another way?

    2. @BalazsBarany, you said that you would do the up/downsampling inside the cross-validation's training phase. I thought that you would do it as part of the feature generation process and do the modelling after that.

    If you need more information or this does not fit here, I will make a separate post and maybe add some information.

    Thank you for your help.

    Regards,
    Tobias

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 882 Unicorn
    If you have 10 different attributes to predict, you are going to need to sample 10 different times and set each attribute as a label to build 10 different models.
    The reason why sampling is done inside cross-validation is to determine the impact that sampling (which involves some quasi-random processes) has on model performance, which can be significant. Ultimately you may want to do it on the entire dataset when building your final model, but for understanding performance, doing it inside the CV will give you the least biased performance estimate.
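
    The fold structure described above can be sketched in plain Python. This is an illustration under stated assumptions, not the RapidMiner process: the data is invented, the helper names are hypothetical, and plain random oversampling stands in for SMOTE. The point it shows is that only the training part of each fold is rebalanced, while the held-out part keeps its original class ratio.

```python
import random


def stratified_folds(labels, k, rng):
    """Deal positive and negative indices round-robin into k folds."""
    pos = [i for i, y in enumerate(labels) if y]
    neg = [i for i, y in enumerate(labels) if not y]
    rng.shuffle(pos)
    rng.shuffle(neg)
    folds = [[] for _ in range(k)]
    for j, i in enumerate(pos + neg):
        folds[j % k].append(i)
    return folds


def oversample_minority(rows, rng):
    """Random oversampling with replacement until both classes are equal in size.

    This is a simple stand-in for SMOTE, which creates synthetic examples instead.
    """
    pos = [r for r in rows if r[1]]
    neg = [r for r in rows if not r[1]]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(rows)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(rows) + extra


# Toy data: (feature, label) pairs with 5% positives (invented for illustration).
data = [(i, i % 20 == 0) for i in range(200)]
rng = random.Random(42)
folds = stratified_folds([y for _, y in data], k=5, rng=rng)

fold_summaries = []
for test_idx in folds:
    held_out = set(test_idx)
    train = [row for i, row in enumerate(data) if i not in held_out]
    test = [data[i] for i in test_idx]           # never resampled
    balanced_train = oversample_minority(train, rng)
    # A model would be fit on balanced_train and scored on the untouched test fold.
    fold_summaries.append({
        "train_pos_rate": sum(y for _, y in balanced_train) / len(balanced_train),
        "test_pos_rate": sum(y for _, y in test) / len(test),
        "test_size": len(test),
    })

for summary in fold_summaries:
    print(summary)
```

    Each training set ends up balanced, but every example is still validated exactly once at the original 5% positive rate, so the performance estimate reflects the data the model would actually face.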

    Brian T.
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 232 Unicorn
    Also, sampling inside the cross-validation training phase means that the whole data set is still used for validation. If you downsample before the validation, you lose valuable data for validation.
  • varunm1 Member Posts: 16 Maven
    As your dataset is imbalanced, you can use kappa, which is an inter-rater agreement measure, and root mean squared error for a better understanding of the performance. 5-fold cross-validation is recommended in this case. I suggest you be careful with downsampling, as in the real world we need to deal with this sort of data.
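
    As a sketch of why kappa helps with imbalanced data, the function below computes Cohen's kappa from the four cells of a binary confusion matrix. The counts in the examples are hypothetical, built from the 600/11,000 split in the question, and are not taken from the Auto Model results.

```python
def cohen_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's kappa: agreement between predictions and labels beyond chance."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    if expected == 1.0:
        return 0.0  # degenerate case: only one class present in labels and predictions
    return (observed - expected) / (1.0 - expected)


# Hypothetical model that always predicts the majority class:
# high accuracy, but no agreement beyond chance.
kappa_majority_only = cohen_kappa(tp=0, fp=0, fn=600, tn=11_000)

# Hypothetical perfect model on the same class split.
kappa_perfect = cohen_kappa(tp=600, fp=0, fn=0, tn=11_000)

print(kappa_majority_only, kappa_perfect)
```

    Unlike accuracy, kappa gives the always-majority model no credit, because it subtracts the agreement that the class proportions alone would produce.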