
Interpretation of different classification models - comprehension questions

cwerning Member Posts: 10 Contributor II
edited November 2018 in Help

Hello everyone,

 

I have some questions regarding the evaluation of my binary classification analysis. It would be great if you could share your thoughts on my considerations and decisions, because I am pretty new to RapidMiner and to machine learning in general. In addition, I want to clarify that I don't earn money with this analysis. It is just a case study, and I hope that other people who are interested in this topic can learn from this thread and my questions too.

 

Background: The training and test set contains about 450 weighted examples, the label distribution is 50:50, and 20 attributes are used to set up the model.

 

I noticed during the analysis that my data is very easy to split. The resulting decision trees therefore have at most 3 levels and an accuracy of 98%-99%. This sounds like a clearly overfitted model (marked with a red line in the results screenshot). In my opinion the result indicates that the models have small variance and small bias, which leads to overfitting, correct? Even if I don't use pruning, the tree has at most 4 levels. I tried to lower the pruning parameters to optimize the model, but this was not successful.

 

Afterwards I created some models with different classification techniques: several multilayer perceptrons with the Neural Net operator and several SVMs with the LibSVM operator. Please take a look at the attached screenshots; they show the configuration of each parameter. Every parameter not mentioned in the screenshots is left at its default value.

 

Screenshot: MLP parameters

 

Screenshot: SVM parameters

The next screenshot displays the results of my models. I used 10-fold, 5-fold and leave-one-out cross-validation with stratified sampling.

Screenshot: accuracy of the models

Please take a look at the results above. My next step was to eliminate the models with a deviation higher than 20 in order to choose a stable model (blue lines in the results screenshot). The deviation is the criterion for model stability, isn't it? It looks to me like the models are more stable after a 5-fold cross-validation than after a 10-fold or leave-one-out validation.
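As an aside, here is a rough scikit-learn sketch (not my RapidMiner process; the generated data and the depth-3 tree are placeholders) of what I mean by the mean and deviation of accuracy across the folds:

```python
# Rough scikit-learn sketch, not the actual RapidMiner process: data and tree
# settings are placeholders. It only shows how the fold count changes the mean
# and standard deviation ("deviation") of cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=450, n_features=20, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

for name, cv in [("5-fold", StratifiedKFold(5, shuffle=True, random_state=0)),
                 ("10-fold", StratifiedKFold(10, shuffle=True, random_state=0)),
                 ("leave-one-out", LeaveOneOut())]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    # Note: with leave-one-out each fold score is 0 or 1, so its standard
    # deviation is not directly comparable to the k-fold values.
    print(f"{name}: mean accuracy {scores.mean():.3f}, deviation {scores.std():.3f}")
```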

 

The models marked with a yellow line are the ones I prefer, because their accuracy is high and their deviation is small.

Furthermore, I think that SVM 6 is better than SVM 7 because it uses only 48 support vectors, whereas SVM 7 uses 258 support vectors.
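For illustration, this is roughly how the support vector count could be read off in scikit-learn; the C/gamma values are placeholders, not the actual settings of SVM 6 and SVM 7:

```python
# Sketch only: the C/gamma values are placeholders, not the settings of SVM 6
# and SVM 7 from the screenshots. It just shows how to read off the number of
# support vectors after fitting.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=450, n_features=20, random_state=0)
for label, params in [("SVM A", {"C": 10.0, "gamma": 0.1}),
                      ("SVM B", {"C": 0.1, "gamma": 1.0})]:
    clf = SVC(kernel="rbf", **params).fit(X, y)
    # n_support_ holds the number of support vectors per class.
    print(label, "support vectors:", clf.n_support_.sum())
```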

 

What do you think about my suggestions? At the moment I am searching for the best MLP model, but I don't know how to find it. Is there a way to detect overfitting in neural networks? 


Finally, I have a question regarding the evaluation charts. The following charts display the lift and ROC results of model SVM 6.

Screenshots: Lift chart, ROC chart

Can someone please explain the ROC threshold curve to me? I think I don't understand it properly.

 

Thanks in advance for your replies! I really appreciate it.

 

Best regards,

Christopher

 

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist

    Dear Christopher,

     

    Two quick things, because I am a bit in a hurry:

     

    1. I would argue that the decision tree is simply good. Why not check for overtraining with a holdout set?

     

    2. The AUC is calculated with different thresholds on the confidence. The blue line indicates which threshold (displayed on the y-axis) you need to choose to get the corresponding tpr/fpr.
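    If it helps, here is an illustrative scikit-learn sketch of the same idea (the data and model are placeholders, not SVM 6): each point on the ROC curve corresponds to one confidence threshold.

```python
# Illustrative sketch; data and model are placeholders, not SVM 6 from the
# thread. roc_curve returns one confidence threshold per (fpr, tpr) point,
# which is the relationship the blue threshold line in the chart visualizes.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=450, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X_tr, y_tr)
conf = clf.predict_proba(X_te)[:, 1]          # confidence for the positive class

fpr, tpr, thresholds = roc_curve(y_te, conf)  # one threshold per ROC point
print("AUC:", round(roc_auc_score(y_te, conf), 3))
for f, t, th in list(zip(fpr, tpr, thresholds))[:5]:
    print(f"threshold {th:.2f} -> fpr {f:.2f}, tpr {t:.2f}")
```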

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • cwerning Member Posts: 10 Contributor II

    Hi Martin, 

     

    first, thank you for your quick reply! 

     

    1. Good idea, but I thought that checking the model against a holdout set is already done by the cross-validation? My data set is not large enough to provide an additional sample.

     

    2. Thank you, finally I got it! :) 

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist

    Hi Christopher,

     

    Well, kind of, yes. X-Val checks whether something you did on the left-hand side leads to overtraining. Nevertheless, there can be overtraining from things you did not do on the left-hand side. Most prominently:

    • Optimization
    • Choice of Algorithm
    • Preprocessing Decisions

    What you are currently doing is choosing an algorithm. This is usually not performed inside an X-Validation. You might think of it as a parameter of your process (which could be optimized using Select Subprocess). One needs to be a bit careful that those things do not lead to overtraining.
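    An illustrative sketch of this idea outside RapidMiner (all models and parameter values are placeholders): treat the model choice as a parameter and validate that choice itself in an outer loop, i.e. nested cross-validation.

```python
# Rough scikit-learn analogue of treating the algorithm choice as a process
# parameter and validating the choice itself (nested cross-validation).
# All models and parameter values below are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=450, n_features=20, random_state=0)

pipe = Pipeline([("model", DecisionTreeClassifier())])
candidates = [
    {"model": [DecisionTreeClassifier(random_state=0)], "model__max_depth": [2, 3, 4]},
    {"model": [SVC(kernel="rbf")], "model__C": [0.1, 1.0, 10.0]},
]

inner = StratifiedKFold(5, shuffle=True, random_state=0)   # picks the model
outer = StratifiedKFold(5, shuffle=True, random_state=1)   # checks the pick
search = GridSearchCV(pipe, candidates, cv=inner, scoring="accuracy")
scores = cross_val_score(search, X, y, cv=outer, scoring="accuracy")
print(f"nested CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```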

     

    Long story short: you are also right. For your single decision tree it should be fine. That only reinforces my question: why do you think that your decision tree is overtrained? I would rather ask whether there is some label information in your attributes which can be exploited, or whether there is another reason why the decision tree is way stronger than the SVM.

     

    Edit: and w.r.t. the std_dev: I am still in deep thought about this paper: http://www.jmlr.org/papers/volume5/grandvalet04a/grandvalet04a.pdf and what it means for how to interpret the std_dev of the CV...

     

    And by the way, this is a great talk on CV: https://lagunita.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/cv_boot.pdf which also states that the std_dev of X-Val is tricky.

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • cwerning Member Posts: 10 Contributor II

    Hi Martin,

     

    I got your point about the X-Validation and the holdout set. Is there a way to check the previous actions for overtraining without a holdout set? 

     

    If choosing an algorithm is usually not performed inside an X-Validation, how would you choose it? I learned it this way and thought it was right. At least the X-Validation should prevent overtraining in most cases.

     

    I think my decision trees are overtrained because I cannot believe that my model is perfect. I think the reason for such a good performance is the data itself: the examples are quite similar, and I think that is the problem.

     

    Thank you for linking the paper and the presentation. I tried to understand the paper, but I think my knowledge of this topic is not sufficient. They say it is tricky to use the deviation, but I still don't understand why. :(

     

    Edit: Your first statement finally gave me an idea! The main reason why my models are so accurate is the attribute selection. Among the attributes used for the analysis are two with a particularly high weight (0.7 and 0.67 based on information gain). Without these two attributes, the performance is of course much lower. I am considering removing these two attributes in the preprocessing to make the model "softer". What do you think about this proposal?
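    For illustration, this is roughly how such attribute weights could be double-checked outside RapidMiner; mutual information is not identical to RapidMiner's information gain weighting, and the generated data is only a placeholder:

```python
# Sketch of checking attribute weights outside RapidMiner. mutual_info_classif
# is not identical to the "Weight by Information Gain" operator, but it ranks
# attributes in a similar spirit; the generated data is only a placeholder.
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# shuffle=False puts the two informative columns first, mimicking a data set
# that is dominated by two strong attributes.
X, y = make_classification(n_samples=450, n_features=20, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
weights = mutual_info_classif(X, y, random_state=0)
for idx, w in sorted(enumerate(weights), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"attribute {idx}: weight {w:.2f}")
```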

     

    Thanks for your reply and best regards,

     

    Christopher 

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist

    Christopher,

     

    I would say X-Val makes it harder to overtrain, and usually you realize that your algorithm is overtrained. This is slightly different from preventing overtraining. The punchline of the paper is that the std_dev of the X-Val might be biased. As I said, I am still not sure how impactful this is.

     

    Are those two attributes nominal? That would somewhat explain why the tree is so good. Why do you want to make them softer? Either they are this powerful or they aren't. If there is an easy model on the two, then I would generally take it. The only thing I could imagine is a biased sample?

     

    ~Martin

     

     

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • cwerning Member Posts: 10 Contributor II

    Good morning Martin,

     

    thank you very much for the explanation. I will look for additional sources on this topic.

     

    No, the attributes are numerical and describe the proportion of two chemical elements. I think you are right with your guess of a biased sample. Maybe I should tell you a little bit more about the general analysis. My goal is to classify the suitability of different steel grades for a special production method, but my training and test sample contains only examples of steel alloyed with these two chemical elements. The problem is that I did not know that until the day before yesterday. :(

     

    With this in mind, my sample is biased. That is why I thought about removing the two attributes from the analysis to make the model softer: I want to predict the class of steel grades that are not mainly alloyed with these two elements. Otherwise I would build a model that only achieves high accuracy on steel grades containing a high amount of these two chemical elements, and it would fail on every new sample with a different alloy.

     

    What do you think about this concern? Should I remove the two attributes from the analysis, or reduce their weight to a lower level (like the other chemical elements)?

     

    Martin, I cannot emphasise strongly enough how thankful I am for your help. I really appreciate it!

     

    Best regards,

    Christopher

     

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist

    Christopher,

     

    you are always welcome. That's what the community is for. If you come back and help others afterwards, everything is great.

     

     

    You definitely need to remove the attributes which have label information in them. Otherwise you might fool yourself and end up predicting not your label but something else (like the famous tank story: http://lesswrong.com/lw/7qz/machine_learning_and_unintended_consequences/ ). Simply not using a decision tree is not an option either: it could be that the neural net finds the same pattern the tree is exploiting.

     

    Interesting that you work in the steel industry. We have some use cases there. In particular, we have a research project in this area: http://www.presed.eu/ . Are you involved in it?

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • cwerning Member Posts: 10 Contributor II

    Good morning,

     

    my final solution is to keep both attributes and to reduce the data set which has to be predicted.

     

    Now I want to investigate the potential overfitting of the trees. Therefore I thought about logging the training and test error. Martin gave me the hint to compare them on each fold of the cross-validation, as you can see below.

    But I think we made an error in reasoning on that point. In my opinion, we should not log the error over the X-Validation folds but over the depth of the tree, or am I wrong? Otherwise it is just a comparison of the error on different samples.

    Screenshot (testtrain.jpg): tree test/train error, 5-fold X-Val

    best regards,

    Christopher

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist

    Hi Christopher,

     

    I think what you want is a chart with the "complexity" of the tree (most likely min_gain) on the x-axis and the train/test performance on the y-axis. This can be achieved by using a Log operator on both sides of the X-Val. You probably want to use Log to Data and average over the k folds of the X-Val.
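    As an illustration of the same chart outside RapidMiner (all values are placeholders; scikit-learn's min_impurity_decrease stands in for the minimal gain):

```python
# Sketch of that chart outside RapidMiner: validation_curve averages train and
# test accuracy over the CV folds for each value of the complexity parameter.
# min_impurity_decrease stands in for the minimal gain; values are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=450, n_features=20, random_state=0)
min_gains = np.array([0.0, 0.005, 0.01, 0.02, 0.05, 0.1])

train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="min_impurity_decrease", param_range=min_gains,
    cv=StratifiedKFold(5, shuffle=True, random_state=0), scoring="accuracy")

for g, tr, te in zip(min_gains, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    # A large train/test gap at small minimal gain hints at overtraining.
    print(f"min gain {g:.3f}: train {tr:.3f}  test {te:.3f}")
```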

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • cwerning Member Posts: 10 Contributor II

    Hi Martin,

     

    Yes, that is exactly what I want to achieve. But why should I use the minimal gain on the x-axis? The minimal gain is a fixed value based on the parameter settings of the tree. Or do you want to loop over different values for the minimal gain? Otherwise there is no development of the curve, or am I wrong?

     

    Best regards,

    Christopher

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist

    Yep,

     

    I would loop over the minimal gain. In my experience it is the most important prepruning parameter.

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany