Really sorry about such a stupid question, I am very new!
I want to do cross-validation with a decision tree. Here is my dataset:
Here is the setup I have:
Sorry, the rest of my first post was cut off
And here is the setup inside the X-validation:
When I run this, it seems to be doing nothing at all except summarising my data pretty much - here are some screenshots of the results:
Can anyone help me figure out what is going on here? Is it something to do with how my attributes are set up in terms of their roles, etc.?
Basically, it looks like the model is predicting that everything is YES, so it is a very incorrect model.
Is it possible for you to share the data?
Are you applying pruning on the decision tree? Try without it, or try changing the confidence values.
You may also try some other models, since decision tree does not seem to be getting close.
Keep in mind that some learners can only predict binominal values, some polynominal values, and some numbers, and there are also limitations on the data types that can be used as input variables.
You can use RM operators to tweak data to meet those, but plan accordingly.
Some other questions to ask: does the data need to be balanced too? Is there any feature generation that can be done?
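A quick way to check whether balance is the issue is to compare the reported accuracy against what you would get by always predicting the most frequent class. Here is a minimal Python sketch of that check, outside RapidMiner. The label list is made up for illustration (90 YES, 10 NO) and is not the poster's data; `majority_baseline` is a hypothetical helper name.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Hypothetical label column, NOT the real dataset from this thread.
labels = ["YES"] * 90 + ["NO"] * 10

label, accuracy = majority_baseline(labels)
print(f"Always predicting {label} gives accuracy {accuracy:.2f}")
```

If the cross-validated accuracy is about the same as this baseline, the tree has probably learned nothing beyond the class distribution.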
@Thomas_Ott raises a good point: unbalanced data is an interesting and very frequent problem in classification.
As you found empirically, a training set consisting of different numbers of representatives from either class may result in a classifier that is biased towards the majority class. When applied to a test set that is similarly imbalanced, the classifier (not only decision trees) yields an optimistic accuracy estimate. In an extreme case (just like your example), the classifier might assign every single test case to the majority class, thereby achieving an accuracy equal to the proportion of test cases belonging to the majority class. Some strategies for learning from unbalanced data:
1. Under sampling,
by removing samples from the majority class using an undersampling algorithm, for instance using an absolute-sized Sample to balance the data with a specified sample size per class in RapidMiner
2. Over sampling,
by generating new samples from the minority class using an oversampling algorithm, for instance Bootstrapping Sample in RapidMiner
3. Cost-sensitive learning,
by changing the decision tree building algorithm so that misclassifications of minority class samples have a higher cost than misclassifications of majority class samples. MetaCost in RapidMiner is a good choice. Please refer to the built-in tutorial process "Using the MetaCost operator for generating a better Decision Tree".
4. Ensemble learning,
by trying to use several decision trees instead of a single decision tree. Check out the Bagging algorithm in RapidMiner for bootstrap aggregating decision tree models. In our latest release, RapidMiner 7.2, try Gradient Boosted Trees (IngoRM) and say hello to our favourite learners.
5. Combined strategies,
by combining undersampling, oversampling, and ensemble learning strategies. Most state-of-the-art learning methods for imbalanced data use a combination of different strategies. Choose the one that is best for you. I would recommend considering at least two of the mentioned approaches in conjunction.
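To make the two sampling strategies concrete, here is a rough plain-Python sketch of random undersampling and random oversampling (sampling with replacement). It is not what the RapidMiner operators do internally, just an illustration of the idea. The function names, the `label_of` argument, and the example rows are all assumptions made up for this sketch.

```python
import random
from collections import Counter, defaultdict


def group_by_class(rows, label_of):
    """Group rows into lists keyed by their class label."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    return groups


def undersample(rows, label_of, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    groups = group_by_class(rows, label_of)
    target = min(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


def oversample(rows, label_of, seed=0):
    """Pad every class up to the largest class size by drawing with replacement."""
    rng = random.Random(seed)
    groups = group_by_class(rows, label_of)
    target = max(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical rows of (id, label): 90 YES and 10 NO.
rows = [(i, "YES") for i in range(90)] + [(i, "NO") for i in range(90, 100)]

under = undersample(rows, label_of=lambda r: r[1])
over = oversample(rows, label_of=lambda r: r[1])
print(Counter(label for _, label in under))
print(Counter(label for _, label in over))
```

Whichever strategy you pick, resample only the training data inside each cross-validation fold and leave the test fold untouched, otherwise the performance estimate no longer reflects the real class distribution.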
We would be happy to post some additional references to the literature if you would like to follow up on this.