Random Forest keeps getting worse

MarkusW · October 2021

Hi, I'm comparing the results of different methods, using the design from this tutorial:

https://academy.rapidminer.com/learn/video/automatic-classification-of-documents

Due to the runtime I initially set Random Forest to only 10 trees and only 5-fold cross-validation (Naive Bayes and Decision Tree were run with 10-fold).

This gave NB an F-score of 66.6, DT 67.8 and RF 62. I attributed RF's worse performance to it being only 5-fold, so I ran it with 10-fold.

This time Random Forest had an f-score of 46.5!

How did that happen? Do I have to set RF up differently?

BalazsBarany · October 2021

How many examples do you have? Try sampling and check with Naive Bayes and Decision Tree if the result stays the same.
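To illustrate the sampling suggestion: the idea is to rerun the learners on a smaller, class-balanced subset so each run finishes quickly, and to check whether Naive Bayes and Decision Tree still score about the same. In RapidMiner this would be done with a sampling operator rather than code; the Python below is only a minimal sketch of the idea. The function name `stratified_sample`, the 10% fraction and the toy data are assumptions for illustration, not part of the original process.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_of, fraction, seed=42):
    """Return about `fraction` of the rows, sampled separately per label
    so that the class proportions of the full set are preserved."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    sample = []
    for group in by_label.values():
        k = max(1, round(len(group) * fraction))  # keep at least one row per class
        sample.extend(rng.sample(group, k))
    rng.shuffle(sample)  # avoid the sample being ordered by label
    return sample


# Hypothetical toy data: (text, label) pairs, not the dataset from this thread.
rows = [(f"doc {i}", "yes" if i % 2 == 0 else "no") for i in range(1000)]
subset = stratified_sample(rows, label_of=lambda r: r[1], fraction=0.1)
print(len(subset))
```

If the scores on the subset are close to those on the full data, the subset is probably large enough to experiment with more trees at a tolerable runtime.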

With Random Forest, try 20 or 30 trees and look at the models. If they make sense, the problem might be a good match for tree-based methods. If not, try other learners.

MartinLiebig · October 2021

10 trees aren't many. Maybe it's just good or bad luck?

MarkusW · October 2021

Possible, but how the hell does 10-fold not only perform worse than 5-fold, but also below 50%?

It's a yes or no question, so 50% would be just guessing randomly on a dataset with the same number of positives and negatives!
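One thing worth checking here: the F-score is not the same as accuracy, so an F-score below 50% does not by itself mean the model is doing worse than a coin flip. The sketch below uses made-up confusion-matrix counts (an assumption for illustration, not results from this thread) on a balanced set to show a model whose accuracy is above chance while its F-score is below 50%.

```python
def f1_and_accuracy(tp, fp, fn, tn):
    """Compute the F1 score of the positive class and the overall accuracy
    from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return f1, accuracy


# Hypothetical counts: 100 positives and 100 negatives, and a model
# that rarely predicts "yes" (low recall, decent precision).
f1, acc = f1_and_accuracy(tp=30, fp=10, fn=70, tn=90)
print(f"F1 = {f1:.3f}, accuracy = {acc:.3f}")
```

Comparing accuracy, precision and recall side by side for the two Random Forest runs may show whether the drop comes from the model predicting the positive class much less often.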

BalazsBarany · October 2021

If the models are overfitted, you can get results worse than random chance. But usually Random Forest with enough trees is not very prone to overfitting.

You should try a higher number of trees and look at the models. If the problem is not a good fit for trees at all, you might also get bad results from Random Forest.

MarkusW · October 2021

Thanks.

I do have the issue of runtime, so what should I try, as a minimum?

I've tried with the default 100 trees but it left my laptop completely unusable for 4 hours, before I aborted it, having not even passed 5 folds.

MarkusW · October 2021

The dataset contains 9386 lines, if that's what you mean by the number of examples.

What do you mean by sampling? As I've said, I've already run Naive Bayes and Decision Tree on the same dataset with relative success (they were about as effective as I expected given the effort).

I've looked at the Decision Tree; it just looks for the presence of certain words or phrases, so I thought that a few of the trees in Random Forest would check these same phrases/words, while other trees would check for additional factors.

BalazsBarany · October 2021

Hi!

Are you doing text mining? Do you have a very large number of attributes? Tree-based models are really slow with those. Try a Support Vector Machine and optimize the C parameter; it should be faster and better.

The problem with Random Forest is that the attributes are randomly selected. If you have only a few relevant words, they won't end up in many models.

Regards,
Balázs

MarkusW · October 2021

Yes, I am doing text mining (sarcasm detection, to be precise). It's the same project you've helped me with a few times already.

Do I understand it right, that the Decision Tree doing well was as much random chance as Random Forest being so bad now? Because throughout the process it just picks random words or phrases and checks the correlation?

I think I will try an SVM, too, though it'll be harder to explain in the paper how those work.

MarkusW · October 2021

A question about SVM: I tried running the SVM operator overnight, set up the same way I've been using Naive Bayes, Decision Tree and Random Forest.

However it aborted without giving results. Is there something I have to take into consideration regarding input or output?

BalazsBarany · October 2021

The regular decision tree has access to all attributes at all stages of searching for the best decisions (splits). It will find the relevant attributes that are in your data; that's the entire point of decision-tree-based methods. So it might work well for text mining, but it's really slow because it has to check every attribute as a possible decision point (and with numeric attributes it's even more complicated).

The random forest intentionally selects a random subset of attributes and examples when building the trees. The idea is that this makes models more robust, and this works in many cases. In text mining, however, you have thousands of attributes and only a few of them might be relevant for your use case. A random forest that randomly excludes these attributes from consideration will be worse than one decision tree working on all attributes.
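To put a rough number on this effect, the sketch below computes the probability that a uniformly random attribute subset contains at least one relevant attribute. All figures are assumptions for illustration: 5000 attributes, 10 relevant ones, and a subset size of int(log(m)) + 1, a common heuristic that may not match the settings of the actual operator. Whether the subset is drawn once per tree or again at every split also depends on the implementation.

```python
from math import comb, log


def chance_of_seeing_relevant(n_attributes, n_relevant, subset_size):
    """Probability that a uniformly random subset of `subset_size` attributes
    contains at least one of the `n_relevant` relevant attributes."""
    # Ways to pick only irrelevant attributes, divided by all possible subsets.
    none_relevant = (comb(n_attributes - n_relevant, subset_size)
                     / comb(n_attributes, subset_size))
    return 1.0 - none_relevant


n_attributes = 5000   # hypothetical vocabulary size after text processing
n_relevant = 10       # hypothetical number of words that actually matter
subset_size = int(log(n_attributes)) + 1  # assumed subset-size heuristic

p = chance_of_seeing_relevant(n_attributes, n_relevant, subset_size)
print(f"subset size: {subset_size}, chance of a relevant attribute: {p:.2%}")
```

Under these assumptions only a small fraction of the random subsets contain a relevant word at all, which is why a small forest can end up worse than a single tree that sees every attribute. A larger subset ratio raises that chance, at the cost of runtime.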

SVM can only work with numerical attributes, but you should get a warning from RapidMiner if this doesn't work. If an operator just exits without giving results, check the connections and set a breakpoint before and after the execution (F7 and Shift+F7 or from the right-click context menu).

Regards,
Balázs

MarkusW · October 2021

Okay... Why did you recommend SVM then?

It's definitely possible that there was a warning and I simply skipped it when I went back to my laptop after starting the run the night before.

BalazsBarany · October 2021

SVM is in my experience good and fast on text mining example sets (or generally on example sets with a lot of numeric attributes).

In text mining you usually only have numeric attributes. The attributes generated by Process Documents from X are numeric.
If you have remaining nominals, you need to transform them, e.g. using Nominal to Numerical.

Regards,
Balázs
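For readers unfamiliar with what a nominal-to-numerical transformation does, here is a minimal sketch of dummy coding, one common way to turn a nominal attribute into numeric ones that an SVM can use. The helper `dummy_code`, the column names and the toy rows are hypothetical; this illustrates the idea and is not the RapidMiner operator's implementation.

```python
def dummy_code(rows, column):
    """Replace one nominal column with a 0/1 numeric column per distinct value."""
    values = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other attribute unchanged.
        new_row = {key: val for key, val in row.items() if key != column}
        for value in values:
            new_row[f"{column} = {value}"] = 1.0 if row[column] == value else 0.0
        encoded.append(new_row)
    return encoded


# Hypothetical example set with one leftover nominal attribute ("source").
rows = [
    {"source": "forum", "length": 12.0},
    {"source": "news", "length": 40.0},
    {"source": "forum", "length": 7.0},
]
for row in dummy_code(rows, "source"):
    print(row)
```

After this step every attribute is numeric, so a learner that only accepts numerical input no longer has a reason to reject the example set.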
