Random Forest keeps getting worse
Hi, I'm comparing the results of different methods using the design from this tutorial:
Due to the runtime I initially set Random Forest to only 10 trees and only 5-fold cross-validation (Naive Bayes and Decision Tree were run with 10-fold).
This gave NB an f-score of 66.6, DT 67.8 and RF 62. I attributed RF's worse performance to the 5-fold cross-validation, so I reran it with 10-fold.
This time Random Forest had an f-score of 46.5!
How did that happen? Do I have to set RF up differently?
Best Answer
BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert
How many examples do you have? Try sampling and check with Naive Bayes and Decision Tree whether the result stays the same.
With Random Forest, try 20 or 30 trees and look at the models. If they make sense, the problem might be a good match for tree-based methods. If not, try other learners.
Answers
You should try a higher number of trees and look at the models. If the problem is not a good fit for trees at all, you might also get bad results from Random Forest.
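To illustrate why a forest of only 10 trees can be unstable, here is a minimal sketch in plain Python. It is not how RapidMiner computes anything: it assumes a binary problem, an odd number of trees, and trees that vote independently with the same accuracy `p` (real trees in a forest are correlated, so the actual gain is smaller). The function name `majority_vote_accuracy` and the value `p = 0.6` are made up for illustration.

```python
from math import comb


def majority_vote_accuracy(p: float, n_trees: int) -> float:
    """Probability that a majority of n_trees independent voters is correct,
    when each voter is correct with probability p (binary problem)."""
    if n_trees % 2 == 0:
        raise ValueError("use an odd number of trees to avoid ties")
    needed = n_trees // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_trees, k) * p**k * (1 - p) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )


# Assumed per-tree accuracy of 0.6; compare different forest sizes.
for n in (1, 11, 21, 31):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions the vote only helps when the individual trees are better than chance. If the trees themselves are poor, adding more of them does not rescue the forest, which is why it is worth looking at the models.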
Are you doing text mining? Do you have a very large number of attributes? Tree-based models are really slow with those. Try a Support Vector Machine and optimize the C parameter; it should be faster and give better results.
The problem with Random Forest is that the attributes are randomly selected. If you have only a few relevant words, they won't end up in many models.
Regards,
Balázs
The random forest intentionally selects a random subset of attributes and examples when building the trees. The idea is that this makes models more robust, and this works in many cases. In text mining, however, you have thousands of attributes and only a few of them might be relevant for your use case. A random forest that randomly excludes these attributes from consideration will be worse than one decision tree working on all attributes.
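The effect can be put into numbers with a small sketch. It assumes the attribute subset is drawn uniformly at random without replacement and asks how likely it is that the subset contains at least one relevant attribute. The function name and the figures (5000 attributes, 5 relevant ones, the subset sizes) are invented for illustration and are not RapidMiner defaults.

```python
from math import comb


def prob_sees_relevant(n_attributes: int, n_relevant: int, subset_size: int) -> float:
    """Probability that a uniformly random subset of subset_size attributes
    contains at least one of the n_relevant attributes."""
    # Number of subsets that miss every relevant attribute, divided by
    # the number of all subsets of that size.
    missed = comb(n_attributes - n_relevant, subset_size) / comb(n_attributes, subset_size)
    return 1.0 - missed


# Hypothetical text mining setup: 5000 word attributes, 5 of them relevant.
for k in (9, 70, 500):
    print(k, round(prob_sees_relevant(5000, 5, k), 3))
```

With thousands of attributes and a small subset, this probability stays low, so most random subsets never get to use the informative words. A single decision tree that sees all attributes does not have this handicap.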
SVM can only work with numerical attributes, but you should get a warning from RapidMiner if this doesn't work. If an operator just exits without giving results, check the connections and set a breakpoint before and after the execution (F7 and Shift+F7 or from the right-click context menu).
Regards,
Balázs
In text mining you usually only have numeric attributes. The attributes generated by Process Documents from X are numeric.
If you have remaining nominals, you need to transform them, e.g. using Nominal to Numerical.
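As a rough idea of what such a transformation does, here is a plain Python sketch of dummy coding, where each value of a nominal attribute becomes its own 0/1 column. This is not the operator's implementation; the function name `dummy_code`, the column naming scheme and the example data are made up.

```python
def dummy_code(rows, attribute):
    """Replace one nominal attribute with 0/1 indicator columns,
    one column per distinct value."""
    values = sorted({row[attribute] for row in rows})
    encoded = []
    for row in rows:
        # Copy every attribute except the nominal one being replaced.
        new_row = {k: v for k, v in row.items() if k != attribute}
        for value in values:
            new_row[f"{attribute} = {value}"] = 1 if row[attribute] == value else 0
        encoded.append(new_row)
    return encoded


# Invented examples: one numeric and one nominal attribute.
examples = [
    {"length": 120, "source": "blog"},
    {"length": 45, "source": "forum"},
    {"length": 80, "source": "blog"},
]
encoded = dummy_code(examples, "source")
for row in encoded:
    print(row)
```

After this step every attribute is numeric, which is what an SVM needs.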
Regards,
Balázs