
Heritage Health: problems creating a useable Random Forest model

joei005 Member Posts: 3 Contributor I
edited November 2018 in Help
Hello,

I am having problems developing a useable Random Forest model in the RapidMiner GUI.

The dataset is from the Heritage Healthcare contest. It has approximately 144 attributes and over 70k examples.

The datatypes are mostly numeric and binomial. The label is numeric.

I am new to the RapidMiner GUI and am trying to create a simple Random Forest model.

The process is straightforward. It reads in a .csv file, sets the roles, discretizes the numeric label using 10 bins, splits the process into modeling and validation, and writes out the model.

When I initially ran the process, all the trees contained one node with a range for the predicted value of negative infinity to 0.278.

When I turned off pruning and pre-pruning, the process failed with an error message of "cannot clone example set".

When I turned off pre-pruning BUT turned on pruning, the process didn't fail but didn't produce better results. When I switched the algorithm type to gini_variance, the model produced trees with multiple nodes.

However, when I checked the performance of the model from the validation process, the model predicts only the range negative infinity to 0.287. The performance operator indicates that this gives an 84% performance.
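A score like this is what you would expect if one bin holds most of the examples. The sketch below is not the RapidMiner process; it is a small plain-Python illustration on synthetic, skewed data (the exponential distribution, the sample size, and all function names are assumptions made for the example). It assumes the 10 bins were equal-width and compares that against equal-frequency bins, showing how much accuracy a model gets by always predicting the largest bin.

```python
import random
from collections import Counter


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Clamp so that the maximum value lands in the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins bins holding equally many examples."""
    order = sorted(range(len(values)), key=values.__getitem__)
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def majority_baseline(bins):
    """Accuracy of a model that always predicts the most common bin."""
    counts = Counter(bins)
    return counts.most_common(1)[0][1] / len(bins)


# Synthetic, heavily skewed label (an assumption, not the contest data).
random.seed(0)
labels = [random.expovariate(1.0) for _ in range(10_000)]

width_bins = equal_width_bins(labels, 10)
freq_bins = equal_frequency_bins(labels, 10)

width_baseline = majority_baseline(width_bins)
freq_baseline = majority_baseline(freq_bins)

print(f"equal-width bins, majority-bin accuracy:     {width_baseline:.2f}")
print(f"equal-frequency bins, majority-bin accuracy: {freq_baseline:.2f}")
```

If the real label is skewed in a similar way, a high accuracy from a model that predicts a single range says more about the bin sizes than about the model, so it may be worth checking the label distribution after discretization.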

Do you know how to modify the model so that more ranges are used in the prediction?

I lowered the gain needed to create a new node to 0.05 and decreased the confidence level from 0.25 to 0.05.

Thanks!

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    The Random Forest has a lot of parameters that need to be optimized. As always in data mining, patience is your friend :) The Optimize Parameters or Loop Parameters operator, in combination with a Log operator, will greatly ease the job of finding good parameters. In addition, you may want to try a different implementation of the Random Forest, such as W-Random Forest from the Weka extension, and also try completely different algorithms such as SVM or, as a quick shot, Naive Bayes.

    Just experiment with the possibilities ;)

    Best, Marius
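For readers who have not used Optimize Parameters together with a Log operator, the idea is roughly a loop over every parameter combination that records one score per run. The following is a minimal plain-Python sketch of that idea, not RapidMiner's implementation. The parameter names `minimal_gain` and `confidence`, their candidate values, and the `evaluate` function are hypothetical; `evaluate` is a toy stand-in for "train the model and measure validation performance".

```python
from itertools import product


def evaluate(minimal_gain, confidence):
    """Placeholder score (hypothetical). In practice this would train a model
    with the given parameters and return its validation performance."""
    return 1.0 - abs(minimal_gain - 0.01) - abs(confidence - 0.1)


def grid_search(score_fn, grid):
    """Try every combination in the grid, log each run, return the best one."""
    names = list(grid)
    log = []
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        log.append({**params, "score": score_fn(**params)})
    best = max(log, key=lambda row: row["score"])
    return best, log


# Candidate values are assumptions chosen for illustration only.
grid = {
    "minimal_gain": [0.0, 0.01, 0.05, 0.1],
    "confidence": [0.05, 0.1, 0.25],
}

best, log = grid_search(evaluate, grid)

for row in log:
    print(row)
print("best:", best)
```

Keeping the full log, rather than only the best row, makes it easier to see which parameters actually move the score and which ones barely matter.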