No decision tree created with the "criterion" parameter set to "gini_index"

lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 747   Unicorn
edited December 2018 in Help

Good morning,

I used the "Decision Tree" operator to create a model from a training dataset.

With the "criterion" parameter set to "gini_index", no decision tree is created in the results: none of the attributes are taken into account.

When "criterion" is set to "accuracy", "gain_ratio" or "information_gain", the decision trees are created correctly.

My training and scoring datasets are in the attached files.

Here is my process in XML:

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Training" width="90" x="112" y="34">
<parameter key="repository_entry" value="//DataMiningForTheMasses/data/Chapter10DataSet_Training"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role" width="90" x="246" y="34">
<parameter key="attribute_name" value="User_ID"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (3)" width="90" x="380" y="34">
<parameter key="attribute_name" value="eReader_Adoption"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="514" y="34">
<parameter key="criterion" value="gini_index"/>
<parameter key="maximal_depth" value="20"/>
<parameter key="apply_pruning" value="true"/>
<parameter key="confidence" value="0.25"/>
<parameter key="apply_prepruning" value="true"/>
<parameter key="minimal_gain" value="0.1"/>
<parameter key="minimal_leaf_size" value="2"/>
<parameter key="minimal_size_for_split" value="4"/>
<parameter key="number_of_prepruning_alternatives" value="3"/>
</operator>
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Scoring" width="90" x="112" y="238">
<parameter key="repository_entry" value="//DataMiningForTheMasses/data/Chapter10DataSet_Scoring"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (2)" width="90" x="313" y="238">
<parameter key="attribute_name" value="User_ID"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model" width="90" x="715" y="136">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<connect from_op="Training" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Set Role (3)" to_port="example set input"/>
<connect from_op="Set Role (3)" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Scoring" from_port="output" to_op="Set Role (2)" to_port="example set input"/>
<connect from_op="Set Role (2)" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
<connect from_op="Apply Model" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>

Is this a bug?

Can you help me?

Thank you,

Lionel

Best Answers

  • earmijo Posts: 263   Unicorn
    Solution Accepted

    Try unchecking the Apply Pre-pruning setting.

    [Screenshot: Screen Shot 2017-11-12 at 3.25.20 PM.png]

  • earmijo Posts: 263   Unicorn
    Solution Accepted

    Let me add a couple of sentences to Thomas_Ott's answer. I was confused myself when I started using RapidMiner.

    You can find a clear explanation of both pruning and pre-pruning here:

    Machine Learning: Pruning Decision Trees

    You should experiment with all the variations in your process.

    Pre-pruning (early stopping): You stop splitting if no significant benefit results from an additional split.

    Pruning (post-pruning): You keep splitting until you reach the desired number of levels (depth is the main measure of the tree's complexity), then you try to simplify the tree afterwards.

    Neither pre-pruning nor pruning: Try it. The tree will grow symmetrically until it reaches the desired number of levels (depth).

    If processing time is not an issue, there is no reason to ever use the pre-pruning option. In the worst case, you'll end up with the same performance metric, but there is a chance (a real one, as your example illustrates) that you'll do better with post-pruning.
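
To make the early-stopping idea concrete, here is a minimal Python sketch of a pre-pruning check. It is an illustration, not RapidMiner's actual implementation: the class counts are invented (they are not taken from the attached dataset), and it assumes the gain of a candidate split is compared directly against a threshold like the `minimal_gain` value of 0.1 in the posted process. It shows one plausible reason a root-only tree can appear with "gini_index" but not with "information_gain": Gini impurity is bounded by 1 - 1/k for k classes, while entropy can reach log2(k), so the same split tends to produce a numerically smaller Gini gain.

```python
from math import log2

def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy (in bits) of a node given its per-class counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def split_gain(impurity, parent, children):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

# Assumption: same threshold as minimal_gain in the posted process.
MINIMAL_GAIN = 0.1

# Hypothetical four-class node and one candidate binary split of it.
parent = [40, 30, 20, 10]
children = [[30, 10, 5, 5], [10, 20, 15, 5]]

gini_gain = split_gain(gini, parent, children)
info_gain = split_gain(entropy, parent, children)

# Pre-pruning keeps a split only if its gain reaches the threshold.
print(f"gini gain: {gini_gain:.3f}, split allowed: {gini_gain >= MINIMAL_GAIN}")
print(f"info gain: {info_gain:.3f}, split allowed: {info_gain >= MINIMAL_GAIN}")
```

With these made-up counts the Gini gain is about 0.06 and the information gain about 0.14, so the same split is rejected under one criterion and accepted under the other. If something similar happens at the root of the real data, lowering the minimal gain or disabling pre-pruning would both let the tree grow.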

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761   Unicorn

    @lionelderkrikor you also have to understand that each criterion splits the dataset into a tree in a different way. It might be that gini_index is not a good criterion for splitting your data.

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 747   Unicorn

    Hi @earmijo,

    After unchecking Apply Pre-pruning, a decision tree is created correctly in my case.

    I'm a beginner in RapidMiner and data science: can you explain the purpose of checking "Pre-pruning"? In which cases should I check this option, and in which should I leave it unchecked?

    In my case, when it is checked (with all related parameters at their default values), the tree has only one node, whose conclusion is the majority class of the training set (this is a four-class label problem; see the attached file). So when applied, this model predicts that single class for the entire scoring dataset.

    Thank you,

    Lionel

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761   Unicorn
    Pruning and pre-pruning are ways to reduce the overall complexity of the tree. The more complex the tree gets, the more it can overfit your data. Decision trees are notorious for overfitting (or for being abused to overfit). Pruning helps reduce (not eliminate) the possibility of overfitting.
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 747   Unicorn

    Hi Thomas,

    Thank you for your explanation. I now understand the role of these options better.

    Regards,

    Lionel

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 747   Unicorn

    Hi @earmijo,

    Thank you for your feedback and your resources about decision trees.

    If I understand correctly, I must be very careful when using decision trees:

    I have to try all combinations [criterion, apply / do not apply pruning, apply / do not apply pre-pruning] and evaluate the accuracy of the resulting models using a split validation to select the best model.

    Regards,

    Lionel
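
As a rough illustration of the "try all combinations" plan, the following Python sketch simply enumerates the settings to evaluate. The criterion names are the ones mentioned in this thread and the parameter keys are copied from the posted process XML; the `grid` variable itself is hypothetical and nothing here calls RapidMiner.

```python
from itertools import product

# Criterion values mentioned in this thread.
criteria = ["gain_ratio", "information_gain", "gini_index", "accuracy"]

# Every combination of criterion, pruning on/off and pre-pruning on/off.
grid = [
    {"criterion": c, "apply_pruning": pruning, "apply_prepruning": prepruning}
    for c, pruning, prepruning in product(criteria, [True, False], [True, False])
]

for settings in grid:
    print(settings)
```

That gives 4 x 2 x 2 = 16 candidate configurations, each of which would then be scored with the same validation scheme so the results are comparable.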

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,107  RM Data Scientist

    Hi,

    I would be careful with a simple split validation and would rather use a cross-validation (X-Validation) with a proper hold-out set.

    Best,

    Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
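
The idea of cross-validation plus a hold-out set can be sketched in plain Python as follows. This is only an outline of the data partitioning, not the RapidMiner operator: the function name `holdout_and_folds`, the 20% hold-out fraction and the choice of 5 folds are assumptions made for the example, and the seed merely reuses the `random_seed` value from the posted process.

```python
import random

def holdout_and_folds(n_rows, holdout_fraction=0.2, k=5, seed=2001):
    """Set aside a hold-out set, then build k train/test splits from the rest."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Rows never seen during model selection, used once for the final estimate.
    n_holdout = int(n_rows * holdout_fraction)
    holdout = indices[:n_holdout]
    rest = indices[n_holdout:]

    # Deal the remaining rows into k folds of near-equal size.
    folds = [rest[i::k] for i in range(k)]

    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return holdout, splits

holdout, splits = holdout_and_folds(100)
for train, test in splits:
    print(len(train), len(test))
```

Each candidate configuration would be trained and scored on every (train, test) pair, the configuration with the best average score would be selected, and only that one would be checked against the hold-out rows.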
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 747   Unicorn

    Hi @mschmitz,

    Thank you for your advice: I'll use an X-Validation on the models.

    Regards,

    Lionel
