{Solved} Strange problem with decision trees

blueearth Member Posts: 42 Contributor II
edited November 2018 in Help
Hi, I have a really strange problem with decision tree models on my data.
My classes are: Reston (7), Zaire (7), Sudan (4), Bundibugyo (2), Cote d'Ivoire (1)
But when I run a decision tree model, I get strange results.
For example, I got a model that looked correct in the graph view, but when I switched to the text view I saw this:

Tree
CountofAlaGly > 5: Zaire {Reston=0, Zaire=6, Sudan=0, Bundibugyo=0, Cote d'Ivoire=0}
CountofAlaGly ≤ 5
|   CountofIleAsn > 2.500: Cote d'Ivoire {Reston=0, Zaire=0, Sudan=0, Bundibugyo=0, Cote d'Ivoire=2}
|   CountofIleAsn ≤ 2.500
|   |   CountofIleAsn > 1.500: Sudan {Reston=0, Zaire=0, Sudan=2, Bundibugyo=0, Cote d'Ivoire=0}
|   |   CountofIleAsn ≤ 1.500
|   |   |   CountofLeuThr > 4.500: Reston {Reston=8, Zaire=0, Sudan=0, Bundibugyo=0, Cote d'Ivoire=0}
|   |   |   CountofLeuThr ≤ 4.500: Bundibugyo {Reston=0, Zaire=0, Sudan=0, Bundibugyo=3, Cote d'Ivoire=0}

It's so strange: as you can see, the model has mixed things up. I have 7 Zaire examples, but the model says I have only six; I have just one Cote d'Ivoire example, but the model shows two; and so on.
Can someone explain what I should do?
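
One quick way to see what exactly is off in the text view is to total the per-class counts printed in the leaves and compare them with the known class sizes. The following is a minimal sketch in plain Python, not RapidMiner code; the leaf strings and class sizes are copied from the post, while the helper name `parse_leaf` and the variable names are made up for illustration.

```python
import re
from collections import Counter

# Known class sizes, as stated in the post.
expected = {"Reston": 7, "Zaire": 7, "Sudan": 4, "Bundibugyo": 2, "Cote d'Ivoire": 1}

# Class counts of the five leaves, copied from the text view of the tree.
leaves = [
    "{Reston=0, Zaire=6, Sudan=0, Bundibugyo=0, Cote d'Ivoire=0}",
    "{Reston=0, Zaire=0, Sudan=0, Bundibugyo=0, Cote d'Ivoire=2}",
    "{Reston=0, Zaire=0, Sudan=2, Bundibugyo=0, Cote d'Ivoire=0}",
    "{Reston=8, Zaire=0, Sudan=0, Bundibugyo=0, Cote d'Ivoire=0}",
    "{Reston=0, Zaire=0, Sudan=0, Bundibugyo=3, Cote d'Ivoire=0}",
]


def parse_leaf(text):
    """Turn '{A=1, B=2}' into {'A': 1, 'B': 2} (hypothetical helper)."""
    pairs = re.findall(r"([^{},=]+)=(\d+)", text)
    return {name.strip(): int(count) for name, count in pairs}


# Sum the counts of all leaves per class.
leaf_totals = Counter()
for leaf in leaves:
    leaf_totals.update(parse_leaf(leaf))

# Positive value: the tree shows more examples of that class than the data has.
difference = {name: leaf_totals[name] - expected[name] for name in expected}

print("examples in data:  ", sum(expected.values()))
print("examples in leaves:", sum(leaf_totals.values()))
for name, diff in difference.items():
    print(f"{name}: expected {expected[name]}, tree shows {leaf_totals[name]} ({diff:+d})")
```

Both totals come to 21, so no examples are missing from the tree; only the per-class assignment differs. That would fit a labeling or display issue better than lost data, although it does not prove the cause.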

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hm, might be a problem with the internal nominal mapping. Does the tree work correctly apart from the strange display (do classification results make sense)?
    Which tree operator are you using?

    Best, Marius
  • blueearth Member Posts: 42 Contributor II
    I'm using 4 operators with 4 criteria each. The operators are Decision Tree, Decision Tree (Parallel), Random Forest, and Decision Stump, and some of them were not able to draw a tree with all 5 classes and had just 4 of them.
    I can't say whether this classification is right or wrong. It might make sense and could be discussed, but it needs lab confirmation, which is impossible for me.
    I have used cross-validation to get average performances.
    Does this affect my other operators, such as SVM and Bayesian? Is this a problem with my data?
    How should I solve this problem?
    Thanks a lot

  • blueearth Member Posts: 42 Contributor II
    Please, any suggestions for this problem?
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    I suppose by "criteria" you mean attributes?
    In any case, Decision Tree and Decision Tree (Parallel) use the same algorithm; the parallel tree just uses several threads (and thus several CPUs) to calculate the tree.

    What about your 5 classes? Are they equally sized, or does one of them contain significantly fewer examples than the others? If so, the trees may simply drop that class because they don't consider it worth a branch of its own.
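
To make the class-size question concrete, here is a small sketch in plain Python (not RapidMiner) that computes each class's share from the sizes given in the first post and flags classes that are too small to fill a leaf on their own. The threshold `min_leaf_size = 2` is an assumption for illustration only; check the minimal leaf size that your tree operator actually uses.

```python
# Class sizes as given in the first post.
class_sizes = {"Reston": 7, "Zaire": 7, "Sudan": 4, "Bundibugyo": 2, "Cote d'Ivoire": 1}

# Assumed threshold for illustration only; look up the real value
# in the parameters of the tree operator you are using.
min_leaf_size = 2

total = sum(class_sizes.values())
shares = {name: size / total for name, size in class_sizes.items()}

# A class with fewer examples than the minimum leaf size
# can never get a pure leaf of its own.
too_small = [name for name, size in class_sizes.items() if size < min_leaf_size]

for name, share in shares.items():
    print(f"{name}: {class_sizes[name]} of {total} ({share:.1%})")
print("cannot fill a leaf alone:", too_small)
```

Under that assumption only Cote d'Ivoire is flagged: it has a single example, just under 5% of the 21 examples.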

    Of course the creation of a decision tree is totally independent of an SVM or Naive Bayes - how should it affect an SVM?

    So, all in all I need a bit more information about the data, and as always it would be a good idea to post your process setup - you'll find a description on how to ask good questions in the post linked in my signature.

    Best, Marius
  • blueearth Member Posts: 42 Contributor II
    Sorry, by criteria I mean criterion.
    About the sizes, I explained them in my first post. The sizes are:
    Reston (7), Zaire (7), Sudan (4), Bundibugyo (2), Cote d'Ivoire (1)
    Here is my code, but it is not the whole process. I had to delete some operators because my code was too long to be posted here.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
       <process expanded="true" height="1342" width="1572">
         <operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Chi Squared" width="90" x="45" y="30">
           <parameter key="repository_entry" value="../../Results/Attribute Weighting/Chi Squared"/>
         </operator>
         <operator activated="true" class="replace_missing_values" compatibility="5.2.000" expanded="true" height="94" name="Replace Missing Values" width="90" x="179" y="30">
           <list key="columns"/>
         </operator>
         <operator activated="true" class="rename_by_generic_names" compatibility="5.2.008" expanded="true" height="76" name="Rename by Generic Names" width="90" x="45" y="165">
           <parameter key="attribute_filter_type" value="single"/>
           <parameter key="attribute" value="Accession"/>
           <parameter key="generic_name_stem" value="Chi Squared"/>
         </operator>
         <operator activated="true" class="multiply" compatibility="5.2.008" expanded="true" height="130" name="Multiply" width="90" x="45" y="255"/>
         <operator activated="true" class="x_validation" compatibility="5.1.002" expanded="true" height="112" name="DT Gain Ratio" width="90" x="380" y="30">
           <parameter key="use_local_random_seed" value="true"/>
           <process expanded="true" height="560" width="342">
             <operator activated="true" class="decision_tree" compatibility="5.2.008" expanded="true" height="76" name="Decision Tree" width="90" x="150" y="45"/>
             <connect from_port="training" to_op="Decision Tree" to_port="training set"/>
             <connect from_op="Decision Tree" from_port="model" to_port="model"/>
             <portSpacing port="source_training" spacing="0"/>
             <portSpacing port="sink_model" spacing="0"/>
             <portSpacing port="sink_through 1" spacing="0"/>
           </process>
           <process expanded="true" height="560" width="342">
             <operator activated="true" class="apply_model" compatibility="5.2.008" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
               <list key="application_parameters"/>
             </operator>
             <operator activated="true" class="performance" compatibility="5.2.008" expanded="true" height="76" name="Performance" width="90" x="179" y="30"/>
             <connect from_port="model" to_op="Apply Model" to_port="model"/>
             <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
             <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
             <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
             <portSpacing port="source_model" spacing="0"/>
             <portSpacing port="source_test set" spacing="0"/>
             <portSpacing port="source_through 1" spacing="0"/>
             <portSpacing port="sink_averagable 1" spacing="0"/>
             <portSpacing port="sink_averagable 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="x_validation" compatibility="5.2.008" expanded="true" height="112" name="DT Info Gain" width="90" x="514" y="30">
           <parameter key="use_local_random_seed" value="true"/>
           <process expanded="true" height="506" width="399">
             <operator activated="true" class="decision_tree" compatibility="5.2.008" expanded="true" height="76" name="Decision Tree (2)" width="90" x="112" y="30">
               <parameter key="criterion" value="information_gain"/>
             </operator>
             <connect from_port="training" to_op="Decision Tree (2)" to_port="training set"/>
             <connect from_op="Decision Tree (2)" from_port="model" to_port="model"/>
             <portSpacing port="source_training" spacing="0"/>
             <portSpacing port="sink_model" spacing="0"/>
             <portSpacing port="sink_through 1" spacing="0"/>
           </process>
           <process expanded="true" height="506" width="399">
             <operator activated="true" class="apply_model" compatibility="5.2.008" expanded="true" height="76" name="Apply Model (2)" width="90" x="45" y="30">
               <list key="application_parameters"/>
             </operator>
             <operator activated="true" class="performance" compatibility="5.2.008" expanded="true" height="76" name="Performance (2)" width="90" x="226" y="30"/>
             <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
             <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
             <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
             <connect from_op="Performance (2)" from_port="performance" to_port="averagable 1"/>
             <portSpacing port="source_model" spacing="0"/>
             <portSpacing port="source_test set" spacing="0"/>
             <portSpacing port="source_through 1" spacing="0"/>
             <portSpacing port="sink_averagable 1" spacing="0"/>
             <portSpacing port="sink_averagable 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="x_validation" compatibility="5.2.008" expanded="true" height="112" name="DT Gini Index" width="90" x="380" y="210">
           <parameter key="use_local_random_seed" value="true"/>
           <process expanded="true" height="506" width="399">
             <operator activated="true" class="decision_tree" compatibility="5.2.008" expanded="true" height="76" name="Decision Tree (3)" width="90" x="179" y="48">
               <parameter key="criterion" value="gini_index"/>
             </operator>
             <connect from_port="training" to_op="Decision Tree (3)" to_port="training set"/>
             <connect from_op="Decision Tree (3)" from_port="model" to_port="model"/>
             <portSpacing port="source_training" spacing="0"/>
             <portSpacing port="sink_model" spacing="0"/>
             <portSpacing port="sink_through 1" spacing="0"/>
           </process>
           <process expanded="true" height="506" width="399">
             <operator activated="true" class="apply_model" compatibility="5.2.008" expanded="true" height="76" name="Apply Model (3)" width="90" x="45" y="30">
               <list key="application_parameters"/>
             </operator>
             <operator activated="true" class="performance" compatibility="5.2.008" expanded="true" height="76" name="Performance (3)" width="90" x="226" y="30"/>
             <connect from_port="model" to_op="Apply Model (3)" to_port="model"/>
             <connect from_port="test set" to_op="Apply Model (3)" to_port="unlabelled data"/>
             <connect from_op="Apply Model (3)" from_port="labelled data" to_op="Performance (3)" to_port="labelled data"/>
             <connect from_op="Performance (3)" from_port="performance" to_port="averagable 1"/>
             <portSpacing port="source_model" spacing="0"/>
             <portSpacing port="source_test set" spacing="0"/>
             <portSpacing port="source_through 1" spacing="0"/>
             <portSpacing port="sink_averagable 1" spacing="0"/>
             <portSpacing port="sink_averagable 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="x_validation" compatibility="5.2.008" expanded="true" height="112" name="DT Accuracy" width="90" x="514" y="210">
           <parameter key="use_local_random_seed" value="true"/>
           <process expanded="true" height="506" width="399">
             <operator activated="true" class="decision_tree" compatibility="5.2.008" expanded="true" height="76" name="Decision Tree (4)" width="90" x="158" y="27">
               <parameter key="criterion" value="accuracy"/>
             </operator>
             <connect from_port="training" to_op="Decision Tree (4)" to_port="training set"/>
             <connect from_op="Decision Tree (4)" from_port="model" to_port="model"/>
             <portSpacing port="source_training" spacing="0"/>
             <portSpacing port="sink_model" spacing="0"/>
             <portSpacing port="sink_through 1" spacing="0"/>
           </process>
           <process expanded="true" height="506" width="399">
             <operator activated="true" class="apply_model" compatibility="5.2.008" expanded="true" height="76" name="Apply Model (4)" width="90" x="45" y="30">
               <list key="application_parameters"/>
             </operator>
             <operator activated="true" class="performance" compatibility="5.2.008" expanded="true" height="76" name="Performance (4)" width="90" x="226" y="30"/>
             <connect from_port="model" to_op="Apply Model (4)" to_port="model"/>
             <connect from_port="test set" to_op="Apply Model (4)" to_port="unlabelled data"/>
             <connect from_op="Apply Model (4)" from_port="labelled data" to_op="Performance (4)" to_port="labelled data"/>
             <connect from_op="Performance (4)" from_port="performance" to_port="averagable 1"/>
             <portSpacing port="source_model" spacing="0"/>
             <portSpacing port="source_test set" spacing="0"/>
             <portSpacing port="source_through 1" spacing="0"/>
             <portSpacing port="sink_averagable 1" spacing="0"/>
             <portSpacing port="sink_averagable 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="store" compatibility="5.2.008" expanded="true" height="60" name="Store (26)" width="90" x="246" y="1065">
           <parameter key="repository_entry" value="../../Results/Tree Induction/Tree Performance/chi squared  Random Gain Ratio"/>
         </operator>
         <operator activated="true" class="store" compatibility="5.2.008" expanded="true" height="60" name="Store (30)" width="90" x="246" y="1200">
           <parameter key="repository_entry" value="../../Results/Tree Induction/Tree Performance/chi squared Random Gini index"/>
         </operator>
         <operator activated="true" class="log" compatibility="5.2.008" expanded="true" height="346" name="Log" width="90" x="715" y="30">
           <list key="log">
             <parameter key="DT Accuracy" value="operator.DT Accuracy.value.performance"/>
             <parameter key="DT Gain Ratio" value="operator.DT Gain Ratio.value.performance"/>
             <parameter key="DT Gini Index" value="operator.DT Gini Index.value.performance"/>
             <parameter key="DT Info Gain" value="operator.DT Info Gain.value.performance"/>
             <parameter key="DT Parallel Accuracy" value="operator.DT Parallel Accuracy.value.performance"/>
             <parameter key="DT Parallel Gain Ratio" value="operator.DT Parallel Gain Ratio.value.performance"/>
             <parameter key="DT Parallel Gini Index" value="operator.DT Parallel Gini Index.value.performance"/>
             <parameter key="DT Parallel Info Gain" value="operator.DT Parallel Info Gain.value.performance"/>
             <parameter key="DT Stump Accuracy" value="operator.Stump Accuracy.value.performance"/>
             <parameter key="DT Stump Gain Ratio" value="operator.Stump Gain Ratio.value.performance"/>
             <parameter key="DT Stump Gini Index" value="operator.Stump Gini Index.value.performance"/>
             <parameter key="DT Stump Info Gain" value="operator.Stump Information Gain.value.performance"/>
             <parameter key="DT Random Forest Accuracy" value="operator.Random Accuracy.value.performance"/>
             <parameter key="DT Random Forest Gain Ratio" value="operator.Random Gain Ratio.value.performance"/>
             <parameter key="DT Random Forest Gini Index" value="operator.Random Gini Index.value.performance"/>
             <parameter key="DT Random Forest Info Gain" value="operator.Random Info Gain.value.performance"/>
             <parameter key="Data Base" value="operator.Rename by Generic Names.parameter.generic_name_stem"/>
           </list>
         </operator>
         <operator activated="true" class="log_to_data" compatibility="5.2.008" expanded="true" height="364" name="Log to Data" width="90" x="916" y="30">
           <parameter key="log_name" value="Log"/>
         </operator>
         <operator activated="true" class="store" compatibility="5.2.008" expanded="true" height="60" name="Store" width="90" x="782" y="480">
           <parameter key="repository_entry" value="../../Results/Log/log 1"/>
         </operator>
         <connect from_op="Chi Squared" from_port="output" to_op="Replace Missing Values" to_port="example set input"/>
         <connect from_op="Replace Missing Values" from_port="example set output" to_op="Rename by Generic Names" to_port="example set input"/>
         <connect from_op="Rename by Generic Names" from_port="example set output" to_op="Multiply" to_port="input"/>
         <connect from_op="Multiply" from_port="output 1" to_op="DT Gain Ratio" to_port="training"/>
         <connect from_op="Multiply" from_port="output 2" to_op="DT Info Gain" to_port="training"/>
         <connect from_op="Multiply" from_port="output 3" to_op="DT Gini Index" to_port="training"/>
         <connect from_op="Multiply" from_port="output 4" to_op="DT Accuracy" to_port="training"/>
         <connect from_op="DT Gain Ratio" from_port="averagable 1" to_op="Log" to_port="through 2"/>
         <connect from_op="DT Info Gain" from_port="averagable 1" to_op="Log" to_port="through 1"/>
         <connect from_op="DT Gini Index" from_port="averagable 1" to_op="Log" to_port="through 3"/>
         <connect from_op="DT Accuracy" from_port="averagable 1" to_op="Log" to_port="through 4"/>
         <connect from_op="Log" from_port="through 1" to_op="Log to Data" to_port="through 1"/>
         <connect from_op="Log" from_port="through 2" to_op="Log to Data" to_port="through 2"/>
         <connect from_op="Log" from_port="through 3" to_op="Log to Data" to_port="through 3"/>
         <connect from_op="Log" from_port="through 4" to_op="Log to Data" to_port="through 4"/>
         <connect from_op="Log" from_port="through 5" to_op="Log to Data" to_port="through 5"/>
         <connect from_op="Log" from_port="through 6" to_op="Log to Data" to_port="through 6"/>
         <connect from_op="Log" from_port="through 7" to_op="Log to Data" to_port="through 7"/>
         <connect from_op="Log" from_port="through 8" to_op="Log to Data" to_port="through 8"/>
         <connect from_op="Log" from_port="through 9" to_op="Log to Data" to_port="through 9"/>
         <connect from_op="Log" from_port="through 10" to_op="Log to Data" to_port="through 10"/>
         <connect from_op="Log" from_port="through 11" to_op="Log to Data" to_port="through 11"/>
         <connect from_op="Log" from_port="through 12" to_op="Log to Data" to_port="through 12"/>
         <connect from_op="Log" from_port="through 13" to_op="Log to Data" to_port="through 13"/>
         <connect from_op="Log" from_port="through 14" to_op="Log to Data" to_port="through 14"/>
         <connect from_op="Log" from_port="through 15" to_op="Log to Data" to_port="through 15"/>
         <connect from_op="Log" from_port="through 16" to_op="Log to Data" to_port="through 16"/>
         <connect from_op="Log to Data" from_port="exampleSet" to_op="Store" to_port="input"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
       </process>
     </operator>
    </process>
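
A note on the X-Validation operators in the process above: each tree inside a validation is trained on only part of the data, and every example lands in the test part exactly once. For a class with a single example this means one training set never sees that class at all. The plain-Python sketch below illustrates the counting argument; `k = 10` is an assumed number of folds (the XML does not set it), and the shuffled round-robin split is a simplification for illustration, not RapidMiner's own sampling.

```python
import random

# Class sizes as given in the first post.
class_sizes = {"Reston": 7, "Zaire": 7, "Sudan": 4, "Bundibugyo": 2, "Cote d'Ivoire": 1}
labels = [name for name, size in class_sizes.items() for _ in range(size)]

k = 10  # assumed number of folds; use the value set in your X-Validation operator
rng = random.Random(0)  # fixed seed so the sketch is reproducible
rng.shuffle(labels)

# Simple round-robin split (shuffled, not stratified), only to illustrate the argument.
folds = [labels[i::k] for i in range(k)]

# Count the training sets that contain no Cote d'Ivoire example at all.
missing_in_training = 0
for i in range(k):
    training = [label for j, fold in enumerate(folds) if j != i for label in fold]
    if "Cote d'Ivoire" not in training:
        missing_in_training += 1

print("fold sizes:", [len(fold) for fold in folds])
print("training sets without Cote d'Ivoire:", missing_in_training)
```

Whatever the shuffle, exactly one training set lacks the class, and in that round the only Cote d'Ivoire example is the one being tested. A tree built in that round cannot contain all 5 classes.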

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    blueearth wrote:

    Sorry, by criteria I mean criterion.
    Ah, so you used the right word, sorry. Trying different criteria makes total sense. To make your life easier, you should have a look at the Loop Parameters (Grid) operator: it lets you automatically try different values for a parameter (e.g. different decision tree criteria). But take this only as a side note; your general process setup looks fine.

    Concerning the differences between the text view and the graphical view: you can test which of the two trees is actually used by applying the model to some data and checking which tree the resulting classifications follow. If you can post the results of that, it would help us a lot in fixing the problem.
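
One way to run that check by hand is to write the text-view tree down as a small function and compare its answers with the predictions that Apply Model produces. This is a minimal sketch in plain Python, assuming each branch without `>` is the complementary "less than or equal" case; the function names are hypothetical and the rows at the bottom are made up purely to show the call.

```python
def classify_text_view(count_ala_gly, count_ile_asn, count_leu_thr):
    """Follow the tree exactly as it is printed in the text view."""
    if count_ala_gly > 5:
        return "Zaire"
    if count_ile_asn > 2.5:
        return "Cote d'Ivoire"
    if count_ile_asn > 1.5:
        return "Sudan"
    if count_leu_thr > 4.5:
        return "Reston"
    return "Bundibugyo"


def disagreements(rows):
    """Return the rows whose prediction differs from the text-view tree.

    Each row is (CountofAlaGly, CountofIleAsn, CountofLeuThr, prediction),
    where the prediction is the label assigned by Apply Model.
    """
    return [row for row in rows if classify_text_view(*row[:3]) != row[3]]


# Made-up rows purely to show the call; replace them with real examples
# and the predictions that Apply Model produced for them.
example_rows = [
    (6, 1, 3, "Zaire"),
    (4, 3, 3, "Cote d'Ivoire"),
    (4, 2, 3, "Reston"),  # the text view says Sudan, so this row is reported
]
print(disagreements(example_rows))
```

If the list comes back empty for real data, the model classifies according to the text view; if many rows are reported, the graphical view is more likely the one that matches the model's behavior.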


    The class that is missing from some trees is probably Cote d'Ivoire: it makes up only about 5% of the data (1 of 21 examples), so the tree creation algorithm probably did not consider it large enough to create a branch for it. The default Decision Tree operator, for example, has a lot of parameters that control the growing of the tree; if you play around with them, the missing class may appear. But be careful: a bad choice of parameter settings can make the tree too specialized on the training data ("overfitting") or too general. As always, creating good models is a process of trial and error and of optimization. Here, too, the Loop Parameters or Optimize Parameters operator will help you.
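
What a grid-style parameter loop does can be pictured as enumerating every combination of parameter values and evaluating each one in turn. The sketch below is plain Python, not RapidMiner code. The four criterion values are the ones used in the posted process (the first tree sets no criterion, so `gain_ratio` is assumed from its name "DT Gain Ratio"); the `minimal_leaf_size` values and the function name `grid` are hypothetical choices for illustration.

```python
from itertools import product

parameter_values = {
    "criterion": ["gain_ratio", "information_gain", "gini_index", "accuracy"],
    "minimal_leaf_size": [1, 2, 3],  # hypothetical values to try
}


def grid(values):
    """Yield one dict per combination of parameter values."""
    names = list(values)
    for combo in product(*(values[name] for name in names)):
        yield dict(zip(names, combo))


settings = list(grid(parameter_values))
print(len(settings), "combinations")
for setting in settings[:3]:
    print(setting)
```

Each of the 12 settings would then be cross-validated in turn. With only 21 examples, small performance differences between settings may well be noise, so it is worth being cautious about picking a "best" one.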

    Hope this helps!

    Happy Mining,
    Marius