Execute Python failed in an Optimization / Cross Validation operator

lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
edited September 2019 in Help

Hi,

I use the "Execute Python" operator to generate dummy variables for a dataset.

I know this can be done with the "Nominal to Numerical" operator, or skipped entirely.

However, I found that, without X-validation/Optimization, the resulting decision tree (and its associated predictions/accuracy) differs depending on whether the dummy variables are generated by "Nominal to Numerical" or by "Execute Python", which seems odd.

In my case, the two "Execute Python" operators, placed respectively in the training and testing parts of a "Cross Validation" operator that is itself nested in an "Optimization" operator, do not seem to be executed, and the process then fails.

Here is my process:

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Golf-Testset" width="90" x="45" y="391">
<parameter key="repository_entry" value="//Samples/data/Golf-Testset"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python (3)" width="90" x="179" y="391">
<parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10; data = pd.get_dummies(data,columns = ['Outlook', 'Wind'] )&#10;&#10; # connect 2 output ports to see the results&#10; return data"/>
</operator>
<operator activated="false" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical (2)" width="90" x="112" y="544">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Play"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="45" y="85">
<parameter key="repository_entry" value="//Samples/data/Golf"/>
</operator>
<operator activated="true" class="optimize_parameters_grid" compatibility="7.6.001" expanded="true" height="145" name="Optimize Parameters (Grid)" width="90" x="313" y="85">
<list key="parameters">
<parameter key="Decision Tree.criterion" value="gain_ratio,information_gain,gini_index,accuracy"/>
<parameter key="Decision Tree.apply_pruning" value="true,false"/>
<parameter key="Decision Tree.apply_prepruning" value="true,false"/>
<parameter key="Decision Tree.maximal_depth" value="[-1.0;20;20;linear]"/>
</list>
<process expanded="true">
<operator activated="true" class="concurrency:cross_validation" compatibility="7.6.001" expanded="true" height="145" name="Cross Validation" width="90" x="380" y="34">
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true">
<operator activated="false" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="45" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Play"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="45" y="136">
<parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10; data = pd.get_dummies(data,columns = ['Outlook', 'Wind'] )&#10;&#10; # connect 2 output ports to see the results&#10; return data"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role" width="90" x="179" y="136">
<parameter key="attribute_name" value="Play"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="313" y="136">
<parameter key="maximal_depth" value="-1"/>
</operator>
<connect from_port="training set" to_op="Execute Python" to_port="input 1"/>
<connect from_op="Execute Python" from_port="output 1" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="false" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical (3)" width="90" x="45" y="238">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Play"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python (2)" width="90" x="112" y="85">
<parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10; data = pd.get_dummies(data,columns = ['Outlook', 'Wind'] )&#10;&#10; # connect 2 output ports to see the results&#10; return data"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (3)" width="90" x="246" y="187">
<parameter key="attribute_name" value="Play"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="246" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance (2)" width="90" x="380" y="34">
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_port="test set" to_op="Execute Python (2)" to_port="input 1"/>
<connect from_op="Execute Python (2)" from_port="output 1" to_op="Set Role (3)" to_port="example set input"/>
<connect from_op="Set Role (3)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
<connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
</process>
</operator>
<connect from_port="input 1" to_op="Cross Validation" to_port="example set"/>
<connect from_op="Cross Validation" from_port="model" to_port="result 1"/>
<connect from_op="Cross Validation" from_port="example set" to_port="result 2"/>
<connect from_op="Cross Validation" from_port="performance 1" to_port="performance"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (2)" width="90" x="313" y="391">
<parameter key="attribute_name" value="Play"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" width="90" x="514" y="340">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance" width="90" x="715" y="391">
<list key="class_weights"/>
</operator>
<connect from_op="Retrieve Golf-Testset" from_port="output" to_op="Execute Python (3)" to_port="input 1"/>
<connect from_op="Execute Python (3)" from_port="output 1" to_op="Set Role (2)" to_port="example set input"/>
<connect from_op="Retrieve Golf" from_port="output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
<connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 4"/>
<connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 5"/>
<connect from_op="Optimize Parameters (Grid)" from_port="result 1" to_op="Apply Model" to_port="model"/>
<connect from_op="Optimize Parameters (Grid)" from_port="result 2" to_port="result 6"/>
<connect from_op="Set Role (2)" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Apply Model" from_port="model" to_port="result 2"/>
<connect from_op="Performance" from_port="performance" to_port="result 1"/>
<connect from_op="Performance" from_port="example set" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
<portSpacing port="sink_result 7" spacing="0"/>
</process>
</operator>
</process>

 

My approach may seem pointless, but maybe there is a bug in the "Execute Python" operator, and reporting it could help those who use this operator for more useful tasks.

Thank you for your help,

Regards,

Lionel

Best Answer

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Solution Accepted

    The problem with your Golf dataset is that the attributes don't match. Using breakpoints, I can see that your test fold contains only one record (and your training fold is also small).

    And because you convert to dummy variables separately on the training and testing sides, it's quite likely that some attributes won't match your model, as a test fold may not contain every nominal value that the training fold had.
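
To make the mismatch concrete, here is a minimal sketch in pandas. The two small frames are made-up stand-ins for a training fold and a one-record test fold (not the actual Golf sample), and the string values used for Wind are an assumption about how the data arrives in Python. It only shows that calling pd.get_dummies separately on each side can yield different column sets.

```python
import pandas as pd

# Hypothetical miniature of a training fold; values are illustrative only.
train = pd.DataFrame({
    "Outlook": ["sunny", "overcast", "rain"],
    "Wind": ["true", "false", "false"],
    "Play": ["no", "yes", "yes"],
})

# Hypothetical one-record test fold.
test = pd.DataFrame({"Outlook": ["sunny"], "Wind": ["false"], "Play": ["yes"]})

# Same call as in the Execute Python scripts, applied to each side separately.
train_d = pd.get_dummies(train, columns=["Outlook", "Wind"])
test_d = pd.get_dummies(test, columns=["Outlook", "Wind"])

# Dummy columns the model was trained on but the test fold never produced.
missing_in_test = sorted(set(train_d.columns) - set(test_d.columns))
print(missing_in_test)
```

With this toy data, the test fold lacks the columns for the nominal values it never saw, which is the kind of attribute mismatch described above.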

     

    This is bad practice, and I recommend that you feed your preprocessing model through the RapidMiner process instead.

    However, since you stated that you want to do it this way, you need to ensure that the attributes of your test data match the attributes the model was trained on. You can do this with operators like Superset. See the XML below.
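
On the Python side, one possible way to get matching attributes is to reindex the test columns against the training columns. This is only a sketch under assumptions: align_to_training is a hypothetical helper, the toy frames are invented, and filling absent dummy columns with 0 is a choice made for this sketch, not necessarily what the Superset operator does. How the list of training columns would reach the test-side Execute Python operator is left open.

```python
import pandas as pd


def align_to_training(test_dummies, training_columns):
    """Return test_dummies with exactly the training columns, in that order.

    Columns absent from the test fold are added and filled with 0;
    columns that were not seen during training are dropped.
    (Hypothetical helper, not part of RapidMiner or pandas.)
    """
    return test_dummies.reindex(columns=training_columns, fill_value=0)


# Invented stand-ins for a training fold and a one-record test fold.
train = pd.DataFrame({
    "Outlook": ["sunny", "overcast", "rain"],
    "Wind": ["true", "false", "false"],
    "Play": ["no", "yes", "yes"],
})
test = pd.DataFrame({"Outlook": ["sunny"], "Wind": ["false"], "Play": ["yes"]})

train_d = pd.get_dummies(train, columns=["Outlook", "Wind"])
test_d = pd.get_dummies(test, columns=["Outlook", "Wind"])

# After alignment, the test fold has the same attributes as the training fold.
aligned = align_to_training(test_d, train_d.columns)
print(list(aligned.columns))
```

Filling with 0 fits dummy columns because an absent nominal value simply means "not this category" for every test row; a different fill would be needed for other kinds of attributes.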

     

    Maybe you could also post an example of the incorrect results you're getting with the Nominal to Numerical operator?

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Golf-Testset" width="90" x="45" y="391">
    <parameter key="repository_entry" value="//Samples/data/Golf-Testset"/>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python (3)" width="90" x="179" y="391">
    <parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10; data = pd.get_dummies(data,columns = ['Outlook', 'Wind'] )&#10;&#10; # connect 2 output ports to see the results&#10; return data"/>
    </operator>
    <operator activated="false" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical (2)" width="90" x="112" y="544">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Play"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    <list key="comparison_groups"/>
    </operator>
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="45" y="85">
    <parameter key="repository_entry" value="//Samples/data/Golf"/>
    </operator>
    <operator activated="true" class="optimize_parameters_grid" compatibility="7.6.001" expanded="true" height="166" name="Optimize Parameters (Grid)" width="90" x="313" y="85">
    <list key="parameters">
    <parameter key="Decision Tree.criterion" value="gain_ratio,information_gain,gini_index,accuracy"/>
    <parameter key="Decision Tree.apply_pruning" value="true,false"/>
    <parameter key="Decision Tree.apply_prepruning" value="true,false"/>
    <parameter key="Decision Tree.maximal_depth" value="[-1.0;20;20;linear]"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="concurrency:cross_validation" compatibility="7.6.001" expanded="true" height="145" name="Cross Validation" width="90" x="380" y="34">
    <parameter key="use_local_random_seed" value="true"/>
    <process expanded="true">
    <operator activated="false" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="112" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Play"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    <list key="comparison_groups"/>
    <parameter key="use_underscore_in_name" value="true"/>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="45" y="289">
    <parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10; data = pd.get_dummies(data,columns = ['Outlook', 'Wind'] )&#10;&#10; # connect 2 output ports to see the results&#10; return data"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role" width="90" x="112" y="187">
    <parameter key="attribute_name" value="Play"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="246" y="85">
    <parameter key="maximal_depth" value="-1"/>
    </operator>
    <operator activated="true" class="remember" compatibility="7.6.001" expanded="true" height="68" name="Remember" width="90" x="313" y="187">
    <parameter key="name" value="myDataSet"/>
    </operator>
    <connect from_port="training set" to_op="Execute Python" to_port="input 1"/>
    <connect from_op="Execute Python" from_port="output 1" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
    <connect from_op="Decision Tree" from_port="model" to_port="model"/>
    <connect from_op="Decision Tree" from_port="exampleSet" to_op="Remember" to_port="store"/>
    <connect from_op="Remember" from_port="stored" to_port="through 1"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="126"/>
    <portSpacing port="sink_through 2" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python (2)" width="90" x="45" y="85">
    <parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10; data = pd.get_dummies(data,columns = ['Outlook', 'Wind'] )&#10;&#10; # connect 2 output ports to see the results&#10; return data"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (3)" width="90" x="179" y="187">
    <parameter key="attribute_name" value="Play"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="superset" compatibility="7.6.001" expanded="true" height="82" name="Superset" width="90" x="313" y="238"/>
    <operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="246" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance (2)" width="90" x="380" y="34">
    <list key="class_weights"/>
    </operator>
    <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
    <connect from_port="test set" to_op="Execute Python (2)" to_port="input 1"/>
    <connect from_port="through 1" to_op="Superset" to_port="example set 2"/>
    <connect from_op="Execute Python (2)" from_port="output 1" to_op="Set Role (3)" to_port="example set input"/>
    <connect from_op="Set Role (3)" from_port="example set output" to_op="Superset" to_port="example set 1"/>
    <connect from_op="Superset" from_port="superset 1" to_op="Apply Model (2)" to_port="unlabelled data"/>
    <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
    <connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="105"/>
    <portSpacing port="source_through 2" spacing="21"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    <description align="center" color="yellow" colored="false" height="87" resized="true" width="237" x="254" y="337">There should really be a replace missing values here too, but I didn't feel like adding it. :P</description>
    </process>
    </operator>
    <connect from_port="input 1" to_op="Cross Validation" to_port="example set"/>
    <connect from_op="Cross Validation" from_port="model" to_port="result 1"/>
    <connect from_op="Cross Validation" from_port="example set" to_port="result 2"/>
    <connect from_op="Cross Validation" from_port="test result set" to_port="result 3"/>
    <connect from_op="Cross Validation" from_port="performance 1" to_port="performance"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (2)" width="90" x="313" y="391">
    <parameter key="attribute_name" value="Play"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="recall" compatibility="7.6.001" expanded="true" height="68" name="Recall" width="90" x="313" y="493">
    <parameter key="name" value="myDataSet"/>
    <description align="center" color="transparent" colored="false" width="126">This needs to happen AFTER the Optimize has run.</description>
    </operator>
    <operator activated="true" class="superset" compatibility="7.6.001" expanded="true" height="82" name="Superset (2)" width="90" x="447" y="442"/>
    <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" width="90" x="514" y="340">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance" width="90" x="715" y="391">
    <list key="class_weights"/>
    </operator>
    <connect from_op="Retrieve Golf-Testset" from_port="output" to_op="Execute Python (3)" to_port="input 1"/>
    <connect from_op="Execute Python (3)" from_port="output 1" to_op="Set Role (2)" to_port="example set input"/>
    <connect from_op="Retrieve Golf" from_port="output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 4"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 5"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="result 1" to_op="Apply Model" to_port="model"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="result 2" to_port="result 6"/>
    <connect from_op="Set Role (2)" from_port="example set output" to_op="Superset (2)" to_port="example set 1"/>
    <connect from_op="Recall" from_port="result" to_op="Superset (2)" to_port="example set 2"/>
    <connect from_op="Superset (2)" from_port="superset 1" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
    <connect from_op="Performance" from_port="performance" to_port="result 1"/>
    <connect from_op="Performance" from_port="example set" to_port="result 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    <portSpacing port="sink_result 5" spacing="0"/>
    <portSpacing port="sink_result 6" spacing="0"/>
    <portSpacing port="sink_result 7" spacing="0"/>
    </process>
    </operator>
    </process>

Answers

  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @JEdward

     

    Thank you for your response and your advice.

    1. Indeed, when the dummy variables are generated in the "Optimization" operator or in the main window of the process (after the training dataset), the process runs fine. So it was not a problem with the "Execute Python" operator.

    I had set it up this way because, in a previous topic, I was told that "the conversion into dummy variables need to be done inside of x-val to do it right".

    2. Concerning the differences between the two methods of generating dummy variables:

     2.a Here is the process using "Execute Python":

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="45" y="34">
    <parameter key="repository_entry" value="//Samples/data/Golf"/>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="179" y="34">
    <parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10; data = pd.get_dummies(data,columns = ['Outlook', 'Wind'] )&#10;&#10; # connect 2 output ports to see the results&#10; return data"/>
    </operator>
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Golf-Testset" width="90" x="45" y="136">
    <parameter key="repository_entry" value="//Samples/data/Golf-Testset"/>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python (2)" width="90" x="179" y="136">
    <parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10; data = pd.get_dummies(data,columns = ['Outlook','Wind'] )&#10;&#10; # connect 2 output ports to see the results&#10; return data"/>
    </operator>
    <operator activated="false" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical (2)" width="90" x="179" y="391">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Play"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    <list key="comparison_groups"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (2)" width="90" x="313" y="136">
    <parameter key="attribute_name" value="Play"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="false" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="179" y="238">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Play"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    <list key="comparison_groups"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
    <parameter key="attribute_name" value="Play"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="447" y="34">
    <parameter key="criterion" value="gini_index"/>
    </operator>
    <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" width="90" x="581" y="85">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance" width="90" x="715" y="85">
    <list key="class_weights"/>
    </operator>
    <connect from_op="Retrieve Golf" from_port="output" to_op="Execute Python" to_port="input 1"/>
    <connect from_op="Execute Python" from_port="output 1" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Retrieve Golf-Testset" from_port="output" to_op="Execute Python (2)" to_port="input 1"/>
    <connect from_op="Execute Python (2)" from_port="output 1" to_op="Set Role (2)" to_port="example set input"/>
    <connect from_op="Set Role (2)" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
    <connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
    <connect from_op="Performance" from_port="performance" to_port="result 1"/>
    <connect from_op="Performance" from_port="example set" to_port="result 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    </process>
    </operator>
    </process>

    2.b The associated results (Python):

    [Image: Dummies_variables_Python.png]

    2.c Here is the (same) process using "Nominal to Numerical":

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="45" y="34">
    <parameter key="repository_entry" value="//Samples/data/Golf"/>
    </operator>
    <operator activated="false" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="68" name="Execute Python" width="90" x="313" y="238">
    <parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10; data = pd.get_dummies(data,columns = ['Outlook', 'Wind'] )&#10;&#10; # connect 2 output ports to see the results&#10; return data"/>
    </operator>
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Golf-Testset" width="90" x="45" y="136">
    <parameter key="repository_entry" value="//Samples/data/Golf-Testset"/>
    </operator>
    <operator activated="false" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="68" name="Execute Python (2)" width="90" x="313" y="289">
    <parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10; data = pd.get_dummies(data,columns = ['Outlook','Wind'] )&#10;&#10; # connect 2 output ports to see the results&#10; return data"/>
    </operator>
    <operator activated="true" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical (2)" width="90" x="179" y="136">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Play"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    <list key="comparison_groups"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (2)" width="90" x="313" y="136">
    <parameter key="attribute_name" value="Play"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="179" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Play"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    <list key="comparison_groups"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
    <parameter key="attribute_name" value="Play"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="447" y="34">
    <parameter key="criterion" value="gini_index"/>
    </operator>
    <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" width="90" x="581" y="85">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance" width="90" x="715" y="85">
    <list key="class_weights"/>
    </operator>
    <connect from_op="Retrieve Golf" from_port="output" to_op="Nominal to Numerical" to_port="example set input"/>
    <connect from_op="Retrieve Golf-Testset" from_port="output" to_op="Nominal to Numerical (2)" to_port="example set input"/>
    <connect from_op="Nominal to Numerical (2)" from_port="example set output" to_op="Set Role (2)" to_port="example set input"/>
    <connect from_op="Set Role (2)" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
    <connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
    <connect from_op="Performance" from_port="performance" to_port="result 1"/>
    <connect from_op="Performance" from_port="example set" to_port="result 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    </process>
    </operator>
    </process>

    2.d The associated results (Nominal to Numerical)

    Dummies_variables_RM.png

     

     

    How can this behaviour be explained?

    Thank you.

     

    Regards,

     

    Lionel

     

     

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

    TLDR: Change your datatype from Integer to Real for Temperature & Humidity. 

     

    Well this was interesting!

     

    This happens because your Execute Python operator parses the numbers and changes Temperature & Humidity from the Integer to the Real data type. For some reason the Real data type performs significantly better than the Integer data type on this dataset, and I have absolutely no idea why. The two models differ in their final split: the Real Decision Tree splits on Temperature 71, whereas the Integer Decision Tree splits on Outlook = Rain. So it's probably related to the way splits are calculated. Does anyone want to look at the DT code and see if they can spot why it behaves like this?

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Golf (2)" width="90" x="45" y="34">
    <parameter key="repository_entry" value="//Samples/data/Golf"/>
    </operator>
    <operator activated="true" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="179" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Play"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    <list key="comparison_groups"/>
    <parameter key="use_underscore_in_name" value="true"/>
    </operator>
    <operator activated="true" class="numerical_to_real" compatibility="7.6.001" expanded="true" height="82" name="Numerical to Real" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Humidity|Temperature"/>
    </operator>
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Golf-Testset (2)" width="90" x="45" y="136">
    <parameter key="repository_entry" value="//Samples/data/Golf-Testset"/>
    </operator>
    <operator activated="true" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical (2)" width="90" x="179" y="136">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Play"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    <list key="comparison_groups"/>
    <parameter key="use_underscore_in_name" value="true"/>
    </operator>
    <operator activated="true" class="numerical_to_real" compatibility="7.6.001" expanded="true" height="82" name="Numerical to Real (2)" width="90" x="313" y="136">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Humidity|Temperature"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (3)" width="90" x="447" y="136">
    <parameter key="attribute_name" value="Play"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (4)" width="90" x="447" y="34">
    <parameter key="attribute_name" value="Play"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Real Decision Tree" width="90" x="581" y="34">
    <parameter key="criterion" value="gini_index"/>
    </operator>
    <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="715" y="85">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance (2)" width="90" x="849" y="34">
    <list key="class_weights"/>
    </operator>
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Golf (3)" width="90" x="45" y="289">
    <parameter key="repository_entry" value="//Samples/data/Golf"/>
    </operator>
    <operator activated="true" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical (3)" width="90" x="179" y="289">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Play"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    <list key="comparison_groups"/>
    <parameter key="use_underscore_in_name" value="true"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (5)" width="90" x="447" y="289">
    <parameter key="attribute_name" value="Play"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Integer Decision Tree" width="90" x="581" y="340">
    <parameter key="criterion" value="gini_index"/>
    </operator>
    <connect from_op="Retrieve Golf (2)" from_port="output" to_op="Nominal to Numerical" to_port="example set input"/>
    <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Numerical to Real" to_port="example set input"/>
    <connect from_op="Numerical to Real" from_port="example set output" to_op="Set Role (4)" to_port="example set input"/>
    <connect from_op="Retrieve Golf-Testset (2)" from_port="output" to_op="Nominal to Numerical (2)" to_port="example set input"/>
    <connect from_op="Nominal to Numerical (2)" from_port="example set output" to_op="Numerical to Real (2)" to_port="example set input"/>
    <connect from_op="Numerical to Real (2)" from_port="example set output" to_op="Set Role (3)" to_port="example set input"/>
    <connect from_op="Set Role (3)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
    <connect from_op="Set Role (4)" from_port="example set output" to_op="Real Decision Tree" to_port="training set"/>
    <connect from_op="Real Decision Tree" from_port="model" to_op="Apply Model (2)" to_port="model"/>
    <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
    <connect from_op="Apply Model (2)" from_port="model" to_port="result 3"/>
    <connect from_op="Performance (2)" from_port="performance" to_port="result 1"/>
    <connect from_op="Performance (2)" from_port="example set" to_port="result 2"/>
    <connect from_op="Retrieve Golf (3)" from_port="output" to_op="Nominal to Numerical (3)" to_port="example set input"/>
    <connect from_op="Nominal to Numerical (3)" from_port="example set output" to_op="Set Role (5)" to_port="example set input"/>
    <connect from_op="Set Role (5)" from_port="example set output" to_op="Integer Decision Tree" to_port="training set"/>
    <connect from_op="Integer Decision Tree" from_port="model" to_port="result 4"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="189"/>
    <portSpacing port="sink_result 4" spacing="21"/>
    <portSpacing port="sink_result 5" spacing="0"/>
    <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="648" y="206">Note the difference between the two trees.</description>
    </process>
    </operator>
    </process>

     

     

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

    And lastly, here's your original process, changed so that it uses the Real conversion and applies Nominal to Numerical the RapidMiner way (so the preprocessing model created in training is passed through to the testing part of the subprocess).

     

    However, I would advise being careful about using accuracy as the performance measure here: the Decision Tree produced doesn't really classify items very well. Despite the high accuracy, it's actually just classifying every day as a golf day. (Whilst this might be true in US politics, it's not necessarily true in our dataset.)
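
    A minimal sketch of why accuracy alone can mislead here. The labels below are made up for illustration (they are not the results of this process), and `always_yes` is a hypothetical stand-in for a tree that predicts "yes" for every day:

```python
# Hypothetical labels for illustration only: 9 "yes" days and 5 "no" days.
actual = ["yes"] * 9 + ["no"] * 5


def always_yes(_example):
    """Stand-in for a model that classifies every day as a golf day."""
    return "yes"


predicted = [always_yes(day) for day in actual]


def accuracy(actual, predicted):
    """Fraction of examples whose prediction matches the true label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def recall(actual, predicted, label):
    """Fraction of the examples with true class `label` that were predicted as `label`."""
    relevant = [p for a, p in zip(actual, predicted) if a == label]
    return sum(p == label for p in relevant) / len(relevant)


print("accuracy:", accuracy(actual, predicted))
print("recall(yes):", recall(actual, predicted, "yes"))
print("recall(no):", recall(actual, predicted, "no"))
```

    Per-class recall (or a confusion matrix) exposes what the single accuracy number hides: with these toy labels, no "no" day is ever recognised.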

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Golf-Testset" width="90" x="45" y="391">
    <parameter key="repository_entry" value="//Samples/data/Golf-Testset"/>
    </operator>
    <operator activated="true" class="numerical_to_real" compatibility="7.6.001" expanded="true" height="82" name="Numerical to Real (2)" width="90" x="179" y="391"/>
    <operator activated="false" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical (2)" width="90" x="112" y="544">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Play"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    <list key="comparison_groups"/>
    </operator>
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="45" y="85">
    <parameter key="repository_entry" value="//Samples/data/Golf"/>
    </operator>
    <operator activated="true" class="numerical_to_real" compatibility="7.6.001" expanded="true" height="82" name="Numerical to Real" width="90" x="179" y="85"/>
    <operator activated="true" class="optimize_parameters_grid" compatibility="7.6.001" expanded="true" height="145" name="Optimize Parameters (Grid)" width="90" x="313" y="85">
    <list key="parameters">
    <parameter key="Decision Tree.criterion" value="gain_ratio,information_gain,gini_index,accuracy"/>
    <parameter key="Decision Tree.apply_pruning" value="true,false"/>
    <parameter key="Decision Tree.apply_prepruning" value="true,false"/>
    <parameter key="Decision Tree.maximal_depth" value="[-1.0;20;20;linear]"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="concurrency:cross_validation" compatibility="7.6.001" expanded="true" height="145" name="Cross Validation" width="90" x="380" y="34">
    <parameter key="use_local_random_seed" value="true"/>
    <process expanded="true">
    <operator activated="true" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="45" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Play"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    <list key="comparison_groups"/>
    </operator>
    <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="179" y="136">
    <parameter key="criterion" value="accuracy"/>
    <parameter key="apply_pruning" value="false"/>
    <parameter key="apply_prepruning" value="false"/>
    </operator>
    <operator activated="true" class="group_models" compatibility="7.6.001" expanded="true" height="103" name="Group Models" width="90" x="313" y="34"/>
    <connect from_port="training set" to_op="Nominal to Numerical" to_port="example set input"/>
    <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
    <connect from_op="Nominal to Numerical" from_port="preprocessing model" to_op="Group Models" to_port="models in 1"/>
    <connect from_op="Decision Tree" from_port="model" to_op="Group Models" to_port="models in 2"/>
    <connect from_op="Group Models" from_port="model out" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="112" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance (2)" width="90" x="246" y="34">
    <list key="class_weights"/>
    </operator>
    <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
    <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
    <connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    </process>
    </operator>
    <connect from_port="input 1" to_op="Cross Validation" to_port="example set"/>
    <connect from_op="Cross Validation" from_port="model" to_port="result 1"/>
    <connect from_op="Cross Validation" from_port="example set" to_port="result 2"/>
    <connect from_op="Cross Validation" from_port="performance 1" to_port="performance"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" width="90" x="514" y="340">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance" width="90" x="715" y="391">
    <list key="class_weights"/>
    </operator>
    <connect from_op="Retrieve Golf-Testset" from_port="output" to_op="Numerical to Real (2)" to_port="example set input"/>
    <connect from_op="Numerical to Real (2)" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Retrieve Golf" from_port="output" to_op="Numerical to Real" to_port="example set input"/>
    <connect from_op="Numerical to Real" from_port="example set output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 4"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 5"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="result 1" to_op="Apply Model" to_port="model"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="result 2" to_port="result 6"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
    <connect from_op="Performance" from_port="performance" to_port="result 1"/>
    <connect from_op="Performance" from_port="example set" to_port="result 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    <portSpacing port="sink_result 5" spacing="0"/>
    <portSpacing port="sink_result 6" spacing="0"/>
    <portSpacing port="sink_result 7" spacing="0"/>
    </process>
    </operator>
    </process>
  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @JEdward

     

    First, thank you for taking the time to perform your analysis and to update my process.

     

    I understand that in this particular case accuracy is not a relevant performance measure. (I chose to post the performance window "by default" to illustrate the difference between the results of the 2 methods.)

     

    Until this mysterious behaviour is clarified, what do you recommend in practice when using decision trees (and maybe other algorithms)?

     - Systematically apply the "Numerical to Real" operator to the datasets, so as to work with real values?

     - Systematically execute the process twice (once with integer values, once with real values) and select the best model (because what is true in the specific Golf case may be false in another case)?

     - Do nothing, because in a real case study a parameter optimization is performed, and the differences between the "real results" and the "integer results" will be (almost totally) masked by the optimization process?

     - Maybe another approach?

     

    And to conclude, "every day as golf day": maybe this outdoor sport is good for body and mind and helps to make the best decisions... (something to meditate on).

     

    Thank you for your responses,

     

    Best regards,

     

    Lionel

     

     

     

     

     

     

     

     

  • gmeier Employee, Member Posts: 25 RM Engineering

    Hi,

     

    The decision tree does not care about real or integer values; to the tree it is all a double array. What does influence it is the order of the attributes, for a very simple reason: when it searches for the best split and two attributes have the same benefit, it takes the first attribute with this benefit.

     

    If you look at the "Integer Decision Tree" and "Real Decision Tree" results in JEdward's process, you can see that the two different splits at the lowest node have the same "purity" (3 pure yes, 2 no with one wrong). If you put a breakpoint before the "Real Decision Tree" and "Integer Decision Tree" operators in the process, you can see that for "Real Decision Tree" the attribute "Temperature" comes first, while for "Integer Decision Tree" the attribute "Outlook_rain" comes first.

     

    When you apply a "Type1 to Type2" operator, the order of the attributes might change, in particular if you only change the types of some of the attributes.
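
    The tie-breaking rule can be sketched in a few lines of Python. This is a simplified illustration, not RapidMiner's actual implementation: the toy data, the binary attribute columns, and the function names (`gini`, `split_gain`, `best_attribute`) are assumptions made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gain(column, labels):
    """Impurity reduction obtained by splitting on each distinct value of `column`."""
    n = len(labels)
    weighted = 0.0
    for value in sorted(set(column)):
        subset = [lab for v, lab in zip(column, labels) if v == value]
        weighted += len(subset) / n * gini(subset)
    return gini(labels) - weighted


def best_attribute(table, labels):
    """Pick the attribute with the highest gain; on a tie the earlier attribute wins."""
    best_name, best_gain = None, float("-inf")
    for name, column in table.items():  # dicts keep insertion (column) order
        gain = split_gain(column, labels)
        if gain > best_gain:  # strict comparison: a later tie never replaces the first
            best_name, best_gain = name, gain
    return best_name


# Toy data (assumed): both attributes give 3 pure "yes" rows plus a mixed pair.
labels = ["yes", "yes", "yes", "yes", "no"]
temperature = [1, 1, 1, 0, 0]
outlook_rain = [1, 1, 0, 1, 0]

temperature_first = {"Temperature": temperature, "Outlook_rain": outlook_rain}
outlook_first = {"Outlook_rain": outlook_rain, "Temperature": temperature}

print(best_attribute(temperature_first, labels))
print(best_attribute(outlook_first, labels))
```

    Both attributes yield exactly the same gain on this toy data, so the only thing that decides the split is which column comes first.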

     

  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @gmeier,

     

    Thank you for your explanation of the behaviour of the DT in this case.

    Now the cause of these mysterious results is clear to me.

     

    Regards,

     

    Lionel

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

    Thanks!  That clears it up.  @gmeier

     

    It also means that I shall play around with my own future trees by throwing in a loop with Reorder Attributes to put them in random order, optimized, or (more likely) in order of importance. 
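
    A rough sketch of that idea outside RapidMiner: try every attribute order and record which attribute a first-wins tie-breaker would pick. The gain values and attribute names are hypothetical, and `first_best` is an assumed helper, not a RapidMiner function:

```python
from itertools import permutations

# Hypothetical split benefits; two attributes tie for the best value.
gains = {"Temperature": 0.12, "Outlook_rain": 0.12, "Wind_true": 0.05}


def first_best(order, gains):
    """Return the first attribute in `order` that reaches the maximum gain."""
    best = max(gains[name] for name in order)
    return next(name for name in order if gains[name] == best)


# Collect every attribute that wins under at least one column order.
chosen = {first_best(order, gains) for order in permutations(gains)}
print(sorted(chosen))
```

    If more than one attribute ends up in `chosen`, the split depends on column order, which is the situation described above.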

     

     
