"MetaCost vs Performance(Costs) operator"

Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,125   Unicorn
edited May 23 in Help

I am wondering whether there is any difference between how the MetaCost operator and the Performance(Costs) operator implement the loss optimization function. I would not expect there to be. However, I am seeing significant differences in outcomes when comparing a single DT learner evaluated by the Performance(Costs) operator with a cost matrix against the MetaCost operator with 1 iteration and an inner DT using the same cost matrix. There are wide divergences not only in the cost outcome but also in other performance metrics such as accuracy and AUC, as well as in the resulting models. See the attached example process:

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process" origin="GENERATED_TUTORIAL">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.0.003" expanded="true" height="68" name="Sonar" origin="GENERATED_TUTORIAL" width="90" x="112" y="34">
<parameter key="repository_entry" value="//Samples/data/Sonar"/>
</operator>
<operator activated="true" class="multiply" compatibility="9.0.003" expanded="true" height="103" name="Multiply" width="90" x="313" y="34"/>
<operator activated="true" class="concurrency:cross_validation" compatibility="9.0.003" expanded="true" height="145" name="Cross Validation (MetaCost)" width="90" x="514" y="30">
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true">
<operator activated="true" class="metacost" compatibility="9.0.003" expanded="true" height="82" name="MetaCost" origin="GENERATED_TUTORIAL" width="90" x="112" y="34">
<parameter key="cost_matrix" value="[0.0 2.0;3.0 0.0]"/>
<parameter key="iterations" value="1"/>
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true">
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.0.003" expanded="true" height="103" name="Decision Tree (MetaCost)" origin="GENERATED_TUTORIAL" width="90" x="313" y="30"/>
<connect from_port="training set" to_op="Decision Tree (MetaCost)" to_port="training set"/>
<connect from_op="Decision Tree (MetaCost)" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
</process>
</operator>
<connect from_port="training set" to_op="MetaCost" to_port="training set"/>
<connect from_op="MetaCost" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" origin="GENERATED_TUTORIAL" width="90" x="45" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_binominal_classification" compatibility="9.0.003" expanded="true" height="82" name="Performance (MetaCost base)" width="90" x="179" y="34">
<parameter key="AUC" value="true"/>
</operator>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance (MetaCost base)" to_port="labelled data"/>
<connect from_op="Performance (MetaCost base)" from_port="performance" to_port="performance 1"/>
<connect from_op="Performance (MetaCost base)" from_port="example set" to_port="test set results"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="performance_costs" compatibility="9.0.003" expanded="true" height="82" name="Performance (MetaCost Cost)" width="90" x="648" y="34">
<parameter key="cost_matrix" value="[0.0 2.0;3.0 0.0]"/>
<enumeration key="class_order_definition"/>
</operator>
<operator activated="true" class="concurrency:cross_validation" compatibility="9.0.003" expanded="true" height="145" name="Cross Validation (Costs)" width="90" x="514" y="238">
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true">
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.0.003" expanded="true" height="103" name="Decision Tree (root)" origin="GENERATED_TUTORIAL" width="90" x="179" y="34"/>
<connect from_port="training set" to_op="Decision Tree (root)" to_port="training set"/>
<connect from_op="Decision Tree (root)" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model (2)" origin="GENERATED_TUTORIAL" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_costs" compatibility="9.0.003" expanded="true" height="82" name="Performance (Cost DT)" width="90" x="179" y="34">
<parameter key="cost_matrix" value="[0.0 2.0;3.0 0.0]"/>
<enumeration key="class_order_definition"/>
</operator>
<connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (Cost DT)" to_port="example set"/>
<connect from_op="Performance (Cost DT)" from_port="example set" to_port="test set results"/>
<connect from_op="Performance (Cost DT)" from_port="performance" to_port="performance 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="performance_binominal_classification" compatibility="9.0.003" expanded="true" height="82" name="Performance (Base DT)" width="90" x="648" y="289">
<parameter key="AUC" value="true"/>
</operator>
<connect from_op="Sonar" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Cross Validation (MetaCost)" to_port="example set"/>
<connect from_op="Multiply" from_port="output 2" to_op="Cross Validation (Costs)" to_port="example set"/>
<connect from_op="Cross Validation (MetaCost)" from_port="model" to_port="result 1"/>
<connect from_op="Cross Validation (MetaCost)" from_port="test result set" to_op="Performance (MetaCost Cost)" to_port="example set"/>
<connect from_op="Cross Validation (MetaCost)" from_port="performance 1" to_port="result 3"/>
<connect from_op="Performance (MetaCost Cost)" from_port="performance" to_port="result 2"/>
<connect from_op="Cross Validation (Costs)" from_port="model" to_port="result 4"/>
<connect from_op="Cross Validation (Costs)" from_port="test result set" to_op="Performance (Base DT)" to_port="labelled data"/>
<connect from_op="Cross Validation (Costs)" from_port="performance 1" to_op="Performance (Base DT)" to_port="performance"/>
<connect from_op="Performance (Base DT)" from_port="performance" to_port="result 5"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="18"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="105"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
</process>
</operator>
</process>

  @mschmitz any ideas on the underlying algorithms that would be relevant here, or other reasons these might be so different?

Brian T.
Lindon Ventures 
Data Science Consulting from Certified RapidMiner Experts

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,029  RM Data Scientist

    Hi @Telcontar120,

    There is a significant difference. Performance(Costs) is "just" a performance measure. MetaCost is an ensemble learner which, I think, tunes itself to work better on the cost metric.

    From the documentation:

    The MetaCost operator makes its base classifier cost-sensitive by using the cost matrix specified in the cost matrix parameter. The method used by this operator is similar to the MetaCost method described by Pedro Domingos (1999).

     

    The code for it is available here: https://github.com/rapidminer/rapidminer-studio/blob/master/src/main/java/com/rapidminer/operator/learner/meta/MetaCost.java

    Btw, @hhomburg is the author :)

     

    BR,

    Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
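
To make the distinction in this answer concrete, here is a minimal Python sketch. It is not RapidMiner's code. It assumes the convention that `COST[i][j]` is the cost of predicting class `i` when the true class is `j` (check this against the cost matrix editor in your version), and it uses the minimum expected cost rule from Domingos (1999) as a stand-in for what a MetaCost-style model does. The function names are hypothetical. A cost performance measure only scores the predictions it is given, while a cost-sensitive model can change which class is predicted.

```python
from typing import List, Sequence

# Assumed convention: COST[i][j] = cost of predicting class i when the truth is class j.
COST: List[List[float]] = [[0.0, 2.0],
                           [3.0, 0.0]]


def average_cost(y_true: Sequence[int], y_pred: Sequence[int],
                 cost: Sequence[Sequence[float]] = COST) -> float:
    """Score existing predictions with the cost matrix; nothing is re-predicted."""
    total = sum(cost[pred][true] for true, pred in zip(y_true, y_pred))
    return total / len(y_true)


def expected_costs(probs: Sequence[float],
                   cost: Sequence[Sequence[float]] = COST) -> List[float]:
    """Expected cost of predicting each class, given class probabilities."""
    return [sum(cost[i][j] * probs[j] for j in range(len(probs)))
            for i in range(len(cost))]


def min_cost_class(probs: Sequence[float],
                   cost: Sequence[Sequence[float]] = COST) -> int:
    """Cost-sensitive rule: pick the class with the lowest expected cost."""
    costs = expected_costs(probs, cost)
    return min(range(len(costs)), key=costs.__getitem__)


def argmax_class(probs: Sequence[float]) -> int:
    """Plain rule: pick the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)


if __name__ == "__main__":
    probs = [0.45, 0.55]  # illustrative confidences for one example
    print("argmax prediction:  ", argmax_class(probs))
    print("min-cost prediction:", min_cost_class(probs))
    print("expected costs:     ", expected_costs(probs))
```

Under this convention and this cost matrix, class 0 is preferred whenever its confidence exceeds 0.4 rather than 0.5. The predicted labels, and with them accuracy and the confusion matrix, can therefore differ from a plain decision tree even when the underlying confidences are identical.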
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,125   Unicorn

    Yes, I understand that is the case and I know the difference between an ensemble and a base learner :-). 

    However, if you set the iterations of MetaCost to 1, it should use only one instance of the inner learner. In the example process I supplied, that is a DT with the same parameters as the DT in the second model, which is evaluated with the same cost matrix via Performance(Costs).  In that case, why would the results be so different?

    @hhomburg any ideas here?
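
One possible contributor, offered as an assumption to check against the operator's parameters and the linked source rather than as a confirmed explanation: if MetaCost draws a bootstrap sample (sampling with replacement) for each iteration, then even a single iteration trains the inner tree on a different set of rows than the plain decision tree sees. The Python sketch below is illustrative only; the function names, the seed and `n = 200` are hypothetical. It shows how many distinct rows one bootstrap sample typically contains.

```python
import random
from typing import List


def bootstrap_indices(n: int, sample_ratio: float = 1.0, seed: int = 1992) -> List[int]:
    """Draw row indices with replacement, as a bagging-style learner would."""
    rng = random.Random(seed)
    size = int(round(n * sample_ratio))
    return [rng.randrange(n) for _ in range(size)]


def unique_fraction(indices: List[int], n: int) -> float:
    """Fraction of the n original rows that appear at least once in the sample."""
    return len(set(indices)) / n


if __name__ == "__main__":
    n = 200  # hypothetical training-set size, not taken from the process above
    idx = bootstrap_indices(n)
    print(f"rows drawn: {len(idx)}, distinct rows: {unique_fraction(idx, n):.0%}")
```

For large `n` the expected share of distinct rows approaches 1 - 1/e, roughly 63%. A tree grown on such a sample can split differently from one grown on the full training set, which would also change the confidences and therefore AUC.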

     

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,125   Unicorn

    @mschmitz @hhomburg @sgenzer @Ingo Any ideas about this one?  I'm still puzzling over why the differences are so great when the iterations for MetaCost = 1.  Thanks for taking a look at it!

     
