Need a working example of Find Threshold (Meta) operator in RapidMiner

kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 290 Unicorn
edited November 2018 in Help
I've been working with text classification processes in RapidMiner, and I can't figure out the proper way to use the Find Threshold (Meta) operator for multiclass classification. It seems to be the closest multiclass counterpart to the Threshold family of operators used for binary classification.

I am using k-NN models with 11 different classes and a corpus of about 300-500 text documents as the test dataset.

Specifically, I don't see any impact from putting a learner inside the operator: the performance values are always the same, whether or not I assign weights to the classes. There is also no explanation of what the class weights are, and I don't see any way to extract the (possibly) generated thresholds as an output of this operator in order to apply them to the model. Finally, there is no RapidMiner documentation entry for this operator at all.

Does anyone have a working example of the Find Threshold (Meta) operator?

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,053 RM Data Scientist
    Hi there,

    I have never used the meta one. What is the reason for not using the standard one?

    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 574 Unicorn
    I actually don't use either Find Threshold operator, because I also like to produce a table showing the various results and to have the flexibility to optimise for more than just misclassification costs.

    Instead, I use Optimize Parameters (Grid) combined with Create Threshold to test a range of threshold values and select the one that delivers the best performance.

    Here is a short version of what I use:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.4.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="6.4.000" expanded="true" height="60" name="Retrieve Ripley-Set" width="90" x="45" y="30">
            <parameter key="repository_entry" value="//Samples/data/Ripley-Set"/>
          </operator>
          <operator activated="true" class="nominal_to_binominal" compatibility="6.4.000" expanded="true" height="94" name="Nominal to Binominal" width="90" x="179" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="label"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="logistic_regression" compatibility="6.4.000" expanded="true" height="94" name="Logistic Regression" width="90" x="313" y="30"/>
          <operator activated="true" class="apply_model" compatibility="6.4.000" expanded="true" height="76" name="Apply Model" width="90" x="447" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance_binominal_classification" compatibility="6.4.000" expanded="true" height="76" name="Original Performance" width="90" x="581" y="30">
            <parameter key="main_criterion" value="kappa"/>
            <parameter key="classification_error" value="true"/>
            <parameter key="kappa" value="true"/>
            <parameter key="precision" value="true"/>
            <parameter key="recall" value="true"/>
            <parameter key="lift" value="true"/>
            <parameter key="fallout" value="true"/>
            <parameter key="f_measure" value="true"/>
            <parameter key="false_positive" value="true"/>
            <parameter key="false_negative" value="true"/>
            <parameter key="true_positive" value="true"/>
            <parameter key="true_negative" value="true"/>
            <parameter key="sensitivity" value="true"/>
            <parameter key="specificity" value="true"/>
            <parameter key="youden" value="true"/>
            <parameter key="positive_predictive_value" value="true"/>
            <parameter key="negative_predictive_value" value="true"/>
            <parameter key="skip_undefined_labels" value="false"/>
            <parameter key="use_example_weights" value="false"/>
          </operator>
          <operator activated="true" class="optimize_parameters_grid" compatibility="6.4.000" expanded="true" height="94" name="Optimize Parameters (Grid)" width="90" x="648" y="165">
            <list key="parameters">
              <parameter key="TryThreshold.threshold" value="[0.0;1.0;20;linear]"/>
            </list>
            <process expanded="true">
              <operator activated="true" class="create_threshold" compatibility="6.4.000" expanded="true" height="60" name="TryThreshold" width="90" x="45" y="165">
                <parameter key="threshold" value="1.0"/>
                <parameter key="first_class" value="1"/>
                <parameter key="second_class" value="0"/>
              </operator>
              <operator activated="true" class="apply_threshold" compatibility="6.4.000" expanded="true" height="76" name="Apply Threshold (2)" width="90" x="179" y="30"/>
              <operator activated="true" class="performance_binominal_classification" compatibility="6.4.000" expanded="true" height="76" name="Best Threshold" width="90" x="313" y="30">
                <parameter key="main_criterion" value="kappa"/>
                <parameter key="classification_error" value="true"/>
                <parameter key="kappa" value="true"/>
                <parameter key="precision" value="true"/>
                <parameter key="recall" value="true"/>
                <parameter key="lift" value="true"/>
                <parameter key="fallout" value="true"/>
                <parameter key="f_measure" value="true"/>
                <parameter key="false_positive" value="true"/>
                <parameter key="false_negative" value="true"/>
                <parameter key="true_positive" value="true"/>
                <parameter key="true_negative" value="true"/>
                <parameter key="sensitivity" value="true"/>
                <parameter key="specificity" value="true"/>
                <parameter key="youden" value="true"/>
                <parameter key="positive_predictive_value" value="true"/>
                <parameter key="negative_predictive_value" value="true"/>
                <parameter key="skip_undefined_labels" value="false"/>
                <parameter key="use_example_weights" value="false"/>
              </operator>
              <operator activated="true" class="log" compatibility="6.4.000" expanded="true" height="76" name="Log" width="90" x="447" y="30">
                <list key="log">
                  <parameter key="confidence_threshold" value="operator.TryThreshold.parameter.threshold"/>
                  <parameter key="accuracy" value="operator.Best Threshold.value.accuracy"/>
                  <parameter key="true_negative" value="operator.Best Threshold.value.true_negative"/>
                  <parameter key="false_negative" value="operator.Best Threshold.value.false_negative"/>
                  <parameter key="true_positive" value="operator.Best Threshold.value.true_positive"/>
                  <parameter key="false_positive" value="operator.Best Threshold.value.false_positive"/>
                  <parameter key="sensitivity" value="operator.Best Threshold.value.sensitivity"/>
                  <parameter key="specificity" value="operator.Best Threshold.value.specificity"/>
                  <parameter key="precision" value="operator.Best Threshold.value.precision"/>
                  <parameter key="recall" value="operator.Best Threshold.value.recall"/>
                </list>
              </operator>
              <connect from_port="input 1" to_op="Apply Threshold (2)" to_port="example set"/>
              <connect from_op="TryThreshold" from_port="output" to_op="Apply Threshold (2)" to_port="threshold"/>
              <connect from_op="Apply Threshold (2)" from_port="example set" to_op="Best Threshold" to_port="labelled data"/>
              <connect from_op="Best Threshold" from_port="performance" to_op="Log" to_port="through 1"/>
              <connect from_op="Log" from_port="through 1" to_port="performance"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
              <portSpacing port="sink_result 1" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="log_to_data" compatibility="6.4.000" expanded="true" height="94" name="Tested Threshold Table" width="90" x="782" y="120"/>
          <connect from_op="Retrieve Ripley-Set" from_port="output" to_op="Nominal to Binominal" to_port="example set input"/>
          <connect from_op="Nominal to Binominal" from_port="example set output" to_op="Logistic Regression" to_port="training set"/>
          <connect from_op="Logistic Regression" from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_op="Logistic Regression" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Original Performance" to_port="labelled data"/>
          <connect from_op="Original Performance" from_port="performance" to_port="result 1"/>
          <connect from_op="Original Performance" from_port="example set" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
          <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_op="Tested Threshold Table" to_port="through 1"/>
          <connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 4"/>
          <connect from_op="Tested Threshold Table" from_port="exampleSet" to_port="result 2"/>
          <connect from_op="Tested Threshold Table" from_port="through 1" to_port="result 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="54"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="54"/>
          <portSpacing port="sink_result 5" spacing="0"/>
        </process>
      </operator>
    </process>
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,053 RM Data Scientist
    Hi John,

    I think your process is dangerous, because you do not use an X-Validation to ensure quality. The threshold is selected and evaluated on the same data, so it will tend to overestimate your performance.

    ~Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 574 Unicorn
    Yes, I agree. This is a very shortened version of the process.
    I removed all the X-Validations, the number formatting and some other stuff. It's just a demo of how to use Create Threshold.
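
To make the cross-validation point concrete, here is a minimal sketch in Python rather than RapidMiner. It is not the process from this thread. It assumes you already have a list of confidences for the positive class and the matching 0/1 labels; the function names (`kappa`, `apply_threshold`, `best_threshold`, `cross_validated_kappa`) are hypothetical. The threshold is picked on the training part of each fold (the same 21-point grid from 0.0 to 1.0 used in the process above) and scored only on the held-out part. In a full setup the model itself would also be trained inside each fold.

```python
import random


def kappa(labels, preds):
    """Cohen's kappa for two equally long lists of 0/1 values."""
    n = len(labels)
    observed = sum(1 for y, p in zip(labels, preds) if y == p) / n
    pos_y = sum(labels) / n
    pos_p = sum(preds) / n
    # Agreement expected by chance, given the two positive rates.
    expected = pos_y * pos_p + (1 - pos_y) * (1 - pos_p)
    if expected == 1.0:
        return 0.0  # kappa is undefined here; report "no better than chance"
    return (observed - expected) / (1 - expected)


def apply_threshold(confidences, threshold):
    """Predict 1 when the positive-class confidence exceeds the threshold."""
    return [1 if c > threshold else 0 for c in confidences]


def best_threshold(confidences, labels, steps=20):
    """Grid search over steps + 1 evenly spaced thresholds in [0, 1]."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates,
               key=lambda t: kappa(labels, apply_threshold(confidences, t)))


def cross_validated_kappa(confidences, labels, folds=5, steps=20, seed=0):
    """Select the threshold on the training folds, score on the held-out fold."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    scores = []
    for k in range(folds):
        test_idx = indices[k::folds]
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        threshold = best_threshold([confidences[i] for i in train_idx],
                                   [labels[i] for i in train_idx], steps)
        preds = apply_threshold([confidences[i] for i in test_idx], threshold)
        scores.append(kappa([labels[i] for i in test_idx], preds))
    return sum(scores) / folds
```

Comparing `cross_validated_kappa(...)` with the kappa obtained by calling `best_threshold(...)` and scoring on the same data gives a rough idea of how optimistic the non-validated estimate is.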