[Solved] AUC for nominal features

aryan_hosseinza Member Posts: 74 Contributor II
edited June 2019 in Help
Hi,

I am using a decision tree as a binominal classifier and want to measure the AUC, but the result confuses me: which threshold is being varied when all features are nominal?

Thanks
Arian

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Arian,

    Please post the process you have so far and describe what exactly confuses you.

    Best, Marius
  • aryan_hosseinza Member Posts: 74 Contributor II
    I want to understand how AUC is generally computed when evaluating a binominal classifier.

    To draw a ROC curve, you vary a threshold, calculate TPR and FPR for each threshold value, and plot the corresponding points. But which threshold is being varied here?

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <process expanded="true" height="450" width="882">
          <operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Retrieve" width="90" x="112" y="120">
            <parameter key="repository_entry" value="descritized_GI_FROM_MI50"/>
          </operator>
          <operator activated="true" class="sample_stratified" compatibility="5.2.008" expanded="true" height="76" name="Sample (2)" width="90" x="246" y="120">
            <parameter key="sample" value="relative"/>
          </operator>
          <operator activated="true" class="discretize_by_bins" compatibility="5.2.008" expanded="true" height="94" name="Discretize" width="90" x="380" y="120">
            <parameter key="number_of_bins" value="5"/>
            <parameter key="range_name_type" value="short"/>
          </operator>
          <operator activated="true" class="x_validation" compatibility="5.2.008" expanded="true" height="112" name="Validation (2)" width="90" x="514" y="120">
            <parameter key="parallelize_training" value="true"/>
            <parameter key="parallelize_testing" value="true"/>
            <process expanded="true" height="682" width="502">
              <operator activated="true" class="bagging" compatibility="5.2.008" expanded="true" height="76" name="Bagging" width="90" x="179" y="30">
                <parameter key="sample_ratio" value="0.4"/>
                <parameter key="iterations" value="40"/>
                <parameter key="parallelize_learning_process" value="true"/>
                <process expanded="true" height="682" width="1055">
                  <operator activated="true" class="multiply" compatibility="5.2.008" expanded="true" height="94" name="Multiply (2)" width="90" x="45" y="30"/>
                  <operator activated="true" class="filter_examples" compatibility="5.2.008" expanded="true" height="76" name="Filter Examples (4)" width="90" x="313" y="30">
                    <parameter key="condition_class" value="attribute_value_filter"/>
                    <parameter key="parameter_string" value="event=f"/>
                  </operator>
                  <operator activated="true" class="sample_stratified" compatibility="5.2.008" expanded="true" height="76" name="Sample (Stratified)" width="90" x="447" y="30">
                    <parameter key="sample" value="relative"/>
                    <parameter key="sample_ratio" value="0.8"/>
                  </operator>
                  <operator activated="true" class="filter_examples" compatibility="5.2.008" expanded="true" height="76" name="Filter Examples (3)" width="90" x="313" y="255">
                    <parameter key="condition_class" value="attribute_value_filter"/>
                    <parameter key="parameter_string" value="event=t"/>
                  </operator>
                  <operator activated="true" class="union" compatibility="5.2.008" expanded="true" height="76" name="Union" width="90" x="648" y="210"/>
                  <operator activated="true" class="decision_tree" compatibility="5.2.008" expanded="true" height="76" name="Decision Tree" width="90" x="782" y="210">
                    <parameter key="criterion" value="gini_index"/>
                    <parameter key="minimal_size_for_split" value="160"/>
                    <parameter key="minimal_leaf_size" value="80"/>
                    <parameter key="minimal_gain" value="0.01"/>
                    <parameter key="maximal_depth" value="10"/>
                    <parameter key="confidence" value="0.1"/>
                    <parameter key="number_of_prepruning_alternatives" value="10"/>
                    <parameter key="no_pre_pruning" value="true"/>
                  </operator>
                  <operator activated="false" class="naive_bayes" compatibility="5.2.008" expanded="true" height="76" name="Naive Bayes" width="90" x="782" y="30"/>
                  <connect from_port="training set" to_op="Multiply (2)" to_port="input"/>
                  <connect from_op="Multiply (2)" from_port="output 1" to_op="Filter Examples (4)" to_port="example set input"/>
                  <connect from_op="Multiply (2)" from_port="output 2" to_op="Filter Examples (3)" to_port="example set input"/>
                  <connect from_op="Filter Examples (4)" from_port="example set output" to_op="Sample (Stratified)" to_port="example set input"/>
                  <connect from_op="Sample (Stratified)" from_port="example set output" to_op="Union" to_port="example set 1"/>
                  <connect from_op="Filter Examples (3)" from_port="example set output" to_op="Union" to_port="example set 2"/>
                  <connect from_op="Union" from_port="union" to_op="Decision Tree" to_port="training set"/>
                  <connect from_op="Decision Tree" from_port="model" to_port="model"/>
                  <portSpacing port="source_training set" spacing="0"/>
                  <portSpacing port="sink_model" spacing="0"/>
                </process>
              </operator>
              <connect from_port="training" to_op="Bagging" to_port="training set"/>
              <connect from_op="Bagging" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true" height="682" width="502">
              <operator activated="true" class="apply_model" compatibility="5.2.008" expanded="true" height="76" name="Apply Model (2)" width="90" x="112" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_binominal_classification" compatibility="5.2.008" expanded="true" height="76" name="Performance (2)" width="90" x="313" y="30">
                <parameter key="main_criterion" value="AUC"/>
                <parameter key="accuracy" value="false"/>
                <parameter key="AUC" value="true"/>
              </operator>
              <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
              <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
              <connect from_op="Performance (2)" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve" from_port="output" to_op="Sample (2)" to_port="example set input"/>
          <connect from_op="Sample (2)" from_port="example set output" to_op="Discretize" to_port="example set input"/>
          <connect from_op="Discretize" from_port="example set output" to_op="Validation (2)" to_port="training"/>
          <connect from_op="Validation (2)" from_port="averagable 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    you are right about the threshold. However, the threshold is applied to the confidence of belonging to a certain class. It has nothing to do with the type of the input attributes.
    In a decision tree, most leaves won't be pure. If a leaf consists of, let's say, 75% positive examples and 25% negative ones, then a new example that ends up in this leaf has a confidence of 75% for being positive.
    You are using bagging, which means you grow several decision trees and then predict by majority vote. The more of the decision trees that make the same classification, the higher the confidence of the combined bagging model.
    This means that for decision trees and for bagging the confidences do not have a continuous range as they do for SVMs or Naive Bayes, but only as many "steps" as there are leaves in the tree or classifiers in the bagging model.
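
To make the "steps" concrete, here is a minimal Python sketch. It is not RapidMiner's own AUC routine; the three leaves, their confidences, and the function names are made up for illustration. It sweeps the threshold over the distinct confidence values and collects one ROC point per value.

```python
def roc_points(confidences, labels):
    """Sweep a threshold over the distinct confidence values.

    An example is predicted positive when its confidence is >= the threshold.
    Returns a list of (FPR, TPR) points, starting at (0, 0).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every confidence: nothing is positive
    for threshold in sorted(set(confidences), reverse=True):
        tp = sum(1 for c, y in zip(confidences, labels) if c >= threshold and y == 1)
        fp = sum(1 for c, y in zip(confidences, labels) if c >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical tree with three leaves. Every example that lands in a leaf
# gets that leaf's share of positive training examples as its confidence.
confidences = [0.75, 0.75, 0.75, 0.75, 0.5, 0.5, 0.25, 0.25, 0.25, 0.25]
labels      = [1,    1,    1,    0,    1,   0,   1,    0,    0,    0]

points = roc_points(confidences, labels)
print(points)       # one point per distinct confidence, plus the origin
print(auc(points))
```

With three leaves there are only three distinct confidences, so the curve has three points besides the origin, no matter how many examples are scored.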

    Best, Marius
  • aryan_hosseinza Member Posts: 74 Contributor II
    So you mean that for other algorithms (Naive Bayes, SVM, etc.) the confidence is what varies when drawing the ROC curve?

    Thanks for your answer, but it raises a basic question: how does bagging work with decision trees? At each step, what do all the decision trees vote on? I am a little confused.

    Thanks,
    Arian
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    aryan.hosseinzadeh wrote:

    So you mean that for other algorithms (Naive Bayes, SVM, etc.) the confidence is what varies when drawing the ROC curve?

    For these algorithms, any confidence value between 0 and 1 can be assigned to a new example, whereas for Decision Trees there are only some discrete confidence levels, which are defined by the class balances in the individual leaves.

    aryan.hosseinzadeh wrote:

    Thanks for your answer, but it raises a basic question: how does bagging work with decision trees? At each step, what do all the decision trees vote on? I am a little confused.

    Here we are talking about model application, obviously. To classify a new example, bagging passes it to each decision tree and lets each tree make its decision (i.e. its classification). It then collects the classifications and predicts the value that the majority of the trees predicted.
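
The voting step can be sketched in a few lines of Python. This is only an illustration of the majority vote described here, not the code of RapidMiner's Bagging operator; the function name and the vote counts are hypothetical, while the labels `t` and `f` and the 40 iterations are taken from the posted process.

```python
from collections import Counter


def bagged_prediction(votes, positive="t"):
    """Combine the class labels predicted by the individual trees.

    Returns the majority label and the share of trees that voted for
    the positive class, which serves as the ensemble's confidence.
    On a tie, the label that appears first in `votes` wins.
    """
    counts = Counter(votes)
    majority_label = counts.most_common(1)[0][0]
    confidence_positive = counts[positive] / len(votes)
    return majority_label, confidence_positive


# Hypothetical outcome for one new example: 26 of 40 trees predict "t".
votes = ["t"] * 26 + ["f"] * 14
label, confidence = bagged_prediction(votes)
print(label, confidence)
```

With 40 trees the vote share can only take 41 different values (0/40, 1/40, ..., 40/40), which is why the confidence of the bagged model moves in steps.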

    Hope this helps!

    ~Marius
  • aryan_hosseinza Member Posts: 74 Contributor II
    Thanks, I got it.

    I just checked and found that bagging doesn't produce one single final model! I thought it would. But what about X-Validation? It returns one single final model although it trains several models on different portions of the input data. How does that work in RapidMiner?
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hey Arian,

    if you connect the model output of the X-Validation, it creates one more model on the complete data set in addition to the models for the 10 folds. This is just for your convenience and has nothing to do with the performance estimation process.
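
A small Python sketch of the bookkeeping, under the assumption that the extra model is simply one more training run on all examples. The function name and the strided fold assignment are illustrative and not how RapidMiner samples its folds.

```python
def cross_validation_training_runs(n_examples, k=10, build_final_model=True):
    """Return the list of training sets (as index lists) that get trained on.

    Each of the k folds is held out once for testing while the model is
    trained on the rest. If build_final_model is True, one more model is
    trained on all examples; it is not used for the performance estimate.
    """
    indices = list(range(n_examples))
    folds = [indices[i::k] for i in range(k)]  # simple strided split
    training_sets = []
    for held_out in folds:
        held = set(held_out)
        training_sets.append([i for i in indices if i not in held])
    if build_final_model:
        training_sets.append(indices)  # the extra model on the complete data
    return training_sets


runs = cross_validation_training_runs(100)
print(len(runs))                    # 10 fold models plus 1 final model
print(len(runs[0]), len(runs[-1]))  # fold model sees 90 examples, final model all 100
```

Only the 10 fold models contribute to the averaged performance; the last run exists purely to deliver a model trained on everything.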
  • aryan_hosseinza Member Posts: 74 Contributor II
    Now I get why it sometimes runs 11 rounds!

    Thanks
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Exactly, that's it :)