ROC curves Threshold

Legacy User Member Posts: 0 Newbie
edited August 8 in Help
Ingo and team,
How do you get RapidMiner to output the threshold from ROC curves?
I'm trying to bootstrap a dataset and output both the ROC area and the threshold.
At the moment the threshold datawriter gives the threshold, but we want to repeat this 100 times and calculate the confidence interval of the bootstrapped threshold. Is there an easy way to pass the threshold into the performance evaluator?

Thanks,
Leon
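
For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of the bootstrap described in the question. It is not a RapidMiner process. The toy `labels` and `scores`, all helper names, and the use of Youden's J (TPR minus FPR) to pick the threshold are assumptions for illustration; the thread does not say which criterion the threshold operator applies.

```python
import random
import statistics


def roc_counts(labels, scores):
    """Return [(threshold, false_positives, true_positives)], highest threshold first."""
    rows = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        rows.append((t, fp, tp))
    return rows


def auc(rows):
    """Trapezoidal area under the ROC curve, starting from the point (0, 0)."""
    neg, pos = rows[-1][1], rows[-1][2]  # the lowest threshold accepts everything
    area, prev_fp, prev_tp = 0, 0, 0
    for _, fp, tp in rows:
        area += (fp - prev_fp) * (tp + prev_tp)
        prev_fp, prev_tp = fp, tp
    return area / (2 * neg * pos)


def best_threshold(rows):
    """Threshold maximising Youden's J (assumed criterion); ties go to the highest threshold."""
    neg, pos = rows[-1][1], rows[-1][2]
    # tp * neg - fp * pos is J scaled by pos * neg, kept in integers to avoid rounding ties
    return max(rows, key=lambda r: r[2] * neg - r[1] * pos)[0]


def bootstrap(labels, scores, rounds=100, seed=0):
    """Resample with replacement and collect the AUC and chosen threshold per round."""
    rng = random.Random(seed)
    n = len(labels)
    aucs, thresholds = [], []
    while len(aucs) < rounds:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            rows = roc_counts(ys, [scores[i] for i in idx])
            aucs.append(auc(rows))
            thresholds.append(best_threshold(rows))
    return aucs, thresholds


def percentile_interval(values, level=0.95):
    """Simple percentile interval: drop (1 - level) / 2 of the values from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * (1 - level) / 2)
    return ordered[k], ordered[len(ordered) - 1 - k]


# Made-up data: 1 = positive class, scores = confidence for the positive class
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.65, 0.5, 0.4, 0.3, 0.2, 0.1]

aucs, thresholds = bootstrap(labels, scores)
print("mean bootstrapped AUC:", statistics.mean(aucs))
print("95% interval for the threshold:", percentile_interval(thresholds))
```

With only ten examples the interval is very wide; the point is the shape of the loop, not the numbers.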

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,645  RM Founder
    Hi Leon,

    Not sure whether you want a result similar to the one in the attached picture. It shows the ROC curve together with the confidence thresholds curve for a repeated run. The transparent regions show the standard deviation regions around the mean values (plotted with a solid line).

    If so, this is now possible with the latest CVS version.

    Cheers,
    Ingo

    [attachment deleted by admin]
  • brianbaker Member Posts: 24  Maven
    Where do you specify the confidence parameter?

    I've tried adding:
    <parameter key="calculate_confidences" value="true"/>
    to Performance & ROC, but I am still not seeing the confidence band.

    What operator is it added to?

    Thanks!
  • steffen Member Posts: 347  Guru
    Hello Brian

    I am not sure what you want to achieve, or what Ingo meant by "this is possible with the latest CVS version now".

    Is this helpful?

    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSource" class="ExampleSource">
            <parameter key="attributes" value="C:\Dokumente und Einstellungen\Besitzer\Eigene Dateien\rm_workspace\sample\data\golf.aml"/>
        </operator>
        <operator name="NaiveBayes" class="NaiveBayes">
            <parameter key="keep_example_set" value="true"/>
        </operator>
        <operator name="ModelApplier" class="ModelApplier">
            <list key="application_parameters">
            </list>
        </operator>
        <operator name="ParameterIteration" class="ParameterIteration" expanded="yes">
            <list key="parameters">
              <parameter key="ThresholdCreator.threshold" value="[0.0;1.0;10;linear]"/>
            </list>
            <operator name="ThresholdCreator" class="ThresholdCreator">
                <parameter key="threshold" value="1.0"/>
                <parameter key="first_class" value="no"/>
                <parameter key="second_class" value="yes"/>
            </operator>
            <operator name="ThresholdApplier" class="ThresholdApplier">
            </operator>
            <operator name="BinominalClassificationPerformance" class="BinominalClassificationPerformance">
                <parameter key="keep_example_set" value="true"/>
                <parameter key="precision" value="true"/>
            </operator>
            <operator name="ProcessLog" class="ProcessLog">
                <list key="log">
                  <parameter key="threshold" value="operator.ThresholdCreator.parameter.threshold"/>
                  <parameter key="precision" value="operator.BinominalClassificationPerformance.value.precision"/>
                </list>
            </operator>
        </operator>
    </operator>
    regards,

    Steffen
  • brianbaker Member Posts: 24  Maven
    Steffen,

    Thank you for your help! This isn't quite what I'm looking for, but it is in the right direction.

    The image Ingo posted earlier in this thread shows a lightly colored band around the ROC lines. I believe this represents a measure of confidence / precision at each threshold along the curve. I can't figure out how to turn this band on. I'd also like to know exactly what it represents; how, for instance, does it relate to the precision vs. threshold plot you provided?

    I assume it is related to the confidence in the performance log. I'd like to know where this confidence comes from and how it is represented in the plot. For instance, is it the height of the shaded region?
    <com.rapidminer.tools.math.ROCPoint id="246">
        <falsePositives>139.0</falsePositives>
        <truePositives>68.0</truePositives>
        <confidence>0.19407850064775464</confidence>
    </com.rapidminer.tools.math.ROCPoint>
     
  • steffen Member Posts: 347  Guru
    Hello Brian

    Regarding "what is plotted"
    Ingo wrote: "The transparent regions show the standard deviation regions around the mean values (plotted with a solid line)."

    The mean and standard deviation refer to either the ROC curve (red line) or the threshold curve (blue line). The band relates to precision in that, for each given threshold, you can count the true positives and false positives and hence calculate the precision.

    I assume that you know how ROC curves are calculated; otherwise I recommend this excellent paper (here). For further clarification, please note the difference between a process that uses XValidation (which gives a reliable estimate of the performance) and one with no validation at all (as in my demo process above). The latter evaluates the model on its own training data, so its results are overly optimistic; it was meant for demonstration purposes only.
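
To make the link between a threshold, the ROC point, and the precision concrete, here is a small Python sketch (not RapidMiner code). The data are made up, the helper names are hypothetical, and treating a score greater than or equal to the threshold as a positive prediction is an assumption; RapidMiner's threshold operators may handle the boundary differently.

```python
def confusion_at(labels, scores, threshold):
    """Count (tp, fp, tn, fn) when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        if s >= threshold:
            if y == 1:
                tp += 1
            else:
                fp += 1
        else:
            if y == 1:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def precision(tp, fp):
    """Precision = TP / (TP + FP); undefined (None) when nothing is predicted positive."""
    return tp / (tp + fp) if tp + fp else None


# Made-up data: 1 = positive class, scores = confidence for the positive class
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.65, 0.5, 0.4, 0.3, 0.2, 0.1]

# Sweep thresholds 0.0, 0.1, ..., 1.0, similar in spirit to the parameter iteration above
for step in range(11):
    threshold = step / 10
    tp, fp, tn, fn = confusion_at(labels, scores, threshold)
    tpr = tp / (tp + fn)  # y coordinate of the ROC point
    fpr = fp / (fp + tn)  # x coordinate of the ROC point
    print(threshold, tpr, fpr, precision(tp, fp))
```

Each threshold yields one ROC point (FPR, TPR) and one precision value, both derived from the same confusion counts.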

    Regarding "what is saved"
    Your example entry holds the false positive and true positive counts at a given threshold (the threshold is stored as "confidence" in the XML file). If you create several curves, you get several entries like this, from which the mean and deviation can be calculated. Which leads us to the last question:
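
As a side note on reading such entries, here is a hedged Python sketch that parses `ROCPoint` elements and computes the mean and standard deviation of the true positive counts. Only the first entry comes from this thread. The `<points>` wrapper, the two extra entries, and the idea that they describe the same point from other runs are assumptions made for illustration; the real layout of the saved file may differ.

```python
import statistics
import xml.etree.ElementTree as ET

# First entry is copied from the thread; the wrapper and the other two entries are made up.
SAMPLE = """
<points>
  <com.rapidminer.tools.math.ROCPoint id="246">
    <falsePositives>139.0</falsePositives>
    <truePositives>68.0</truePositives>
    <confidence>0.19407850064775464</confidence>
  </com.rapidminer.tools.math.ROCPoint>
  <com.rapidminer.tools.math.ROCPoint>
    <falsePositives>141.0</falsePositives>
    <truePositives>70.0</truePositives>
    <confidence>0.2</confidence>
  </com.rapidminer.tools.math.ROCPoint>
  <com.rapidminer.tools.math.ROCPoint>
    <falsePositives>137.0</falsePositives>
    <truePositives>66.0</truePositives>
    <confidence>0.19</confidence>
  </com.rapidminer.tools.math.ROCPoint>
</points>
"""


def read_points(xml_text):
    """Return a list of (false_positives, true_positives, confidence) tuples."""
    root = ET.fromstring(xml_text)
    points = []
    for node in root.iter("com.rapidminer.tools.math.ROCPoint"):
        points.append((
            float(node.findtext("falsePositives")),
            float(node.findtext("truePositives")),
            float(node.findtext("confidence")),
        ))
    return points


points = read_points(SAMPLE)
true_positives = [tp for _, tp, _ in points]
print("mean TP:", statistics.mean(true_positives))
print("sample standard deviation of TP:", statistics.stdev(true_positives))
```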

    Regarding "how to turn on the band"
    The band can be "turned on" by calculating more than one ROC curve, so that an average can be calculated. Compare these examples:

    One curve:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSource" class="ExampleSource">
            <parameter key="attributes" value="C:\Dokumente und Einstellungen\Besitzer\Eigene Dateien\rm_workspace\sample\data\golf.aml"/>
        </operator>
        <operator name="NaiveBayes" class="NaiveBayes">
            <parameter key="keep_example_set" value="true"/>
        </operator>
        <operator name="ModelApplier" class="ModelApplier">
            <list key="application_parameters">
            </list>
        </operator>
        <operator name="BinominalClassificationPerformance" class="BinominalClassificationPerformance">
            <parameter key="AUC" value="true"/>
        </operator>
    </operator>

    More than one curve:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSource" class="ExampleSource">
            <parameter key="attributes" value="C:\Dokumente und Einstellungen\Besitzer\Eigene Dateien\rm_workspace\sample\data\golf.aml"/>
        </operator>
        <operator name="XValidation" class="XValidation" expanded="yes">
            <parameter key="number_of_validations" value="2"/>
            <operator name="NaiveBayes" class="NaiveBayes">
                <parameter key="keep_example_set" value="true"/>
            </operator>
            <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                <operator name="ModelApplier" class="ModelApplier">
                    <list key="application_parameters">
                    </list>
                </operator>
                <operator name="two_curves" class="BinominalClassificationPerformance">
                    <parameter key="AUC" value="true"/>
                    <parameter key="precision" value="true"/>
                </operator>
            </operator>
        </operator>
    </operator>
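
The band that appears once there are several curves can be sketched in Python as follows. This is an illustration only: the two fold curves are made up, the function names are hypothetical, and averaging the TPR at fixed FPR values with a sample standard deviation is an assumption. The thread does not state how RapidMiner averages the curves internally.

```python
import statistics


def tpr_at(curve, fpr):
    """Interpolate the TPR at a given FPR.

    curve is a list of (fpr, tpr) pairs sorted from (0, 0) to (1, 1).
    At a vertical jump the upper value is used.
    """
    best = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if x0 <= fpr <= x1:
            if x1 == x0:
                value = max(y0, y1)
            else:
                value = y0 + (y1 - y0) * (fpr - x0) / (x1 - x0)
            best = max(best, value)
    return best


def band(curves, grid):
    """Return (fpr, mean_tpr, lower, upper) for each FPR in grid."""
    result = []
    for fpr in grid:
        values = [tpr_at(curve, fpr) for curve in curves]
        mean = statistics.mean(values)
        spread = statistics.stdev(values) if len(values) > 1 else 0.0
        result.append((fpr, mean, max(0.0, mean - spread), min(1.0, mean + spread)))
    return result


# Made-up ROC curves from two validation folds
curve_a = [(0.0, 0.0), (0.0, 0.5), (0.5, 1.0), (1.0, 1.0)]
curve_b = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]

for row in band([curve_a, curve_b], [0.0, 0.25, 0.5, 0.75, 1.0]):
    print(row)
```

With a single curve the spread is zero everywhere, which is why no band shows up in the one-curve process.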
    Hope this was helpful.

    Steffen

    PS: To get the individual values from which the band is calculated as a table, you have to run a process like this:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSource" class="ExampleSource">
            <parameter key="attributes" value="C:\Dokumente und Einstellungen\Besitzer\Eigene Dateien\rm_workspace\sample\data\golf.aml"/>
        </operator>
        <operator name="XValidation" class="XValidation" expanded="yes">
            <parameter key="number_of_validations" value="2"/>
            <operator name="NaiveBayes" class="NaiveBayes">
                <parameter key="keep_example_set" value="true"/>
            </operator>
            <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                <operator name="ModelApplier" class="ModelApplier">
                    <list key="application_parameters">
                    </list>
                </operator>
                <operator name="ParameterIteration" class="ParameterIteration" expanded="yes">
                    <list key="parameters">
                      <parameter key="ThresholdCreator.threshold" value="[0.0;1.0;10;linear]"/>
                    </list>
                    <operator name="ThresholdCreator" class="ThresholdCreator">
                        <parameter key="threshold" value="1.0"/>
                        <parameter key="first_class" value="no"/>
                        <parameter key="second_class" value="yes"/>
                    </operator>
                    <operator name="ThresholdApplier" class="ThresholdApplier">
                    </operator>
                    <operator name="BinominalClassificationPerformance" class="BinominalClassificationPerformance">
                        <parameter key="keep_example_set" value="true"/>
                        <parameter key="precision" value="true"/>
                    </operator>
                    <operator name="ProcessLog" class="ProcessLog">
                        <list key="log">
                          <parameter key="modeliteration" value="operator.ParameterIteration.value.applycount"/>
                          <parameter key="thresholditeration" value="operator.ParameterIteration.value.iteration"/>
                          <parameter key="threshold" value="operator.ThresholdCreator.parameter.threshold"/>
                          <parameter key="precision" value="operator.BinominalClassificationPerformance.value.precision"/>
                        </list>
                    </operator>
                </operator>
            </operator>
        </operator>
    </operator>
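
Once such a table has been exported, the per-threshold mean and deviation can be computed outside RapidMiner. The Python sketch below assumes rows of (fold, threshold, precision); the column layout, the numbers, and the `summarise` helper are all made up for illustration and do not reflect the actual ProcessLog export format.

```python
import statistics
from collections import defaultdict

# Made-up log rows: (fold, threshold, precision)
rows = [
    (1, 0.0, 0.6), (1, 0.5, 0.8), (1, 1.0, 1.0),
    (2, 0.0, 0.5), (2, 0.5, 1.0), (2, 1.0, 1.0),
]


def summarise(rows):
    """Group precision values by threshold and return {threshold: (mean, stdev)}."""
    by_threshold = defaultdict(list)
    for _fold, threshold, precision in rows:
        by_threshold[threshold].append(precision)
    summary = {}
    for threshold, values in sorted(by_threshold.items()):
        # A sample standard deviation needs at least two values
        spread = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[threshold] = (statistics.mean(values), spread)
    return summary


summary = summarise(rows)
for threshold, (mean, spread) in summary.items():
    print(threshold, mean, spread)
```

The mean gives the solid line and the deviation gives the width of the band at each threshold.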

  • brianbaker Member Posts: 24  Maven
    Steffen,

    Thank you very much!  This clears up all my questions.  I was confused about Ingo's standard deviation comment when I first found this thread.  Now that I see this relates to multiple runs everything is clear.

    Thank you again for your effort and response!
    brian