validation and meta cost

cgkolar Member Posts: 29 Maven
edited November 2018 in Help
Hello everyone.  I am glad that these forums are so newbie friendly.  I am able to run a simple decision tree learner and have had success using both cross-validation and simple validation.  What I would like to do is increase the class recall of the model -- in this case I have two outcomes (students who are at risk and those who are not) and 8 nominal, ordinal, and real attributes derived from a much larger set using logistic regression.

What I would like to do is attach costs to the model -- namely, making it more costly to say that a student at risk will be fine -- but when I add the MetaCost operator and move the DT operator into it, RM gives me an error:
P Aug 23, 2008 11:32:42 PM: Test: Set parameters for com.rapidminer.operator.learner.meta.MetaCostModel
P Aug 23, 2008 11:32:42 PM: Test: Applying com.rapidminer.operator.learner.meta.MetaCostModel
P Aug 23, 2008 11:32:42 PM: [Fatal] ArrayIndexOutOfBoundsException occured in 1st application of Test (ModelApplier)
P Aug 23, 2008 11:32:42 PM: [Fatal] Process failed: operator cannot be executed. Check the log messages...
          Root[1] (Process)
          +- ExampleSource[1] (ExampleSource)
          +- SimpleValidation[1] (SimpleValidation)
            +- MetaCost[1] (MetaCost)
            |  +- DecisionTree[10] (DecisionTree)
            +- ApplierChain[1] (OperatorChain)
here ==>      +- Test[1] (ModelApplier)
                +- Performance[0] (Performance)
That snippet is from the log set in verbose mode, so I am wondering if I am just trying to do something invalid, if I have something out of order, or if I just don't know what I am doing.  :)  Thanks,  --chris

Answers

  • steffen Member Posts: 347 Maven
    Hello Chris

    I rebuilt the process and everything works fine. The exception that occurred in your process points to a deeper problem... Could you please be so kind as to rerun the process in debug mode and post the detailed error message? You can activate debug mode in Tools->Preferences->DebugMode->General. Furthermore, it would be helpful if you could post your process setup, too. Maybe the error is caused by a special combination of parameters.

    Thank you

    Steffen

  • cgkolar Member Posts: 29 Maven
    Thanks Steffen.  I am attaching the debug output here -- I have spent the morning banging my head against this one with no luck.  --chris

    [attachment deleted by admin]
  • steffen Member Posts: 347 Maven
    Hello again

    I took a look at the code; the error occurred here:

    double[] conditionalRisk = new double[numberOfClasses];
    int bestIndex = -1;
    double bestValue = Double.POSITIVE_INFINITY;

    for (int i = 0; i < numberOfClasses; i++) {
        for (int j = 0; j < numberOfClasses; j++) {
            conditionalRisk[i] += confidences[counter][j] * costMatrix[i][j];
        }
        if (conditionalRisk[i] < bestValue) {
            bestValue = conditionalRisk[i];
            bestIndex = i;
        }
    }
    Somehow you were able to produce a confidence of infinity, I guess, so the value of bestIndex never changed, which causes the ArrayIndexOutOfBoundsException in line 165 of the same class. A strange way to ensure that the if-statement evaluates to true in the first run, btw.
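    The failure mode described here can be reproduced in isolation. In Java, any `<` comparison involving NaN is false, and infinity is never strictly less than the initial bestValue of POSITIVE_INFINITY, so the sentinel -1 leaks out of the loop. A minimal standalone sketch (not the actual MetaCost code; names are illustrative):

```java
public class BestIndexDemo {
    // Mimics the MetaCost selection loop: pick the index of the lowest
    // conditional risk. If every risk is NaN or infinite, no comparison
    // ever succeeds and bestIndex keeps its sentinel value -1.
    static int bestIndex(double[] conditionalRisk) {
        int bestIndex = -1;
        double bestValue = Double.POSITIVE_INFINITY;
        for (int i = 0; i < conditionalRisk.length; i++) {
            if (conditionalRisk[i] < bestValue) {
                bestValue = conditionalRisk[i];
                bestIndex = i;
            }
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        // Normal case: a finite risk wins.
        System.out.println(bestIndex(new double[] {0.7, 0.3})); // 1
        // All-NaN risks: every '<' comparison is false, sentinel leaks out.
        System.out.println(bestIndex(new double[] {Double.NaN, Double.NaN})); // -1
        // All-infinite risks: inf < inf is also false.
        System.out.println(bestIndex(new double[] {Double.POSITIVE_INFINITY, Double.POSITIVE_INFINITY})); // -1
    }
}
```

    Using -1 as an array index afterwards is exactly what throws the ArrayIndexOutOfBoundsException.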

    Since the mentioned confidences are a result of the confidences produced by the learner embedded in MetaCost, the Decision Tree could be the source of the problem (in combination with your data). Could you please be so kind as to rerun the process using another learning algorithm (NaiveBayes, for example) and report what happens.

    Running your process with another dataset works, so we have a very special case here...

    This will help track down the source.

    Steffen

  • cgkolar Member Posts: 29 Maven
    The mystery deepens: I ran it with Naive Bayes and got no error.
  • steffen Member Posts: 347 Maven
    Hello Chris

    Well, then it is a problem of the DecisionTree learner. This smells like a hard-to-find bug.
    I guess you prefer trees for your learning task because they are easier for human beings to understand. In that case, I suggest W-J48 (part of Weka, and hence also part of RM :) ), because it implements C4.5 as well (as the Decision Tree learner does). Another commonly used alternative is ID3.

    good night and good luck

    Steffen
  • TobiasMalbrecht Moderator, Employee, Member Posts: 294 RM Product Management
    Hi Steffen, hi Chris,
    steffen wrote:

    Somehow you were able to produce a confidence of infinity, I guess, so the value of bestIndex never changed, which causes the ArrayIndexOutOfBoundsException in line 165 of the same class. A strange way to ensure that the if-statement evaluates to true in the first run, btw.

    Since the mentioned confidences are a result of the confidences produced by the learner embedded in MetaCost, the Decision Tree could be the source of the problem (in combination with your data). Could you please be so kind as to rerun the process using another learning algorithm (NaiveBayes, for example) and report what happens.
    Well, in principle the calculation of confidences is relatively straightforward, which is why I do not think the decision tree learner is actually the problem here. I rather assume that the problem lies (once again) in the nominal mappings:

    if (hasLabel) {
        int labelIndex = getLabel().getMapping().mapString(classIndexMap.get((int) example.getLabel()));
        example.setValue(originalExampleSet.getAttributes().getCost(), costMatrix[bestIndex][labelIndex]);
    } else {
        example.setValue(originalExampleSet.getAttributes().getCost(), conditionalRisk[bestIndex]);
    }
    My guess is that the problem is an undefined label in the mapping, which lets labelIndex take the value -1. However, this is only a guess and I may be wrong. So we will try to check that, although it might be hard to reproduce the error.

    We'll inform you, once we have found the bug.
    Regards,
    Tobias
  • TobiasMalbrecht Moderator, Employee, Member Posts: 294 RM Product Management
    Hi,

    ok, forget my last post. I must admit that Steffen was absolutely right: the error is due to the confidence values the decision tree learner computes. Somehow the decision tree learner manages to generate unknowns as confidences (not infinity, though). I do not yet know what the actual problem is (an empty leaf or whatever). Since I did not write the learner, it is hard for me to find. But I (we) will have a look ... ;)

    Regards,
    Tobias
  • TobiasMalbrecht Moderator, Employee, Member Posts: 294 RM Product Management
    And yet another update:

    The decision tree learner produces unknown confidences in the case that attribute values are unknown. This, however, is not necessarily a bug. Since I did not write the decision tree learner myself, I am not sure whether it should support missing attribute values, e.g. as C4.5 does. I will check that. If it should not support missing values, we will at least have to increase the robustness of the [tt]MetaCost[/tt] operator, i.e. make it able to handle unknown confidences.

    Tobias
  • TobiasMalbrecht Moderator, Employee, Member Posts: 294 RM Product Management
    Hi,

    ok, I got confirmation that the decision tree currently cannot handle missing values. Hence, to get your process working, you will either have to drop the examples that have missing attribute values or replenish the missing values, which, however, might lead to inappropriate results. Maybe we will extend the decision tree learner sometime so that it can handle missing values as well.

    Regards,
    Tobias

  • cgkolar Member Posts: 29 Maven
    Thank you so much, Tobias and Steffen.  On the human side, the problem was my lack of understanding of how the program would react to missing values; I hope that it also provides some value to your coding efforts.

    I am an inferential stats guy, so my gut reaction is, in the manner of SPSS: could there be a toggle in the DT setup that would tell it to ignore cases with missing data?  Maybe I am just overreacting to the terrible state of the institutional data that I have inherited.  :)  Cheers,  --chris
  • TobiasMalbrecht Moderator, Employee, Member Posts: 294 RM Product Management
    Hi Chris,

    well, missing values are always a problem in real-world data! ;) There is no toggle to ignore cases with missing values, but there is - of course - an operator to do so. Simply put an [tt]ExampleFilter[/tt] operator into the process. The parameter [tt]condition_class[/tt] has to be set to [tt]no_missing_attributes[/tt].
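    In RapidMiner 4.x process XML this could look roughly as follows (a sketch based on the operator and parameter names above; the surrounding ExampleSource is a placeholder for your own data source):

```xml
<operator name="ExampleSource" class="ExampleSource">
    <parameter key="attributes" value="mydata.aml"/>
</operator>
<!-- Drop all examples that contain at least one missing attribute value
     before they reach the validation / MetaCost chain. -->
<operator name="ExampleFilter" class="ExampleFilter">
    <parameter key="condition_class" value="no_missing_attributes"/>
</operator>
```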

    Cheers,
    Tobias
  • Legacy User Member Posts: 0 Newbie
    Hi Tobias,

    I have a similar problem. My data set has missing values. When averaging them using the MissingValueReplenishment operator, the MetaCost operator in combination with NaiveBayes produces the same ArrayIndexOutOfBoundsException.

    The reason is confidences evaluating to NaN. In this case the "if (conditionalRisk[i] < bestValue)" block is never entered, and bestIndex stays at -1, which produces an ArrayIndexOutOfBoundsException in line 165. A quickfix is to initialize bestIndex to 0 in line 146.
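    The quickfix (initializing bestIndex to 0) just masks the sentinel; a slightly more defensive variant would also skip non-finite risks explicitly. A standalone sketch, not the actual RapidMiner code:

```java
public class SafeBestIndex {
    // Picks the class with the lowest conditional risk, falling back to
    // class 0 when every risk is NaN or infinite (e.g. because the
    // embedded learner produced unknown confidences).
    static int bestIndex(double[] conditionalRisk) {
        int bestIndex = 0; // quickfix default: never out of bounds
        double bestValue = Double.POSITIVE_INFINITY;
        for (int i = 0; i < conditionalRisk.length; i++) {
            double risk = conditionalRisk[i];
            if (Double.isNaN(risk) || Double.isInfinite(risk)) {
                continue; // skip unusable risks instead of comparing them
            }
            if (risk < bestValue) {
                bestValue = risk;
                bestIndex = i;
            }
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        // A NaN risk is ignored; the lowest finite risk wins.
        System.out.println(bestIndex(new double[] {Double.NaN, 0.2, 0.9})); // 1
        // All risks unusable: fall back to class 0 instead of crashing.
        System.out.println(bestIndex(new double[] {Double.NaN, Double.NaN})); // 0
    }
}
```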

    Here is my xml setting:

    <?xml version="1.0" encoding="windows-1252"?>
    <process version="4.2">

      <operator name="Root" class="Process" expanded="yes">
          <operator name="CSVExampleSource" class="CSVExampleSource">
              <parameter key="filename" value="C:\BI\Data Mining\Data Mining Cup\DMC2002\data_dmc2002_train.csv"/>
              <parameter key="id_name" value="ID"/>
              <parameter key="label_name" value="canceler"/>
          </operator>
          <operator name="MissingValueReplenishment" class="MissingValueReplenishment">
              <list key="columns">
              </list>
          </operator>
          <operator name="XValidation" class="XValidation" expanded="yes">
              <parameter key="create_complete_model" value="true"/>
              <parameter key="keep_example_set" value="true"/>
              <parameter key="number_of_validations" value="2"/>
              <parameter key="sampling_type" value="shuffled sampling"/>
              <operator name="MetaCost" class="MetaCost" expanded="yes">
                  <parameter key="cost_matrix" value="[0.0 43.8;5.7 0.0]"/>
                  <parameter key="iterations" value="2"/>
                  <parameter key="keep_example_set" value="true"/>
                  <parameter key="sampling_with_replacement" value="false"/>
                  <operator name="NaiveBayes" class="NaiveBayes">
                      <parameter key="keep_example_set" value="true"/>
                  </operator>
              </operator>
              <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                  <operator name="ModelApplier" class="ModelApplier">
                      <list key="application_parameters">
                      </list>
                  </operator>
                  <operator name="CostEvaluator (2)" class="CostEvaluator">
                      <parameter key="cost_matrix" value="[0.0 43.8;5.7 0.0]"/>
                  </operator>
                  <operator name="ProcessLog" class="ProcessLog">
                      <list key="log">
                        <parameter key="generation" value="operator.FeatureSelection.value.generation"/>
                        <parameter key="performance" value="operator.FeatureSelection.value.performance"/>
                      </list>
                  </operator>
              </operator>
          </operator>
      </operator>

    </process>
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    thanks for sending this in and also for the detailed bug report (here and per mail). We have just introduced a bugfix which will be available in the next release.

    Cheers,
    Ingo