validation and meta cost
Hello everyone. I am glad that these forums are so newbie friendly. I am able to run a simple decision tree learner and have had success using both cross validation and simple validation. What I would like to do is increase the class recall of the model -- in this case I have two outcomes (students who are at risk and those who are not) and 8 nominal, ordinal, and real attributes derived from a much larger set using logistic regression.
What I would like to do is attach costs to the model -- namely making it more costly to say that a student at risk will be fine -- but when I add the metacost operator and move the DT operator into it RM gives me an error:
P Aug 23, 2008 11:32:42 PM: Test: Set parameters for com.rapidminer.operator.learner.meta.MetaCostModel
P Aug 23, 2008 11:32:42 PM: Test: Applying com.rapidminer.operator.learner.meta.MetaCostModel
P Aug 23, 2008 11:32:42 PM: [Fatal] ArrayIndexOutOfBoundsException occured in 1st application of Test (ModelApplier)
P Aug 23, 2008 11:32:42 PM: [Fatal] Process failed: operator cannot be executed. Check the log messages...
          Root[1] (Process)
          +- ExampleSource[1] (ExampleSource)
          +- SimpleValidation[1] (SimpleValidation)
             +- MetaCost[1] (MetaCost)
             |  +- DecisionTree[10] (DecisionTree)
             +- ApplierChain[1] (OperatorChain)
here ==>        +- Test[1] (ModelApplier)
                +- Performance[0] (Performance)
That snippet is from the log set in verbose mode, so I am wondering if I am just trying to do something invalid, if I have something out of order, or if I just don't know what I am doing. Thanks, --chris
Answers
I rebuilt the process and everything works fine. The exception that occurred in your process points to a deeper problem. Could you please be so kind as to rerun the process in debug mode and post the detailed error message? You can activate debug mode in Tools->Preferences->DebugMode->General. Furthermore, it would be helpful if you could post your process setup, too. Maybe the error is caused by a special combination of parameters.
Thank you
Steffen
I took a look at the code where the error occurred. Somehow you were able to produce a confidence of infinity, I guess, so the value of bestIndex never changes, which causes the ArrayIndexOutOfBoundsException in line 165 of the same class. A strange way to ensure that the if-statement evaluates to true in the first run, by the way.
Since these confidences are derived from the confidences produced by the learner embedded in MetaCost, the DecisionTree could be the source of the problem (in combination with your data). Could you please be so kind as to rerun the process using another learning algorithm (NaiveBayes, for example) and report what happens?
Running your process with another dataset works, so we have a very special case here...
This will help to track down the source.
Steffen
Well, then it is a problem of the DecisionTree learner. This smells like a hard-to-find bug.
I guess you prefer trees for your learning task because they are easier for human beings to understand. In that case, I suggest W-J48 (part of Weka, and hence also part of RM), because it, too, implements C4.5 (as the DecisionTree learner does). Another commonly used alternative is ID3.
good night and good luck
Steffen
We'll inform you, once we have found the bug.
Regards,
Tobias
OK, forget my last post. I must admit that Steffen was absolutely right: the error is due to the confidence values the decision tree learner computes. Somehow the decision tree learner manages to generate unknowns as confidences (not infinity, though). I do not yet know what the actual problem is (an empty leaf or whatever). Since I did not write the learner, it is hard for me to find. But I (we) will have a look ...
Regards,
Tobias
The decision tree learner produces unknown confidences when attribute values are unknown. This, however, is not necessarily a bug. Since I did not write the decision tree learner myself, I am not sure whether it should support missing attribute values, e.g. as C4.5 does. I will check that. If it should not support missing values, we will at least have to increase the robustness of the MetaCost operator, i.e. make it able to handle unknown confidences.
Tobias
OK, I got confirmation that the decision tree currently cannot handle missing values. Hence, to get your process working, you will have to either drop the examples that have missing attribute values or replenish the missing values, which, however, might lead to inappropriate results. Maybe we will extend the decision tree learner at some point so that it can handle missing values as well.
Regards,
Tobias
I am an inferential stats guy, so my gut reaction is: in the manner of SPSS, could there be a toggle in the DT setup that would tell it to ignore cases with missing data? Maybe I am just overreacting to the terrible state of the institutional data that I have inherited. Cheers, --chris
Well, missing values are always a problem in real-world data! There is no toggle to ignore cases with missing values, but there is - of course - an operator to do so. Simply put an ExampleFilter operator into the process. The parameter condition_class has to be set to no_missing_attributes.
Cheers,
Tobias
I have a similar problem. My data set has missing values. When averaging them using the MissingValueReplenishment operator, the MetaCost operator in combination with NaiveBayes produces the same ArrayIndexOutOfBoundsException.
The reason is confidences that evaluate to NaN. In this case the "if (conditionalRisk < bestValue)" block is never entered, and bestIndex stays at -1, which produces an ArrayIndexOutOfBoundsException in line 165. A quick fix is to initialize bestIndex = 0 in line 146.
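The failure mode can be reproduced in isolation. What follows is a minimal Java sketch, not the actual MetaCostModel source: the class and method names (ArgMinSketch, argMinUnsafe, argMinGuarded) are hypothetical, and only the comparison "conditionalRisk < bestValue" and the initial bestIndex of -1 are taken from this thread. It illustrates that every comparison with NaN is false, so the index never leaves -1, and shows one possible way to guard against that.

```java
// Hypothetical stand-in for the arg-min loop discussed in this thread.
// This is NOT the RapidMiner source, only an illustration of the NaN pitfall.
final class ArgMinSketch {

    // Mirrors the described pattern: bestIndex starts at -1 and is only
    // updated when the comparison succeeds. Any comparison with NaN is
    // false, so an all-NaN input leaves bestIndex at -1.
    static int argMinUnsafe(double[] conditionalRisks) {
        int bestIndex = -1;
        double bestValue = Double.POSITIVE_INFINITY;
        for (int i = 0; i < conditionalRisks.length; i++) {
            if (conditionalRisks[i] < bestValue) {
                bestValue = conditionalRisks[i];
                bestIndex = i;
            }
        }
        return bestIndex;
    }

    // Guarded variant (an assumption, not the released bugfix): skips NaN
    // entries and reports "no usable entry" as an empty OptionalInt instead
    // of returning an out-of-range index.
    static java.util.OptionalInt argMinGuarded(double[] conditionalRisks) {
        int bestIndex = -1;
        double bestValue = Double.POSITIVE_INFINITY;
        for (int i = 0; i < conditionalRisks.length; i++) {
            double risk = conditionalRisks[i];
            if (Double.isNaN(risk)) {
                continue; // unknown confidence: ignore this class
            }
            if (bestIndex == -1 || risk < bestValue) {
                bestValue = risk;
                bestIndex = i;
            }
        }
        return bestIndex == -1
                ? java.util.OptionalInt.empty()
                : java.util.OptionalInt.of(bestIndex);
    }
}
```

Compared with simply initializing bestIndex to 0, the guarded variant makes the "no usable confidence" case explicit, so the caller can decide what to do instead of silently predicting the first class. Which behavior is appropriate for MetaCost is a design decision for the maintainers.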
Here is my xml setting:
<?xml version="1.0" encoding="windows-1252"?>
<process version="4.2">
  <operator name="Root" class="Process" expanded="yes">
    <operator name="CSVExampleSource" class="CSVExampleSource">
      <parameter key="filename" value="C:\BI\Data Mining\Data Mining Cup\DMC2002\data_dmc2002_train.csv"/>
      <parameter key="id_name" value="ID"/>
      <parameter key="label_name" value="canceler"/>
    </operator>
    <operator name="MissingValueReplenishment" class="MissingValueReplenishment">
      <list key="columns">
      </list>
    </operator>
    <operator name="XValidation" class="XValidation" expanded="yes">
      <parameter key="create_complete_model" value="true"/>
      <parameter key="keep_example_set" value="true"/>
      <parameter key="number_of_validations" value="2"/>
      <parameter key="sampling_type" value="shuffled sampling"/>
      <operator name="MetaCost" class="MetaCost" expanded="yes">
        <parameter key="cost_matrix" value="[0.0 43.8;5.7 0.0]"/>
        <parameter key="iterations" value="2"/>
        <parameter key="keep_example_set" value="true"/>
        <parameter key="sampling_with_replacement" value="false"/>
        <operator name="NaiveBayes" class="NaiveBayes">
          <parameter key="keep_example_set" value="true"/>
        </operator>
      </operator>
      <operator name="OperatorChain" class="OperatorChain" expanded="yes">
        <operator name="ModelApplier" class="ModelApplier">
          <list key="application_parameters">
          </list>
        </operator>
        <operator name="CostEvaluator (2)" class="CostEvaluator">
          <parameter key="cost_matrix" value="[0.0 43.8;5.7 0.0]"/>
        </operator>
        <operator name="ProcessLog" class="ProcessLog">
          <list key="log">
            <parameter key="generation" value="operator.FeatureSelection.value.generation"/>
            <parameter key="performance" value="operator.FeatureSelection.value.performance"/>
          </list>
        </operator>
      </operator>
    </operator>
  </operator>
</process>
Thanks for sending this in and also for the detailed bug report (here and by email). We have just introduced a bugfix which will be available in the next release.
Cheers,
Ingo