Options

# Different cost of misclassification for every single example – how to implement?

Hello all!

Please advise on the following problem.

[Dataset description]

Dataset contains about 10 thn examples with about 15-20 numeric attributes (not sure that all of them are useful, many are correlated, so this amount can be diminished to 7-10 attributes). Target (label) for these is binominal, let’s say it’s ‘One’ and ‘Two’.

By nature the data is quite noisy which means that it’s normal to talk about probability that case X belongs to class ‘One’ or ‘Two’. The normal expected probability is not expected to be higher than 0.75 (lower than 0.25) for 95% of cases.

The above is pretty typical. Now what I believe is special: for every case in dataset there is a confidence threshold.

Let’s say we have case X1, it’s actual class is ‘One’, threshold is T1. If we learn a model that predicts case X1 to be class ‘One’ with confidence P1, and P1 > T1 then Cost = 0. But if P1 < T1 then Cost = 1/T1.

For case X2 actual class is ‘Two’, threshold is T2. If a model predicts case X2 to be class ‘One’ with confidence P2, and P2 > T2 then Cost = 1/(1-T2). But if P2 < T2 then Cost = 0.

[Problem]

Normally I use some standard classifier like kNN which provides me with confidence output. Then I compare those confidences computed with my thresholds for every case (I do it with Excel). And this gives me final estimation for training set and for testing set. Then I compare performances and decide whether model is good or bad.

What I want is to make a model minimize the Cost function, which is sum of Costs for every case in set.

Does anyone know how It can be done with Rapidminer? I will appreciate any feedback.

Please advise on the following problem.

[Dataset description]

Dataset contains about 10 thn examples with about 15-20 numeric attributes (not sure that all of them are useful, many are correlated, so this amount can be diminished to 7-10 attributes). Target (label) for these is binominal, let’s say it’s ‘One’ and ‘Two’.

By nature the data is quite noisy which means that it’s normal to talk about probability that case X belongs to class ‘One’ or ‘Two’. The normal expected probability is not expected to be higher than 0.75 (lower than 0.25) for 95% of cases.

The above is pretty typical. Now what I believe is special: for every case in dataset there is a confidence threshold.

Let’s say we have case X1, it’s actual class is ‘One’, threshold is T1. If we learn a model that predicts case X1 to be class ‘One’ with confidence P1, and P1 > T1 then Cost = 0. But if P1 < T1 then Cost = 1/T1.

For case X2 actual class is ‘Two’, threshold is T2. If a model predicts case X2 to be class ‘One’ with confidence P2, and P2 > T2 then Cost = 1/(1-T2). But if P2 < T2 then Cost = 0.

[Problem]

Normally I use some standard classifier like kNN which provides me with confidence output. Then I compare those confidences computed with my thresholds for every case (I do it with Excel). And this gives me final estimation for training set and for testing set. Then I compare performances and decide whether model is good or bad.

What I want is to make a model minimize the Cost function, which is sum of Costs for every case in set.

Does anyone know how It can be done with Rapidminer? I will appreciate any feedback.

0

## Answers

66Contributor IIMaybe the custom cost function for the labels would be a good option for you? The Performance (Costs) operator does that (you can specify the cost for each misclassification in a matrix), and there is also a meta-modelling operator called MetaCost.

Hope this helps, gabor

4Contributor IPerformance doesn't really help as I want to learn using these costs, not just to estimate the result.

Or maybe I have to think of how to reformulate my task to make it solvable by existing algorythmes.

66Contributor II