# Equivalent of a cost matrix for regression?

Hi,

I'm wondering if there's something conceptually similar to a classification cost matrix that can be used in regression?

I have a regression problem where I want to predict a numeric label that is usually in the range of 1-10 (sometimes a bit higher, but never negative) using a combination of numeric and nominal attributes (including interaction terms).

However, as a practical matter we're much more interested in accuracy at the lower end of the prediction range than at the upper end. For example, predicting an actual 3.5 value as 4.5 is worse than predicting an actual 7.5 as 9.5, even though the absolute error in the latter case is greater than in the former.

The two approaches I've thought of are:

1) Binning examples into ranges of values (e.g. 1.0-1.5, 1.5-2.0, 2.0-3.0, 4.0-5.0, 6.0-8.0, 8.0+), and using a classification learner with MetaCost to predict the bin, with greater penalties for misclassifications in the lower-range bins. We'd prefer not to go this route because we are interested in understanding the effect of different attributes and interactions on the predicted value, whereas classification would lock us into a single expected value for each bin. We also want to be able to make fine-grained distinctions in predicted values for examples that would end up grouped together in one bin.

2) Using example weights to treat examples with lower-valued labels as more important. This isn't ideal because each example already carries a weight that we use to reflect the reliability of the measurement.

Is there another strategy that might help here?

Thanks,

Keith



## Answers

What about applying a nonlinear transformation to the label values? That way you could stretch the lower values more than the upper ones. The residuals at the lower end would then be larger relative to the original residuals, and so would carry more weight during the internal optimization. Might this solve the problem? It would be nice if you could give feedback, I'm a little bit curious.
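To make the idea concrete, here is a minimal sketch in Python using the two cases from the question. It assumes a natural-log transform, which is only one possible choice (any concave transform, such as a square root, stretches the low end in a similar way), and it assumes the labels are strictly positive. The function names are made up for illustration and are not part of any library.

```python
import math

def to_model_scale(y):
    """Transform a label before training (assumed: natural log, y > 0)."""
    return math.log(y)

def from_model_scale(z):
    """Map a prediction made on the transformed scale back to the original scale."""
    return math.exp(z)

def transformed_error(actual, predicted):
    """Absolute error as the learner would see it on the transformed scale."""
    return abs(to_model_scale(predicted) - to_model_scale(actual))

# The two cases from the question: raw absolute errors are 1.0 and 2.0.
low_end = transformed_error(3.5, 4.5)
high_end = transformed_error(7.5, 9.5)

print(f"actual 3.5 predicted 4.5: raw error 1.0, transformed error {low_end:.3f}")
print(f"actual 7.5 predicted 9.5: raw error 2.0, transformed error {high_end:.3f}")
print("low-end miss is penalised more:", low_end > high_end)
```

On the log scale the 3.5 vs. 4.5 miss comes out slightly larger than the 7.5 vs. 9.5 miss, so a learner trained on the transformed label would treat it as the worse error. You would train on the transformed label and apply the inverse to the predictions afterwards. Because the transform acts on the label rather than on the examples, it should leave the existing reliability weights free for their original purpose.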

Greetings,

Sebastian