Distance measure with missing values

marcin_blachnik Member Posts: 61 Guru
edited December 2018 in Product Feedback - Resolved

I would recommend correcting how distance measures handle missing values, or at least adding a note on how the distance is calculated (a warning, etc.).

Currently the distance is calculated as:

public double calculateDistance(double[] value1, double[] value2) {
    double sum = 0.0;
    int counter = 0;
    for (int i = 0; i < value1.length; i++) {
        if ((!Double.isNaN(value1[i])) && (!Double.isNaN(value2[i]))) {
            double diff = value1[i] - value2[i];
            sum += diff * diff;
            counter++;
        }
    }
    if (counter > 0) {
        return Math.sqrt(sum);
    } else {
        return Double.NaN;
    }
}

So the missing attributes are ignored, which means that the distance to an instance with missing values is smaller than to an otherwise identical instance without them. In other words, for kNN and other distance-based methods, instances with missing values are preferred (they appear closer) over the others. This leads to incorrect classification results.
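To make the bias concrete, here is a small sketch that reuses the logic of the method quoted above as a static method. The class name MissingValueBias and the sample vectors are made up for illustration; they are not part of RapidMiner.

```java
public class MissingValueBias {

    // Same logic as the quoted method: any attribute that is NaN
    // on either side is skipped and contributes nothing to the sum.
    public static double calculateDistance(double[] value1, double[] value2) {
        double sum = 0.0;
        int counter = 0;
        for (int i = 0; i < value1.length; i++) {
            if (!Double.isNaN(value1[i]) && !Double.isNaN(value2[i])) {
                double diff = value1[i] - value2[i];
                sum += diff * diff;
                counter++;
            }
        }
        return counter > 0 ? Math.sqrt(sum) : Double.NaN;
    }

    public static void main(String[] args) {
        // Illustrative data only: both neighbours differ from the query by 1.0
        // in the first attribute, but one of them has two missing attributes.
        double[] query = {0.0, 0.0, 0.0};
        double[] complete = {1.0, 1.0, 1.0};
        double[] incomplete = {1.0, Double.NaN, Double.NaN};

        System.out.println("complete:   " + calculateDistance(query, complete));   // sqrt(3), about 1.732
        System.out.println("incomplete: " + calculateDistance(query, incomplete)); // 1.0
    }
}
```

The incomplete instance comes out closer purely because two of its attributes were skipped, not because it is known to be more similar to the query.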

The state-of-the-art practice is implemented as:

if ((!Double.isNaN(value1[i])) && (!Double.isNaN(value2[i]))) {
    double diff = value1[i] - value2[i];
    sum += diff * diff;
    counter++;
} else {
    double diff = max(i) - min(i);
    sum += diff * diff;
    counter++;
}

where max(i) and min(i) are the maximum and minimum values of the given attribute in the training set, or simply diff = 1 if the attribute is normalized.
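A self-contained sketch of this proposal might look as follows. The class RangePenaltyDistance, the helper attributeRanges, and the min/max array parameters are hypothetical names chosen for this example, not an existing RapidMiner API; the ranges are assumed to be computed once from the training set.

```java
import java.util.Arrays;

public class RangePenaltyDistance {

    // Hypothetical helper: per-attribute {min, max} over the training set, ignoring NaN.
    // Assumes a non-empty training set whose rows all have the same length.
    public static double[][] attributeRanges(double[][] training) {
        int n = training[0].length;
        double[] min = new double[n];
        double[] max = new double[n];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : training) {
            for (int i = 0; i < n; i++) {
                if (!Double.isNaN(row[i])) {
                    min[i] = Math.min(min[i], row[i]);
                    max[i] = Math.max(max[i], row[i]);
                }
            }
        }
        // An attribute that is missing in every row gets a zero range (an assumption of this sketch).
        for (int i = 0; i < n; i++) {
            if (min[i] > max[i]) {
                min[i] = 0.0;
                max[i] = 0.0;
            }
        }
        return new double[][] {min, max};
    }

    // Euclidean distance where a missing value on either side is replaced
    // by the full attribute range, i.e. the largest possible difference.
    public static double calculateDistance(double[] value1, double[] value2,
                                           double[] min, double[] max) {
        double sum = 0.0;
        for (int i = 0; i < value1.length; i++) {
            double diff;
            if (!Double.isNaN(value1[i]) && !Double.isNaN(value2[i])) {
                diff = value1[i] - value2[i];
            } else {
                diff = max[i] - min[i];
            }
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Illustrative training data only.
        double[][] training = {
            {0.0, 0.0, 0.0},
            {1.0, 1.0, 1.0},
            {0.5, 0.5, 0.5},
            {0.5, Double.NaN, Double.NaN}
        };
        double[][] ranges = attributeRanges(training);
        double[] query = {0.0, 0.0, 0.0};

        System.out.println(calculateDistance(query, training[2], ranges[0], ranges[1])); // sqrt(0.75), about 0.866
        System.out.println(calculateDistance(query, training[3], ranges[0], ranges[1])); // 1.5
    }
}
```

With the range penalty, the instance with missing values is no longer favoured: each missing attribute contributes the largest difference observed in the training data instead of nothing.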


Declined · Last Updated

No comments or votes in over a year, so closing this idea for now. Please comment if still relevant.

Comments

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager