I would recommend correcting the distance measures for missing values, or at least adding a note on how the distance is calculated (a warning, etc.).
Currently the distance is calculated as:
    public double calculateDistance(double[] value1, double[] value2) {
        double sum = 0.0;
        int counter = 0;
        for (int i = 0; i < value1.length; i++) {
            if ((!Double.isNaN(value1[i])) && (!Double.isNaN(value2[i]))) {
                double diff = value1[i] - value2[i];
                sum += diff * diff;
                counter++;
            }
        }
        if (counter > 0) {
            return Math.sqrt(sum);
        } else {
            return Double.NaN;
        }
    }
So missing attributes are ignored: each skipped attribute contributes zero to the sum, which means that the distance to an instance with missing values is smaller than the distance to a comparable instance without them. In other words, for kNN and other distance-based methods, instances with missing values are preferred (appear closer) over the others. This leads to incorrect classification results.
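To make the effect concrete, here is a minimal sketch that wraps the method quoted above in a hypothetical class (the name `CurrentDistanceDemo` and the sample vectors are my own, chosen only for illustration):

```java
// Hypothetical wrapper around the distance method quoted in this post.
class CurrentDistanceDemo {

    // Same logic as the current implementation: attributes where either
    // value is NaN are skipped and add nothing to the sum.
    static double calculateDistance(double[] value1, double[] value2) {
        double sum = 0.0;
        int counter = 0;
        for (int i = 0; i < value1.length; i++) {
            if (!Double.isNaN(value1[i]) && !Double.isNaN(value2[i])) {
                double diff = value1[i] - value2[i];
                sum += diff * diff;
                counter++;
            }
        }
        // NaN is returned only when no attribute pair could be compared.
        return counter > 0 ? Math.sqrt(sum) : Double.NaN;
    }
}
```

With a = {1, 1}, b = {2, 3} and c = {2, NaN}, this returns sqrt(5) (about 2.236) for (a, b) but 1.0 for (a, c). The instance c looks closer to a only because its second attribute is missing.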
The state-of-the-art practice is implemented as:
    if ((!Double.isNaN(value1[i])) && (!Double.isNaN(value2[i]))) {
        double diff = value1[i] - value2[i];
        sum += diff * diff;
        counter++;
    } else {
        double diff = max(i) - min(i);
        sum += diff * diff;
        counter++;
    }
where max(i) and min(i) are the maximum and minimum values of attribute i in the training set,
or simply diff = 1 if the attributes are normalized.
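The rule described above could be sketched as a complete, self-contained class as follows. This is only an illustration under my own assumptions: the class name `MissingAwareEuclidean`, the helper `attributeRanges`, and the choice to use a range of 0 for an attribute with no observed values are hypothetical and not part of the existing code.

```java
import java.util.Arrays;

// Hypothetical sketch: Euclidean distance that penalizes missing values
// with the attribute's full range instead of skipping them.
class MissingAwareEuclidean {

    // Computes max(i) - min(i) for every attribute over the training set,
    // ignoring NaN entries.
    static double[] attributeRanges(double[][] training) {
        int n = training[0].length;
        double[] min = new double[n];
        double[] max = new double[n];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : training) {
            for (int i = 0; i < n; i++) {
                if (!Double.isNaN(row[i])) {
                    min[i] = Math.min(min[i], row[i]);
                    max[i] = Math.max(max[i], row[i]);
                }
            }
        }
        double[] range = new double[n];
        for (int i = 0; i < n; i++) {
            // Assumption: an attribute that is missing everywhere gets range 0.
            range[i] = (max[i] >= min[i]) ? max[i] - min[i] : 0.0;
        }
        return range;
    }

    // If either value is missing, the largest possible difference for that
    // attribute (its training-set range) is used as the difference.
    static double calculateDistance(double[] value1, double[] value2, double[] range) {
        double sum = 0.0;
        for (int i = 0; i < value1.length; i++) {
            double diff;
            if (!Double.isNaN(value1[i]) && !Double.isNaN(value2[i])) {
                diff = value1[i] - value2[i];
            } else {
                diff = range[i];
            }
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

For the training set {{0, 0}, {3, 4}, {1, NaN}} the ranges are {3, 4}, so the distance between {0, 0} and {1, NaN} becomes sqrt(1 + 16) (about 4.123) instead of 1.0, and the instance with the missing value is no longer favoured. For normalized attributes, passing a range array filled with 1 gives the diff = 1 variant.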