RapidMiner

Distance measure with missing values

Status: Open For Voting

I would recommend correcting how distance measures handle missing values, or at least adding a note on how the distance is calculated (a warning, etc.).

Currently the distance is calculated as:

public double calculateDistance(double[] value1, double[] value2) {
    double sum = 0.0;
    int counter = 0;
    for (int i = 0; i < value1.length; i++) {
        if ((!Double.isNaN(value1[i])) && (!Double.isNaN(value2[i]))) {
            double diff = value1[i] - value2[i];
            sum += diff * diff;
            counter++;
        }
    }
    if (counter > 0) {
        return Math.sqrt(sum);
    } else {
        return Double.NaN;
    }
}

So missing attributes are ignored. Each skipped attribute removes a non-negative term from the sum, which means that the distance to an instance with missing values tends to be smaller than the distance to a complete one. In other words, for kNN and other distance-based methods, instances with missing values are preferred (they appear closer) over the others. This leads to incorrect classification results.
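A small sketch of the effect, assuming two attributes normalized to [0, 1]. The class name `MissingValueBiasDemo` and the method name `ignoreMissingDistance` are made up for illustration; the method body just mirrors the logic quoted above and is not taken from the RapidMiner API.

```java
final class MissingValueBiasDemo {

    // Mirrors the quoted method: attribute pairs containing NaN are skipped.
    static double ignoreMissingDistance(double[] a, double[] b) {
        double sum = 0.0;
        int counter = 0;
        for (int i = 0; i < a.length; i++) {
            if (!Double.isNaN(a[i]) && !Double.isNaN(b[i])) {
                double diff = a[i] - b[i];
                sum += diff * diff;
                counter++;
            }
        }
        return counter > 0 ? Math.sqrt(sum) : Double.NaN;
    }
}
```

With a query of {0.0, 0.0}, the complete neighbour {0.1, 0.5} and the incomplete neighbour {0.2, NaN}, the incomplete neighbour comes out closer even though it is farther away on the only attribute both of them have, because its second attribute contributes nothing to the sum.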

 

The state-of-the-art practice is to add a worst-case penalty for each missing value, implemented as:

if ((!Double.isNaN(value1[i])) && (!Double.isNaN(value2[i]))) {
    double diff = value1[i] - value2[i];
    sum += diff * diff;
    counter++;
} else {
    double diff = max(i) - min(i);
    sum += diff * diff;
    counter++;
}

where max(i) and min(i) are the maximum and minimum values of attribute i in the training set, or simply diff = 1 if the attribute is normalized.
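Put together, the proposal could look roughly like the following. This is only a sketch: the class name `RangePenaltyDistance` and the extra `min`/`max` array parameters are assumptions made for illustration (they stand in for max(i) and min(i) above) and are not part of the existing RapidMiner method signature. The caller is assumed to have computed the per-attribute ranges from the training set beforehand.

```java
final class RangePenaltyDistance {

    /**
     * Euclidean distance where an attribute pair with a missing (NaN) value
     * contributes the full attribute range instead of being skipped.
     * min[i] and max[i] are assumed to be the training-set bounds of attribute i.
     */
    static double calculateDistance(double[] value1, double[] value2,
                                    double[] min, double[] max) {
        double sum = 0.0;
        for (int i = 0; i < value1.length; i++) {
            double diff;
            if (!Double.isNaN(value1[i]) && !Double.isNaN(value2[i])) {
                diff = value1[i] - value2[i];
            } else {
                // Missing on either side: assume the worst case for this attribute.
                diff = max[i] - min[i];
            }
            sum += diff * diff;
        }
        // Every attribute now contributes, so NaN is only returned for empty vectors.
        return value1.length > 0 ? Math.sqrt(sum) : Double.NaN;
    }
}
```

For normalized attributes, passing min = {0, 0, ...} and max = {1, 1, ...} gives the diff = 1 shortcut. With this version an instance with missing values can no longer look closer than a complete instance merely because some of its attributes were skipped.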
