
# Distance measure with missing values

Member Posts: 61 Guru
edited December 2018

I would recommend correcting the distance measures for missing values, or at least adding a note (a warning, etc.) on how the distance is calculated in that case.

Currently the distance is calculated as:

```java
public double calculateDistance(double[] value1, double[] value2) {
    double sum = 0.0;
    int counter = 0;
    for (int i = 0; i < value1.length; i++) {
        if ((!Double.isNaN(value1[i])) && (!Double.isNaN(value2[i]))) {
            double diff = value1[i] - value2[i];
            sum += diff * diff;
            counter++;
        }
    }
    if (counter > 0) {
        return Math.sqrt(sum);
    } else {
        return Double.NaN;
    }
}
```

so attributes with missing values are ignored, which means that an instance with missing values gets a smaller distance than an otherwise comparable instance without them. In other words, for kNN and other distance-based methods, instances with missing values are preferred (appear closer) over the others. This leads to incorrect classification results.

The state-of-the-art practice is implemented as:
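To make the effect concrete, here is a minimal sketch that reuses the calculation quoted above on three made-up vectors. The class name `MissingValueBias` and the sample values are my own assumptions for illustration; they are not taken from the RapidMiner sources.

```java
// Hypothetical demo class: the method body mirrors the calculation quoted above.
class MissingValueBias {

    // Euclidean distance that skips every attribute where either value is NaN.
    static double calculateDistance(double[] value1, double[] value2) {
        double sum = 0.0;
        int counter = 0;
        for (int i = 0; i < value1.length; i++) {
            if (!Double.isNaN(value1[i]) && !Double.isNaN(value2[i])) {
                double diff = value1[i] - value2[i];
                sum += diff * diff;
                counter++;
            }
        }
        return counter > 0 ? Math.sqrt(sum) : Double.NaN;
    }

    public static void main(String[] args) {
        double[] query = {0.0, 0.0, 0.0};
        double[] complete = {1.0, 1.0, 1.0};
        // Same as 'complete' in the one known attribute, but two values are missing.
        double[] partial = {Double.NaN, Double.NaN, 1.0};

        // All three attributes contribute: sqrt(3), about 1.73.
        System.out.println(calculateDistance(query, complete));
        // Only the last attribute contributes: 1.0, so 'partial' looks closer.
        System.out.println(calculateDistance(query, partial));
    }
}
```

Nothing is known about two of the three attributes of `partial`, yet it ranks as the nearer neighbour simply because fewer squared differences are summed.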

```java
if ((!Double.isNaN(value1[i])) && (!Double.isNaN(value2[i]))) {
    double diff = value1[i] - value2[i];
    sum += diff * diff;
    counter++;
} else {
    double diff = max(i) - min(i);
    sum += diff * diff;
    counter++;
}
```

where `max(i)` and `min(i)` are the maximum and minimum values of attribute `i` in the training set, or simply `diff = 1` if the attribute is normalized.
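A self-contained sketch of how this proposal could look is given below. The class name `RangePenaltyDistance`, its constructor, and the `range` array are hypothetical names of mine, not part of RapidMiner. The sketch assumes the per-attribute minimum and maximum are computed once from the training set, with missing values skipped, and that an attribute with no observed values at all contributes a range of 0.

```java
import java.util.Arrays;

// Hypothetical class: substitutes the attribute's training-set range
// (max - min) whenever either value of a pair is missing.
class RangePenaltyDistance {
    private final double[] range;

    // Computes max(i) - min(i) for every attribute, skipping NaN entries.
    RangePenaltyDistance(double[][] trainingSet) {
        int n = trainingSet[0].length;
        double[] min = new double[n];
        double[] max = new double[n];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : trainingSet) {
            for (int i = 0; i < n; i++) {
                if (!Double.isNaN(row[i])) {
                    min[i] = Math.min(min[i], row[i]);
                    max[i] = Math.max(max[i], row[i]);
                }
            }
        }
        range = new double[n];
        for (int i = 0; i < n; i++) {
            // Assumption: an attribute that is missing everywhere gets range 0.
            range[i] = (max[i] >= min[i]) ? max[i] - min[i] : 0.0;
        }
    }

    double calculateDistance(double[] value1, double[] value2) {
        double sum = 0.0;
        for (int i = 0; i < value1.length; i++) {
            double diff;
            if (!Double.isNaN(value1[i]) && !Double.isNaN(value2[i])) {
                diff = value1[i] - value2[i];
            } else {
                diff = range[i]; // worst-case difference for a missing value
            }
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[][] training = {{0.0, 10.0}, {1.0, 20.0}, {0.5, Double.NaN}};
        RangePenaltyDistance d = new RangePenaltyDistance(training);
        double[] query = {0.0, 10.0};
        // Complete instance: sqrt(1 + 25), about 5.10.
        System.out.println(d.calculateDistance(query, new double[]{1.0, 15.0}));
        // Missing second attribute: sqrt(1 + 100), about 10.05.
        System.out.println(d.calculateDistance(query, new double[]{1.0, Double.NaN}));
    }
}
```

With the range substituted, the instance with the missing value is no longer the nearer one. Because every attribute now contributes, the `counter` check from the quoted code is not needed in this sketch.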
