Questions about the k-Nearest-Neighbour implementation

kostadin · September 2008

I am using the k-Nearest-Neighbour operator to get a model for my example set. However, from the operator description alone I am not totally clear about how the algorithm is implemented. I checked the source code of the operator as well but it's difficult to understand.

1st question:
My example mixes numerical data and nominal data. With numerical data, there is no big problem in understanding the meaning of the term "nearest neighbour". It's different however with nominal data: For instance, there is an attribute with, let's say, 3 different possible nominal values or possibly a missing value: costs = low/medium/high/? (Btw: Is this called a 'polynominal attribute'?)

How does RapidMiner's KNN operator treat this when learning the model? Does it:
- skip such data? (Just ignoring it. The algorithm does not use this attribute for training.)
- use some kind of "binary matching decision" like: "If the current attribute value x_i is exactly the same as the target value x_j, then the neighbour is said to be 'near', whereas if they are different, the neighbour is said to be 'far'."?
- use any other algorithm?

2nd question:
Furthermore: How are missing values being treated (numerical and/or nominal)?

3d question:
How exactly is the weight being implemented that can be applied? In the context of KNN, as far as I know, the distance between x_i and x_j is multiplied with a weight

a / b

where
- a is the correlation coefficient and
- b is the standard deviation.
Is this what is meant with the "weighted_vote" parameter?

Thanks for the clarification.

kostadin · September 2008

I just saw in the other thread (http://rapid-i.com/rapidforum/index.php/topic,154.0.html) the following quote:

Euclidean distance for numerical and nominal values. For nomimal values, a distance of one is accounted if both values are not the same.

This would mean that for nominal values a distance of 0 is accounted if both values are the same. This clarifies already part of my 1st question.

kostadin · September 2008

Concerning the 3rd question:

If no special weights are applied to the distances, then the KNN implementation actually uses 1/k (whereas k is the number of nearest neighbours to be considered) per distance.

Now I got to understand what it does, when the box is checked... (Would be nice if such things be included in the documentary. Thanks.)

steffen · September 2008

Hello Kostadin

As I found out by looking into the java code...

Let x be the object to classifiy, and x_i the nearest neighbours, i=1,...,k

totalDistance = sum(dist(x,x_i)) for all i
totalSimiliarity= sum ( 1-(dist(x,x_i)/totalDistance)) for all i    <= see here

counter is then weighted by:
(1-(dist(x,x_i)/totalDistance))/totalSimilarity  <= weighted by normalized similarity

source: com.rapidminer.operator.learner.lazy.KNNClassificationModel.java

(Would be nice if such things be included in the documentary. Thanks.)

Well, this is a typical mandate for the wiki, because the manual is large enough and explaining every parameter in detail means exp(size). The wiki should be written mainly by the community, but... I must admit, that I cannot spare time at the moment, too, so I will remain silent....

Do you want to contribute ?

greetings

Steffen

kostadin · September 2008

Thanks Steffen,
[quote=Steffen]
Do you want to contribute ?
[/quote]
Just tried now to create a Wiki-page about KNN. (Actually, is Learning a good category?) Got problems with the session as it seems, the wiki reacts very slowly. And I'm thrown out and have to re-login again and again. I'll try again later on.

TobiasMalbrecht · September 2008

Hi Kostadin,

thank you for (at least) trying to contribute to the wiki. I just wanted to add an article (or rather some notes

) to the wiki myself yesterday evening, but it seems to be indeed not correctly working. Hence, we will have to check what is wrong. Maybe it is some sort of sourceforge error, but I don't know really ... so, when we come up with a solution (or the error vanishes magically

), we would greatly appreciate that you try to add your contribution again. Unfortunately, the RapidMiner wiki is in some way our problem child in the community since our staff is simply busy with project work and development and only few community members are contributing to the wiki so far.

Cheers,
Tobias

kostadin · September 2008

Okay, tried again to add an article about the nearest neighbors operator. Sorry dudes, does not work. After editing I always get this message:
[quote=Sourceforge]
Sorry! We could not process your edit due to a loss of session data. Please try again. If it still doesn't work, try logging out and logging back in.[/quote]
And as it seems I cannot save my stuff to the server. I stored the article on my harddrive - hope to be able to upload it soon.

Furthermore, Wiki formula editing does not work.

TobiasMalbrecht · September 2008

Hi Kostadin,

well, we actually have not had the time to see what is wrong there. But we will try to have a look into this issue as soon as we can. We will keep you up to date ...

Regards,
Tobias

kostadin · September 2008

Alright, concerning the 1st question:

Nominal attributes are internally represented "pointer-style-like", which means a double value points to a certain place where the original nominal value is stored. The EuclidianDistance will simply treat these values like numerical ones, therefore if two nominal attributes are compared and have the same value, their internal double representations will be the same and thus the distance will be 0.

What will be the distance if they are not equal? Can anybody answer this?

kostadin · September 2008

There is one thing I still don't understand (besides the question how missing values are treated):

If I don't misunderstand the code, both the KNNRegressionModel and the KNNClassificationModel always consider all attributes - including the label attribute - to build up the model. I can't see that the label attribute is excluded anywhere in the code for building up the model. This is weird - shouldn't the model explicitly not rely on the label attribute (since it is the dependent variable)?

steffen · September 2008

Hello Kostadin

KNNClassificationModel: I guess you refer to the following lines... (Constructor)


Attributes attributes = trainingSet.getAttributes();
sampleAttributeNames = new ArrayList<String>(attributes.size());
for (Attribute attribute: attributes) {
	sampleAttributeNames.add(attribute.getName());
}

In the for-loop the method iterator() of the class Attributes is called implicitly. Here is the code from the Interface "Attributes":


/** Iterates over all regular attributes. */
public Iterator<Attribute> iterator();

...which excludes all special Attributes, including the label.

I didnt check the KNNRegressionModel, but I guess it will be a similar call.

greetings

Steffen

IngoRM · September 2008

Hi,

yip, Steffen is right. The loop is only performed on the regular attributes which are delivered by the iterator() method of the class "Attributes". The special attributes (inlcuding the label) are of course skipped.

Cheers,
Ingo

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Questions about the k-Nearest-Neighbour implementation

Answers