question about k-nn (in production environment)
hi,
I have tested k-nn on my dataset and get pretty "good" results, about 85% accuracy with Canberra distance and k = 5...
my question now is: is k-nn also well suited to classifying new instances in a, let's say, "real" production environment, where new test data comes in from time to time?
because I have read somewhere that k-nn is supposed to be expensive to compute (however, it just takes a few seconds with 4500 examples here), and that k-nn has to be recomputed from scratch for every new instance that comes in. is that true?
I mean, if I have already placed my n training instances in my m-dimensional space and one new test instance comes in, do I have to calculate the distance from this instance to ALL n stored instances, or only to the k nearest ones? and if the latter, how does it know which instances are the k nearest? I mean, the new instance cannot be "aware" by itself of where it sits in the m-dimensional space and what its nearest neighbours are. or is it somehow possible to "remember" the testing instances and choose the k nearest neighbours according to some heuristic?
Answers
Dear Fred,
you are partly right.
The k-NN model is in fact just the complete stored training data set. During application the algorithm needs to search for the k nearest neighbours to get a classification. This search might take quite a while, especially if you have a lot of training data.
On the other hand: what data sizes are we talking about? And which response times? If response times of more than 1 sec are acceptable and you have fewer than 100k rows, I would imagine that this is no problem.
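To make the "stored data set plus search" idea concrete, here is a minimal brute-force sketch in plain Python (not RapidMiner code). The function names are made up for illustration, and treating 0/0 terms of the Canberra distance as 0 is an assumption. Nothing is trained: each prediction simply computes the distance from the new instance to every stored row and lets the k closest rows vote.

```python
from collections import Counter


def canberra(a, b):
    """Canberra distance: sum of |a_i - b_i| / (|a_i| + |b_i|).

    Terms where both values are 0 are skipped (assumed to contribute 0).
    """
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom > 0:
            total += abs(x - y) / denom
    return total


def knn_predict(train_X, train_y, query, k=5):
    """Classify `query` by majority vote among its k nearest stored rows."""
    # Brute force: one distance computation per stored training row.
    order = sorted(range(len(train_X)),
                   key=lambda i: canberra(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

With n stored rows and m attributes, one prediction costs roughly n * m operations plus a sort, which is why a few thousand rows are usually unproblematic.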
~Martin
Dortmund, Germany
well, I have only 4500 examples in my data set, I use 70% for training and 30% for testing in my X-Validation. I am using subset selection of attributes (6 out of 25) and I'm getting around 80-85% accuracy in classification.
therefore it's no problem, it takes a few seconds to calculate the class of a new point...
however, I am using a parameter optimization grid search with 6000 combinations, that's why I get around 85% on some parameter settings... and I would like to know if I have to build the models from scratch every time a new instance comes in, or if it's possible to store a bunch of k-nn models and do the k-nn comparison on a new instance "on-the-fly"? because with parameter optimization, it takes some hours (!) to build all those models and achieve 85% accuracy.
the thing I wanted to do originally is to calculate about the 10 best k-nn models (with highest accuracy) from a subset of training sets, with the best parameters, and then use those 10 models for the new test instances in a majority-vote manner. Is this the same as bagging or boosting with a k-nn model?
example:
If I have the performance of the first 10 models for k-nn, I want to use those 10 to classify future new instances...
I need to know if my idea above is a good one and if the 10-model majority vote could really deliver around 85% accuracy for future new instance classifications...
Fred,
what you propose is voting, which is also an ensemble method, so it is kind of similar.
Please keep in mind that with X-Val the model you use in production is the one built on the full data. The 70-30 split is only there to measure the performance.
~Martin
hm but to understand you correctly, in X-Val, I use 70% of my data for training the model, and the other 30% are for testing only, correct? therefore, the model has not seen the 30% for testing before, am I right?
my original question is still not answered.. I wanted to know if I have to build the model again with all training data every time a new instance to classify comes in, or if I can store the model and retrieve it when necessary to classify a new instance against it...?
Hi Fred,
first let me answer your question: Of course you can store your model and retrieve it later for application. The operators are called "Store" and "Retrieve". Store saves it somewhere in your repository (you can store ANYTHING that goes over a connection line: data, models, performance, ...) and with Retrieve you get it back into your process.
(Otherwise it would be quite impractical for real applications and more complex models.)
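For readers outside RapidMiner, the same store-once, retrieve-later pattern can be sketched in plain Python with the standard `pickle` module. This is only an analogy, not what the Store and Retrieve operators do internally; the helper names, the file name, and the toy "model" (the stored training rows plus k) are all made up for illustration.

```python
import os
import pickle
import tempfile


def store(obj, path):
    """Write any picklable object (data, model, results) to disk."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def retrieve(path):
    """Load a previously stored object back into memory."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# A k-NN "model" is essentially the training data plus its parameters.
model = {"k": 5, "X": [[1.0, 2.0], [3.0, 4.0]], "y": ["a", "b"]}

# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.gettempdir(), "knn_model_example.pkl")
store(model, path)
restored = retrieve(path)
```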
And now I think I have earned the right to be a smart-*hidden* and point out that if you do a 70% / 30% split, it's actually NOT a cross-validation (X-Val in RapidMiner language), but a Split Validation. This is generally faster but gives a less accurate estimate of the performance.
I would also be curious what your class balance is, because 85% accuracy alone does not tell whether this is good or not. Just imagine detecting terrorists. Having 85% accuracy would mean that we would probably classify 15% of the population as terrorists... So how good an accuracy value is depends on the class balance.
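A small worked example of why accuracy has to be read against the class balance, as a plain Python sketch. The counts are invented purely for illustration (10 rare cases among 10,000), and the helper names are hypothetical. A "classifier" that always predicts the common class reaches very high accuracy while never finding a single rare case.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of actual `positive` cases that were predicted as such."""
    actual = sum(1 for t in y_true if t == positive)
    hits = sum(1 for t, p in zip(y_true, y_pred)
               if t == positive and p == positive)
    return hits / actual if actual else 0.0


# Invented, heavily imbalanced data: 10 rare cases in 10,000.
y_true = ["rare"] * 10 + ["common"] * 9990
# Trivial baseline: always predict the majority class.
y_pred = ["common"] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)       # 9990 / 10000
baseline_recall = recall(y_true, y_pred, "rare")   # no rare case found
```

Comparing a model against this kind of majority-class baseline, and looking at per-class precision and recall, says more than the accuracy number alone.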
Greetings,
Sebastian
yeah, you're probably right, but I used X-Validation later (yes, the operator's name is X-Validation)
and class distribution is about 50/30/20. I used MetaCost to penalize wrong classifications for the 30/20 classes more... however I get around 70-80% precision and recall on those 2 classes..
@mschmitz, regarding your quote:
so it only needs to compare the new instance to all other instances that are already stored in the trained model, but you don't need to train the complete model with all other data points again. as far as I understood, that's what I wanted to know. Ok thanks
But Bagging with majority vote probably makes no sense, as the k-nn models hardly differ from each other if they use the same k, I guess. maybe it only makes sense if you apply the new data to models with different k values.... but it's questionable whether that makes sense, too...
Fred,
training a k-NN is essentially storing the data. Yes, you don't need to retrain it.
Bagging on a k-NN might make sense. The bagging approach applies to k-NN as well. I am not sure if it helps though, because the effect is most likely smaller than for tree-based algorithms.
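For reference, this is roughly what bagging a k-NN looks like, as a minimal plain Python sketch (not RapidMiner's Bagging operator). Euclidean distance is used here only to keep the example short, and all names and defaults are assumptions. Each of the `n_models` members sees a bootstrap sample (drawn with replacement) of the stored data, and the members vote on the final class.

```python
import random
from collections import Counter


def knn_predict(X, y, query, k):
    """Plain k-NN with squared Euclidean distance and majority vote."""
    order = sorted(
        range(len(X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], query)),
    )
    return Counter(y[i] for i in order[:k]).most_common(1)[0][0]


def bagged_knn_predict(X, y, query, k=3, n_models=10, seed=0):
    """Majority vote over k-NN models built on bootstrap samples."""
    rng = random.Random(seed)
    n = len(X)
    votes = Counter()
    for _ in range(n_models):
        # Bootstrap sample: n row indices drawn with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_X = [X[i] for i in idx]
        sample_y = [y[i] for i in idx]
        votes[knn_predict(sample_X, sample_y, query, k)] += 1
    return votes.most_common(1)[0][0]
```

Because every bootstrap sample contains mostly the same neighbours, the members tend to agree with each other, which fits the expectation that the gain is smaller than for trees.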
~Martin