[SOLVED] Applying a pre-trained model on new data
I have the following concern. If I apply a model to data that has a slightly different set of features than the data the model was trained on, what happens to the values of attributes that are present in the model but not in the test data, and vice versa? Does this prevent the model from being applied correctly?
This problem occurs in text classification, where the features are words and the feature set is a word list. When I extract a word list from a set of training documents and then want to classify a new document, the features of the new document will obviously be different. How should this be handled?
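To make the situation concrete, here is a minimal sketch assuming a plain bag-of-words count representation. The helper names `build_wordlist` and `vectorize` and the example documents are hypothetical, not taken from any particular tool. The word list is built from the training documents only, and a new document is then mapped onto that fixed list.

```python
from collections import Counter

def build_wordlist(documents):
    """Collect the sorted set of words seen in the given documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def vectorize(document, wordlist):
    """Count vector over a fixed word list.

    Words of the document that are not in the word list are dropped;
    words of the word list that are absent from the document get 0.
    """
    counts = Counter(document.lower().split())
    return [counts[word] for word in wordlist]

# Hypothetical training documents and one new document.
train_docs = ["cheap pills online", "meeting agenda attached"]
wordlist = build_wordlist(train_docs)

# "tomorrow" is unknown to the training word list and is ignored.
new_vector = vectorize("cheap meeting tomorrow", wordlist)
print(wordlist)
print(new_vector)
```

In this sketch the new document always ends up with the same number of columns, in the same order, as the training data.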
I would expect that applying the old model to new data would give the same results as if the features were extracted collectively, since missing values would be assumed to be 0 and those words were not present in the test data anyway. However, I have compared these two approaches:
1. Extracting features from the whole data set, dividing the data into test and training sets, training a classifier, and measuring the accuracy
2. Dividing the data into test and training sets, extracting features from each set independently, training a classifier, and measuring the accuracy
and I found that in the second case the classification accuracy is much lower (it dropped from 70% to 20%). Is that something I should have expected, or is my logic wrong? Is there any way to "fix" the new data to match the old model, or to fix the old model to match the new data? Or is my approach completely wrong?
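One possible explanation for the drop, under the assumption that features end up being matched by column position rather than by word, can be sketched as follows. The helper `build_wordlist` and the example documents are hypothetical. If the test word list is extracted independently, column i of the test data may refer to a different word than column i of the model.

```python
def build_wordlist(documents):
    """Collect the sorted set of words seen in the given documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})

# Hypothetical data: word lists are extracted from each set independently.
train_docs = ["cheap pills online", "meeting agenda attached"]
test_docs = ["cheap meeting tomorrow"]

train_wordlist = build_wordlist(train_docs)
test_wordlist = build_wordlist(test_docs)

# Compare the word behind each column index in the two word lists.
mismatched = [
    (index, train_word, test_word)
    for index, (train_word, test_word) in enumerate(zip(train_wordlist, test_wordlist))
    if train_word != test_word
]

for index, train_word, test_word in mismatched:
    print(f"column {index}: model expects {train_word!r}, test data has {test_word!r}")
```

Whether this is what happens in a given setup depends on how the tool aligns attributes, so it is worth checking that first.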