binary text classification test-set problem

yeahiii · November 2010

Hey,
I created a process to classify two categories of documents. Everything works fine as long as I reduce the test set (which comes from a different database / domain) to only one class (recall 99%). If I remove the filter so that the test set contains both classes, the whole process doesn't work anymore. I don't think it's an overfitting problem, since the test data comes from another database. Currently my setup looks like this:

DB-Training -> Process Documents (TF/IDF) -> Train libSVM --------------------------V

DB-Test (different db) -> Filter Class 1 -> Process Documents (TF/IDF) -> Apply SVM -> Performance (Recall of Class 2 = 99%)
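A note on the metric in this setup: once class 1 is filtered out, recall of class 2 only measures how often the model says "class 2", not whether it can tell the two classes apart. Below is a minimal sketch in plain Python to illustrate that. The constant `always_class_2` model and the toy label lists are hypothetical stand-ins, not the actual SVM or data from the process above.

```python
def recall(y_true, y_pred, label):
    """Recall for `label`: correctly predicted examples / all true examples of that label."""
    relevant = [pred for true, pred in zip(y_true, y_pred) if true == label]
    if not relevant:
        return None  # undefined: the label never occurs in y_true
    return sum(pred == label for pred in relevant) / len(relevant)


def always_class_2(examples):
    """Hypothetical degenerate model that ignores its input and predicts class 2."""
    return [2] * len(examples)


# Toy labels (assumed for illustration only).
filtered_labels = [2, 2, 2, 2]        # test set after class 1 has been filtered out
full_labels = [1, 1, 1, 2, 2, 2]      # test set with both classes

filtered_pred = always_class_2(filtered_labels)
full_pred = always_class_2(full_labels)

recall_filtered_c2 = recall(filtered_labels, filtered_pred, 2)
recall_full_c2 = recall(full_labels, full_pred, 2)
recall_full_c1 = recall(full_labels, full_pred, 1)

print("filtered set, recall class 2:", recall_filtered_c2)
print("full set,     recall class 2:", recall_full_c2)
print("full set,     recall class 1:", recall_full_c1)
```

A model that has learned nothing still scores perfect class 2 recall on the filtered set, so it may be worth looking at per-class recall (or the full confusion matrix) on the unfiltered test set before trusting the 99% figure.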

I did NOT connect the wordlist of the training-side "Process Documents" operator to the test-side "Process Documents" operator. If I do so, the recall drops to 0%. Am I doing something wrong in the Process Documents step for the training data, or am I missing something?
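To make the wordlist question concrete, here is a rough sketch in plain Python (not RapidMiner) of what happens when the test documents get their own wordlist instead of the training one. The helper names `build_wordlist` and `vectorize`, the smoothed IDF formula, and the toy documents are all assumptions for illustration; RapidMiner's TF/IDF weighting may differ in detail.

```python
import math
from collections import Counter


def build_wordlist(docs):
    """Learn a term -> column index mapping and IDF weights from a document set."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    n_docs = len(docs)
    # Smoothed IDF (one common variant, assumed here).
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0 for tok in vocab}
    return index, idf


def vectorize(doc, index, idf):
    """Map one document onto the columns of a given wordlist; unknown terms are dropped."""
    counts = Counter(doc.lower().split())
    vec = [0.0] * len(index)
    for tok, count in counts.items():
        if tok in index:
            vec[index[tok]] = count * idf[tok]
    return vec


# Toy documents (hypothetical).
train_docs = ["cheap pills online", "meeting agenda attached", "cheap online offer"]
test_docs = ["agenda for the meeting", "online pills"]

train_index, train_idf = build_wordlist(train_docs)
test_index, test_idf = build_wordlist(test_docs)   # wordlist NOT connected

shared = [vectorize(d, train_index, train_idf) for d in test_docs]
separate = [vectorize(d, test_index, test_idf) for d in test_docs]

print("training columns:", len(train_index))
print("test columns with training wordlist:", len(shared[0]))
print("test columns with own wordlist:", len(separate[0]))
print("column of 'meeting' in train vs. test wordlist:",
      train_index["meeting"], test_index["meeting"])
```

With a separate wordlist, the test vectors have a different number of columns and the same term lands in a different column, so the model trained on the training columns is not being fed the attributes it learned. If that also holds in the process above, the 0% recall seen with the connected wordlist may be the more meaningful measurement, though the cause could equally be something else, such as little vocabulary overlap between the two databases.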
