binary text classification test-set problem

yeahiii Member Posts: 3 Contributor I
edited November 2019 in Help
Hey,
I created a process to classify documents into 2 categories. Everything works fine as long as I reduce the test set (which comes from a different database / domain) to only 1 class (recall 99%). If I stop filtering out the second class, the whole process doesn't work anymore. I don't think it's an overfitting problem, since the test data comes from another database. Currently my setup looks like this:

DB-Training -> Process Documents (TF/IDF) -> Train libSVM --------------------------V

DB-Test (different db) -> Filter Class 1 -> Process Documents (TF/IDF) -> Apply Svm -> Performance (Recall of Class 2 = 99%)

I did NOT connect the word list output of the training-db "Process Documents" operator to the word list input of the test-db "Process Documents" operator. If I do so, the recall drops to 0%. Am I doing something wrong in the Process Documents step of the training-data part, or am I missing something?
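
To make the two wordlist variants concrete, here is a minimal pure-Python sketch of what I understand the difference to be. It is not RapidMiner code: `build_wordlist`, `vectorize`, and the toy documents are hypothetical names and data made up for illustration, and the TF-IDF weighting is a simplified variant. It only shows how the feature columns line up (or don't) when the test documents are vectorized with the training wordlist versus a wordlist built from the test data itself.

```python
import math
from collections import Counter

def build_wordlist(docs):
    """Learn a wordlist (token -> column index) and IDF weights from tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / doc_freq[tok]) + 1.0 for tok in vocab}
    return index, idf

def vectorize(doc, index, idf):
    """Build a TF-IDF vector for one doc; tokens missing from the wordlist are dropped."""
    vec = [0.0] * len(index)
    for tok, count in Counter(doc).items():
        if tok in index:
            vec[index[tok]] = (count / len(doc)) * idf[tok]
    return vec

# Toy data (hypothetical), already tokenized.
train_docs = [["cheap", "pills", "offer"], ["meeting", "agenda", "offer"]]
test_docs = [["agenda", "for", "meeting"], ["pills", "pills", "cheap"]]

train_index, train_idf = build_wordlist(train_docs)
test_index, test_idf = build_wordlist(test_docs)

# Variant A: wordlist connected, test docs use the training columns.
shared = [vectorize(d, train_index, train_idf) for d in test_docs]

# Variant B: wordlist not connected, test docs get their own columns.
separate = [vectorize(d, test_index, test_idf) for d in test_docs]

# Column order of each wordlist, for comparison.
print(sorted(train_index, key=train_index.get))
print(sorted(test_index, key=test_index.get))
```

In this toy case both wordlists happen to have the same number of columns, but the same column index refers to different words in each, so a model trained on the training columns would read variant B's vectors with the wrong meaning.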