
Feature selection / reduction for text mining

In777 Member Posts: 29 Contributor II
edited December 2018 in Help

I am working on a binary, unbalanced text-mining classification problem: I have 1,000 sentences for one class and 20,000 sentences outside the class. First I created a balanced sample by over-sampling (copying the minority-class sentences several times). Then I pre-processed the sentences: tokenize, delete stopwords, morphological standardization, filter out words shorter than 2 characters, stem, lowercase, and create n-grams (N = 3). I used TF-IDF for weighting and pruned rare n-grams (those occurring in fewer than 5% of documents). Then I trained a C-SVM (LibSVM; alternative: Naive Bayes) on the resulting 18,000 features. The cross-validation accuracy, recall, etc. were great, around 98%. But when I scored a holdout set, I found that unseen sentences are classified incorrectly; for example, sentences containing informative words from the TF-IDF list created by the model are classified wrongly. If I use under-sampling instead, the accuracy of my model is only 60%. I am confused.
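In case it clarifies my setup, here is roughly the equivalent pipeline in Python with scikit-learn (a sketch only; the placeholder corpus, the English stopword list, and the RBF/C settings are stand-ins for my actual RapidMiner operator choices):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# placeholder corpus; in practice, sentences/labels come from my labeled data
sentences = ["fast helpful reply from support", "slow unhelpful response again"] * 10
labels = [1, 0] * 10

vectorizer = TfidfVectorizer(
    lowercase=True,        # lowercasing
    stop_words="english",  # stopword removal
    ngram_range=(1, 3),    # terms up to 3-grams, approximating my n-gram step (N=3)
    min_df=0.05,           # prune terms occurring in fewer than 5% of documents
)
# note: the default token_pattern already drops 1-character tokens; stemming and
# morphological standardization are omitted here (I'd use NLTK for those)
X = vectorizer.fit_transform(sentences)  # vectorizing before CV, as in my setup

clf = SVC(kernel="rbf", C=1.0)           # C-SVC, as in LibSVM
scores = cross_val_score(clf, X, labels, cv=5, scoring="accuracy")
print(scores.mean())
```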

I presume I have to use some feature reduction/selection technique (e.g. chi-squared, or a p-value for each n-gram) to improve the situation, but I do not understand which one to choose or how to implement it in RapidMiner or Python. So far I have only pruned words below the 5% document-frequency threshold, and that choice was arbitrary. How should I conduct feature reduction/optimization for text classification in general? And what else could cause a problem like this?
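For the chi-squared idea, this is the kind of thing I have in mind in Python (scikit-learn), continuing from the snippet above. It is only a sketch; the p < 0.05 cutoff and k = 1000 are arbitrary guesses I would have to tune:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# per-feature chi-squared statistic and p-value against the class label
chi2_scores, p_values = chi2(X, labels)

# option A: keep only n-grams significant at p < 0.05 (cutoff is a guess to tune)
keep = np.where(p_values < 0.05)[0]
X_reduced = X[:, keep]

# option B: keep a fixed number of top-scoring n-grams (k is a guess to tune)
selector = SelectKBest(chi2, k=min(1000, X.shape[1]))
X_reduced = selector.fit_transform(X, labels)
```

Is this the right way to apply it, and is there an equivalent operator in RapidMiner?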
