Pruning a large set of features for text classification
I am working on a binary text classification task. I applied several preprocessing steps to my training data (stopword removal, stemming, morphology, lowercasing, n-gram creation, etc.) and created a TF-IDF vector. I deleted the rare n-grams (prune below 5%) and ended up with 18,000 n-grams. The choice of cutoff is arbitrary, and that bothers me. I then applied a linear C-SVM (LibSVM). Unfortunately, the accuracy of my model on the test set is very low. I think I have too many features left and want to reduce their number, so I decided to use information gain to keep only the most informative words. To do this, I placed the "Weight by Information Gain" operator and then "Select by Weights" after the "Process Documents" operator. At the end, I ran cross-validation with the linear SVM inside it. However, I got an error that the sample does not include the meta data. I am not sure what I am doing wrong or how to fix it.
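To make the intended selection step concrete, here is a minimal sketch in plain Python (not RapidMiner) of ranking terms by information gain and keeping the top k. It assumes each document is reduced to a set of terms and that a feature is simply "term present or absent"; the names `entropy`, `information_gain`, `top_k_terms` and the toy data are made up for illustration and do not correspond to any RapidMiner operator.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(docs, labels, term):
    """Entropy reduction obtained by splitting on presence of `term`."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


def top_k_terms(docs, labels, k):
    """Return the k terms with the highest information gain."""
    vocabulary = sorted(set().union(*docs))
    scored = [(information_gain(docs, labels, t), t) for t in vocabulary]
    # Highest gain first; ties broken alphabetically for reproducibility.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]


# Hypothetical toy corpus: each document is a set of terms, label 1 or 0.
docs = [{"good", "movie"}, {"great", "movie"}, {"bad", "movie"}, {"bad", "plot"}]
labels = [1, 1, 0, 0]

print(top_k_terms(docs, labels, 2))
```

In this toy corpus "bad" separates the two classes perfectly, so it ranks first. In a real setup the ranking would presumably be computed on the training folds only, so that the test fold does not influence which features are kept.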
Besides that, what is the best way to prune a large set of features down to a manageable set of the most discriminative ones, and how can I implement it in RapidMiner? How else can I improve the performance of my model?