Prune a large set of features in case of text classification

In777 Member Posts: 29 Contributor II
edited December 2018 in Help

I am working on a binary text classification task. I've done several preprocessing steps on my training data (stopword removal, stemming, morphology, lowercasing, n-gram creation, etc.) and created a TF-IDF vector. I deleted the rare n-grams (prune below 5%) and was left with 18,000 n-grams. The choice of cutoff is arbitrary, and that bothers me. Then I applied a linear C-SVM (LibSVM). Unfortunately, the accuracy of my model on the test set is very low. I think I have too many features left and want to reduce their number, so I decided to use information gain to keep only the most informative words. I used the "Weight by Information Gain" operator and then "Select by Weights" after the "Process Documents" operator. At the end I used cross-validation with the linear SVM inside it. But I got an error saying that the sample does not include the meta data. I am not sure what I am doing wrong or how to fix it.

Besides, what is the best way to prune a large set of features down to a manageable set of the most discriminative features, and how can I implement it in RapidMiner? How else can I improve the performance of my model?
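For readers who want to see what information-gain-based selection computes, here is a minimal sketch in plain Python rather than RapidMiner. It assumes each document is represented as a set of terms (presence or absence, not TF-IDF weights), and the names `entropy`, `information_gain`, `top_k_terms` and the toy data are made up for illustration. It is not how the "Weight by Information Gain" operator is implemented internally.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Entropy reduction obtained by splitting documents on presence of `term`."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = ((len(with_term) / n) * entropy(with_term)
                   + (len(without_term) / n) * entropy(without_term))
    return entropy(labels) - conditional


def top_k_terms(docs, labels, k):
    """Return the k terms with the highest information gain (ties broken alphabetically)."""
    vocabulary = sorted(set().union(*docs))
    scored = [(information_gain(docs, labels, t), t) for t in vocabulary]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:k]]


# Hypothetical toy training set: each document is a set of terms.
train_docs = [
    {"cheap", "offer", "now"},
    {"cheap", "deal", "now"},
    {"meeting", "agenda", "now"},
    {"meeting", "notes", "now"},
]
train_labels = ["spam", "spam", "ham", "ham"]

selected = top_k_terms(train_docs, train_labels, k=2)
print(selected)
```

A term such as "now" that occurs in every document gets a gain of zero, which is why it is dropped. To avoid leaking test information into the selection, the weights would normally be computed on the training portion of each cross-validation fold only.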



  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    I'm assuming that this post is related to the imbalanced thread you started here. I would definitely start by balancing your training data and then feeding it into your text processing. From the sounds of it, your text processing setup is pretty standard. What I would consider is putting both the Text Processing and the Validation with your linear SVM inside an Optimize Parameters operator and varying the C for the SVM and the pruning parameters. This way you can see whether adjusting those parameters, with your balanced data, gets you better performance.
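One simple way to balance the training data is random undersampling of the larger class. The reply does not say which balancing method is meant, so this is an assumption; the sketch below is plain Python, not a RapidMiner process, and the function name `undersample` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def undersample(examples, labels, seed=0):
    """Randomly drop examples so every class is as small as the rarest one."""
    by_class = defaultdict(list)
    for example, label in zip(examples, labels):
        by_class[label].append(example)

    # Size of the smallest class decides how many examples each class keeps.
    target = min(len(items) for items in by_class.values())

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    balanced = []
    for label, items in by_class.items():
        for example in rng.sample(items, target):
            balanced.append((example, label))
    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced training set: 2 positive and 8 negative documents.
texts = [f"doc {i}" for i in range(10)]
classes = ["pos"] * 2 + ["neg"] * 8

balanced = undersample(texts, classes)
print(balanced)
```

Only the training data would be balanced this way; the test set should keep its original class distribution so that the reported performance reflects the real data.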

  • In777 Member Posts: 29 Contributor II

    Thanks a lot for the suggestions. I will try playing with the parameters first. Which performance measure is appropriate for comparison in this case? Should I work with AUC? And if I still want to use feature reduction based on IG and chi-squared, how could it be implemented for text classification?
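As background on the AUC question: AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, which makes it less sensitive to class imbalance than accuracy. The following is a minimal plain-Python sketch of that definition, assuming the classifier outputs a real-valued score per document (for an SVM, the decision value). The function name `roc_auc` and the numbers are illustrative only.

```python
def roc_auc(scores, labels, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical classifier scores and true labels for six test documents.
decision_values = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
true_labels = [1, 1, 1, 0, 0, 0]

auc = roc_auc(decision_values, true_labels)
print(auc)
```

The pairwise loop is quadratic in the number of documents, which is fine for illustration; library implementations typically use a sort-based computation instead.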
