
Text mining: Data cleaning and model ensembling?

kasper2304 Member Posts: 28 Contributor II
edited November 2018 in Help
Hi guys.

I need some help elaborating a little on my choice of method: how best to do data cleaning, and how to create and combine several trained models.

My case is the following:

Dataset: 2998 cases -> 337 positives & 2661 negatives
Partitioning: 85% for training and validation, 15% for testing -> 2262 negatives / 286 positives for training and validation & 399 negatives / 51 positives for testing

What I have read is that one can cluster the negative cases and then train a separate model on each cluster together with the positive cases, combining the models in the end. Has anyone applied that method, or can anyone explain a variant that can be performed in RapidMiner?
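For concreteness, here is a minimal Python/scikit-learn sketch of that cluster-then-ensemble idea as I understand it (I know this is outside RapidMiner); the k-means clustering, the cluster count and the majority-vote combination are my own assumptions, not a fixed recipe.

```python
# Minimal sketch of the cluster-then-ensemble idea, assuming a dense
# feature matrix X (e.g. from TF-IDF) and a 0/1 label array y.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def train_cluster_ensemble(X, y, n_clusters=5, random_state=42):
    X_pos, X_neg = X[y == 1], X[y == 0]
    # Split the majority (negative) class into clusters.
    cluster_ids = KMeans(n_clusters=n_clusters, random_state=random_state).fit_predict(X_neg)
    models = []
    for c in range(n_clusters):
        # Each model is trained on all positives plus one cluster of
        # negatives, so every training set is far less imbalanced.
        X_c = np.vstack([X_pos, X_neg[cluster_ids == c]])
        y_c = np.concatenate([np.ones(len(X_pos)), np.zeros(np.sum(cluster_ids == c))])
        models.append(LinearSVC().fit(X_c, y_c))
    return models

def predict_cluster_ensemble(models, X):
    # Combine the per-cluster models by majority vote.
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return (votes >= 0.5).astype(int)
```

The ensemble prediction would then be applied to the held-out test set just like a single model.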

I also looked into how to do data cleaning, but I have no clue which technique to use for text mining, as RapidMiner provides several.

Until now my method has simply been to downsample the majority class of my training and validation set, which has given the best results on my test set. I am using an SVM with a linear kernel; the RBF kernel has not yielded better results. For preprocessing my text I used 3-grams and stopword removal.
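For reference, this is roughly what that workflow could look like in Python/scikit-learn; the 1:1 downsampling ratio, the TF-IDF weighting and the word-level (rather than character-level) 3-grams are assumptions on my side, not necessarily what my RapidMiner process does.

```python
# Rough sketch of the current workflow: downsample negatives, build
# stopword-filtered up-to-3-gram features, train a linear SVM.
# Assumes `texts` is a list of strings and `labels` a 0/1 NumPy array.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def train_downsampled_svm(texts, labels, ratio=1.0, seed=42):
    rng = np.random.default_rng(seed)
    pos_idx = np.where(labels == 1)[0]
    neg_idx = np.where(labels == 0)[0]
    # Keep all positives, sample negatives down to ratio * n_positives.
    keep_neg = rng.choice(neg_idx, size=int(ratio * len(pos_idx)), replace=False)
    keep = np.concatenate([pos_idx, keep_neg])

    # Stopword removal plus uni-, bi- and tri-gram tokens.
    vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))
    X_train = vectorizer.fit_transform([texts[i] for i in keep])

    clf = LinearSVC().fit(X_train, labels[keep])
    return vectorizer, clf
```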

Best
Kasper