Text Mining How to remove particular phrases in pre-processing

mob · January 2016

What's the best way to remove repeated sentences from my documents during pre-processing?

I have an example set that includes a "text" column and some other attributes. The text column was read in from files in a folder. The text itself contains a number of repeated phrases that I "think" I should remove before mining, as I think they would skew the word frequencies.

Given that "Filter Stopwords (Dictionary)" can only remove one stopword per line, how do I handle a case like wanting to remove "Assessment and Grading" while still keeping the words "assessment" and "grading" where they appear elsewhere in the document? And how do I extend it so I can add other sentences I need removed?

MartinLiebig · January 2016

Sounds like Remove Document Parts?

~Martin

mob · January 2016

I thought that was for pulling out text I wanted to process further. Can I use it to dump repeated strings from the main text?

MartinLiebig · January 2016

There are Remove Document Parts and Keep Document Parts: one throws out the matching parts of a document, the other keeps them. Both can be configured with a regex.

If you have an example set with keywords, you can use Aggregate with concat on it to generate a regex. It's a bit of a manual approach, but I think it is doable.
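To illustrate the "concatenate the keywords into one regex" idea outside RapidMiner, here is a minimal Python sketch. It is not the operator's implementation; the function names `build_phrase_regex` and `remove_phrases` and the sample text are made up for illustration. Each phrase is escaped and the phrases are joined with `|`, so the pattern only matches the whole phrase and leaves the individual words alone elsewhere.

```python
import re

def build_phrase_regex(phrases):
    """Join phrases into one alternation, escaping regex metacharacters.

    Longer phrases go first so that a short phrase cannot pre-empt a
    longer one that starts with the same words.
    """
    ordered = sorted(phrases, key=len, reverse=True)
    return re.compile("|".join(re.escape(p) for p in ordered))

def remove_phrases(text, phrases):
    """Remove every occurrence of the given phrases (case-sensitive)."""
    if not phrases:
        # An empty alternation would match everywhere, so skip it.
        return text
    stripped = build_phrase_regex(phrases).sub(" ", text)
    # Collapse the whitespace left behind by the removed phrases.
    return " ".join(stripped.split())

# Hypothetical sample data, not taken from the original documents.
phrases = ["Assessment and Grading"]
text = "Assessment and Grading Each assessment is marked before grading begins."

print(remove_phrases(text, phrases))
# Each assessment is marked before grading begins.
```

The match here is case-sensitive, which is why the lowercase "assessment" and "grading" survive. If the repeated phrases vary in case, passing `re.IGNORECASE` to `re.compile` would be one option, but the whole phrase still has to match.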

~Martin

mob · January 2016

Actually, after testing, my assumption about Filter Stopwords (Dictionary) appears to be incorrect. You can add an entire phrase as a "stop word", one per line, and it will be removed, e.g. linguistic sentences.
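For anyone who wants to check the effect on word frequencies, here is a rough Python sketch of the same idea: a dictionary with one phrase per line, applied before counting words. It only mimics the behaviour described above and is not how Filter Stopwords (Dictionary) works internally. The helper names, the second dictionary entry ("Course Outline") and the sample document are all hypothetical.

```python
from collections import Counter
import re

def load_phrases(dictionary_text):
    """Parse a dictionary with one phrase per line, ignoring blank lines."""
    return [line.strip() for line in dictionary_text.splitlines() if line.strip()]

def strip_phrases(text, phrases):
    """Replace each whole phrase with a space (case-sensitive)."""
    for phrase in phrases:
        text = text.replace(phrase, " ")
    return text

def word_counts(text):
    """Count lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Hypothetical dictionary contents and document text.
dictionary_text = "Assessment and Grading\nCourse Outline\n"
document = (
    "Assessment and Grading Week 1 assessment is a quiz. "
    "Assessment and Grading Week 2 grading is by rubric."
)

before = word_counts(document)
after = word_counts(strip_phrases(document, load_phrases(dictionary_text)))

print(before["assessment"], after["assessment"])  # 3 1
print(before["grading"], after["grading"])        # 3 1
```

The repeated heading accounts for two of the three occurrences of each word, so removing the phrase first brings the counts back to what the body text actually contains.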
