Text Mining: How to remove particular phrases in pre-processing

mob Member Posts: 37 Contributor II
edited November 2018 in Help
What's the best way to remove repeated sentences from my documents during pre-processing?

I have an example set that includes a "text" column and some other attributes. The text column was read in from files in a folder. The text itself contains a number of repeated phrases that I "think" I should remove before mining, as I think they would skew the word frequencies.

Given that "Filter Stopwords (Dictionary)" can only remove one stopword per line, how do I handle a case like wanting to remove "Assessment and Grading" while still keeping the word "assessment" and the word "grading" where they occur elsewhere in the document? And how do I extend it so I can add other sentences I need removed?

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,510 RM Data Scientist
    Sounds like Remove Document Parts?

    ~Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • mob Member Posts: 37 Contributor II
    I thought that was for pulling out text I wanted to process further. Can I use it to dump repeated strings from the main text?
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,510 RM Data Scientist
    There are two operators, Remove Document Parts and Keep Document Parts: one throws out parts of a document, the other keeps them. Both can be configured with a regex.

    If you have an example set with keywords, you can use Aggregate with concatenation on it to generate a regex. This is a somewhat manual approach, but I think it is doable.
  • mob Member Posts: 37 Contributor II
    Actually, after testing, my assumption about Filter Stopwords (Dictionary) appears to be incorrect. You can add an entire phrase as a "stop word", one phrase per line, and it will be removed, e.g. linguistic sentences.
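
The idea raised in the thread of concatenating a list of phrases into one regex can be illustrated outside RapidMiner. The following is a minimal Python sketch of that idea, not the RapidMiner operators themselves: the function names `build_phrase_pattern` and `remove_phrases` are hypothetical, "Course Outline" is an invented second phrase, and the sketch assumes every phrase starts and ends with a letter or digit so that word-boundary matching behaves as expected.

```python
import re


def build_phrase_pattern(phrases):
    """Join a list of phrases into one case-insensitive alternation regex."""
    if not phrases:
        raise ValueError("at least one phrase is required")
    # Longest phrases first, so a long phrase is tried before a shorter one it contains.
    ordered = sorted(set(phrases), key=len, reverse=True)
    # re.escape keeps characters such as '.' or '(' inside a phrase literal.
    alternation = "|".join(re.escape(p) for p in ordered)
    # \b restricts matches to whole phrases (assumes phrases begin/end with a word character).
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def remove_phrases(text, pattern):
    """Delete every occurrence of the phrases and tidy the leftover whitespace."""
    stripped = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", stripped).strip()


# "Assessment and Grading" comes from the question; "Course Outline" is a made-up example.
phrases = ["Assessment and Grading", "Course Outline"]
pattern = build_phrase_pattern(phrases)

doc = "Assessment and Grading The assessment is weekly and grading is fair."
print(remove_phrases(doc, pattern))
```

Because only the whole phrase is matched, the standalone words "assessment" and "grading" later in the text are left in place, which is the behaviour asked for in the original question. Adding further sentences to remove only means extending the `phrases` list.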