Filter Stopwords with Regular Expression

Anna_May1 Member Posts: 14 Learner I
Hi guys,

I'm currently doing a sentiment analysis in RapidMiner with k-NN. I want to count the number of words left in each document after removing stopwords. The "Filter Stopwords" operator inside the "Process Documents from Data" operator only works if I tokenize the data and apply the "Nominal to Text" operator first. The issue is that the output then looks like the image below. I want to be able to count the words that remain after the stopwords are removed, so I wonder whether there is a regular expression that could be used inside a "Replace" operator or similar to remove only the stopwords, without tokenizing the text.



  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    @Anna_May1 I am unable to see the image, as it has not been attached. However, it would be much easier to deal with stop words, or to count words, after you tokenise the text. For example, you can have two streams of text processing, one with and one without stop words; then count the tokens in both and take the difference. In fact, when your text representation is by frequency, the counting is very simple: adding those frequencies within columns.
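
The two-stream count described above, and the regex-only removal asked about in the question, can be sketched outside RapidMiner in a few lines of Python. This is only an illustration under stated assumptions: `STOPWORDS` is a small hypothetical list (not RapidMiner's built-in English list), and `tokenize`, `count_tokens`, and `strip_stopwords_regex` are made-up helper names, not RapidMiner operators.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "and", "is", "of", "to", "in", "it"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def count_tokens(text):
    """Return (tokens before filtering, tokens after filtering)."""
    tokens = tokenize(text)                           # stream 1: all tokens
    kept = [t for t in tokens if t not in STOPWORDS]  # stream 2: stopwords removed
    return len(tokens), len(kept)


def strip_stopwords_regex(text):
    """Remove stopwords with one whole-word regex, without tokenizing first."""
    pattern = r"\b(?:" + "|".join(map(re.escape, sorted(STOPWORDS))) + r")\b"
    stripped = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    # Collapse the gaps left behind by the removed words.
    return re.sub(r"\s+", " ", stripped).strip()


if __name__ == "__main__":
    review = "The movie is a triumph of style and wit"
    total, remaining = count_tokens(review)
    print(total, remaining, total - remaining)
    print(strip_stopwords_regex(review))
```

The same whole-word alternation pattern, `\b(?:word1|word2|...)\b`, is the kind of expression one could try in a "Replace" operator, though whether it behaves identically there is an assumption that would need checking against the operator's regex dialect.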