How to use the Filter Stopwords (Dictionary) operator

Anusha Member Posts: 5 Contributor II
Hello All,

I'm struggling with the Filter Stopwords operator. I have a table with 3 columns: column_1 holds a list of words, and another column holds text. I need to remove the words listed in column_1 from the text column.
How can I also remove user-defined (specific) words from the text?
If I use the Filter Stopwords operator, how do I convert the text column into a document to use as input to the operator?

Thanks in Advance!

Best Answer

  • SabaRG Member Posts: 6 Contributor II
    edited May 1 Solution Accepted
    Hi @Anusha
    There are several ways to do this:
    1- Use the "Replace" operator from "Blending\Values" to remove your words with a regular expression. This is the simplest way.
    2- Use "Filter Tokens Using ExampleSet" from the "Operator Toolbox\Text Processing" extension, or "Filter Stopwords" from "HanMiner\Processing\Filtering", to define your own stopword list and remove those words from a document. These operators work on documents, so you have to convert your data to documents and back again. You can do this inside "Loop Examples" with the following operators:
    a) Use "Filter Example Range" with the "%{example}" macro to select the current row.
    b) Use the "Extract Document" operator from the "Text Processing" extension to convert your text attribute to a document, with 1 as the example index.
    c) Use your "Filter Stopwords" operator.
    d) Use "Documents to Data" to convert the document back to an example set.
    e) Use "Cartesian Product" to join the new data with the row's other data.
    f) Use "Select Attributes" to remove the old text attribute.
    Finally, use an "Append" operator outside the loop to combine the processed rows.
    I suggest the first approach, but if you need further operations such as tokenizing or stemming, the second approach is more appropriate.
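
The regular-expression idea behind the first approach can be sketched outside RapidMiner. The Python below is only an illustration, not a RapidMiner process: the function names `build_stopword_pattern` and `remove_words` and the sample rows are hypothetical, and it assumes that column_1 as a whole is the word list and that each entry consists of letters or digits (so that `\b` word boundaries behave as expected).

```python
import re


def build_stopword_pattern(words):
    """Build one case-insensitive regex that matches any listed word as a whole word."""
    # Drop empty entries and duplicates; put longer entries first so they
    # are tried before shorter ones in the alternation.
    cleaned = sorted({w.strip() for w in words if w and w.strip()},
                     key=len, reverse=True)
    if not cleaned:
        return None
    # re.escape makes characters such as '.' match literally.
    alternation = "|".join(re.escape(w) for w in cleaned)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def remove_words(text, pattern):
    """Remove every match of the pattern from text and tidy up the spacing."""
    if pattern is None:
        return text
    stripped = pattern.sub(" ", text)
    # Collapse the runs of whitespace left behind by the removed words.
    return " ".join(stripped.split())


# Hypothetical sample data standing in for the table in the question.
rows = [
    {"column_1": "the", "text": "The cat sat on the mat"},
    {"column_1": "on", "text": "Dogs bark on and on"},
    {"column_1": "a", "text": "A bird is a theory"},
]

pattern = build_stopword_pattern(row["column_1"] for row in rows)
cleaned_texts = [remove_words(row["text"], pattern) for row in rows]
print(cleaned_texts)
```

The word boundaries keep "a" from being stripped out of "cat" and "the" out of "theory". A pattern of the same shape, for example `\b(?:the|on|a)\b`, should in principle also work as the regular expression in the "Replace" operator, but that has not been verified here.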