
Filter Stopwords (English) takes out a non-stopword token

AO1 Member Posts: 3 Learner I

Greetings community,

I am learning to use RapidMiner to extract and analyse occurrences of selected keywords in annual reports prepared by commercial entities. RapidMiner works well for all the keywords I study, except for one.

For some reason, the Filter Stopwords (English) operator filters out the word 'important' across the whole corpus of documents I study.

E.g. I have a document where a manual search shows that it contains the following words of interest:

important - 11
importantly - 4
importance - 4

Using Process Documents from Files with the Filter Stopwords (English) operator enabled, I see only occurrences of the words 'importantly' and 'importance'; with the operator disabled, I can also extract the expected 11 occurrences of the word 'important'.
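
Whether 'important' survives depends entirely on which stopword list is in play. Here is a minimal Python sketch of the same tokenize → lowercase → filter chain; the word lists are illustrative only (though some extended English lists, e.g. the SMART list, do contain "important"):

```python
import re
from collections import Counter

def tokenize(text):
    # "non letters" mode: split on anything that is not a letter, then lowercase
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]

def filter_stopwords(tokens, stopwords):
    return [t for t in tokens if t not in stopwords]

text = "Importantly, the importance of an important decision is important."
tokens = tokenize(text)

# A list WITHOUT "important": all three word forms survive the filter
small_list = {"the", "of", "an", "is"}
print(Counter(filter_stopwords(tokens, small_list)))

# A broader list that happens to include "important": only the derived
# forms "importantly" and "importance" survive
broad_list = small_list | {"important"}
print(Counter(filter_stopwords(tokens, broad_list)))
```

This mirrors the behaviour described above: the derived forms pass through either way, while the base form disappears exactly when it is on the list.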

I tried changing the tokenizing mode from 'non letters' to 'linguistic tokens', but it did not help.

Question: Is this a known error?

(I don't see the </> icon to share my process.)

Kind regards,

Answers

  • AO1 Member Posts: 3 Learner I
    edited September 19

    Process and test document added

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="10.4.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="10.4.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="text:process_document_from_file" compatibility="10.0.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="45" y="34">
            <list key="text_directories">
              <parameter key="test RM" value="C:/Users/ovsyanna/OneDrive - Lincoln University/My Documents/test for RM/test PDF"/>
            </list>
            <parameter key="file_pattern" value="*"/>
            <parameter key="extract_text_only" value="true"/>
            <parameter key="use_file_extension_as_type" value="true"/>
            <parameter key="content_type" value="txt"/>
            <parameter key="encoding" value="SYSTEM"/>
            <parameter key="create_word_vector" value="true"/>
            <parameter key="vector_creation" value="Term Frequency"/>
            <parameter key="add_meta_information" value="true"/>
            <parameter key="keep_text" value="false"/>
            <parameter key="prune_method" value="none"/>
            <parameter key="prune_below_percent" value="3.0"/>
            <parameter key="prune_above_percent" value="30.0"/>
            <parameter key="prune_below_rank" value="0.05"/>
            <parameter key="prune_above_rank" value="0.95"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="10.0.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
                <parameter key="mode" value="non letters"/>
                <parameter key="characters" value=".:"/>
                <parameter key="language" value="English"/>
                <parameter key="max_token_length" value="3"/>
              </operator>
              <operator activated="true" class="text:transform_cases" compatibility="10.0.000" expanded="true" height="68" name="Transform Cases" width="90" x="246" y="34">
                <parameter key="transform_to" value="lower case"/>
              </operator>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="10.0.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="380" y="34"/>
              <operator activated="true" class="text:filter_tokens_by_content" compatibility="10.0.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="648" y="34">
                <parameter key="condition" value="contains"/>
                <parameter key="string" value="importan"/>
                <parameter key="regular_expression" value="(important)"/>
                <parameter key="case_sensitive" value="false"/>
                <parameter key="invert condition" value="false"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
              <connect from_op="Filter Tokens (by Content)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
          <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>

  • rjones13 Member Posts: 200 Unicorn

    Hi @AO1,

    I'm able to replicate this on all the versions available to me. I will see if I can find out more from the development team. In the meantime, I would suggest using Filter Stopwords (Dictionary) for more fine-grained control.
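
    The dictionary approach removes only the words you list yourself, so a broad built-in list can never swallow a domain term like "important" by accident. Outside RapidMiner, the same idea looks like this minimal Python sketch (the stopword dictionary and sample text are illustrative, not RapidMiner's actual list):

```python
import re
from collections import Counter

# User-controlled stopword dictionary: ONLY these words are removed,
# so "important" and its derived forms are never filtered by accident.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was"}

def tokenize(text):
    """Split on non-letters and lowercase, mirroring Tokenize + Transform Cases."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]

def filter_stopwords(tokens, stopwords=STOPWORDS):
    """Keep every token that is not in the user-supplied dictionary."""
    return [t for t in tokens if t not in stopwords]

text = "The importance of this matter is important, and importantly so."
tokens = filter_stopwords(tokenize(text))
print(Counter(t for t in tokens if t.startswith("importan")))
# "important", "importantly" and "importance" all survive the filter
```

    In RapidMiner terms, the dictionary file you feed Filter Stopwords (Dictionary) plays the role of the set above, giving you full control over what gets dropped.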

    Best,

    Roland
