"Stopwords Dictionary Won't Work"

james_hickman Member Posts: 1 Contributor I
edited June 2019 in Help
I can't get the stopwords dictionary operator to work.

I would like to use it to treat whole sentences as stopwords. (I am processing emails, and some of them contain text that is a reply to standard emails.)

However, I have simplified things as much as possible to try to understand the operator.

I have a .txt file with 11 single words, one per line. I use this as the file input for the Filter Stopwords (Dictionary) operator.
I then created a text file containing these 11 words plus a few random words, and used it as the document input for the Filter Stopwords (Dictionary) operator.

When I run the process, all the words are still present. XML below:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="open_file" compatibility="5.3.008" expanded="true" height="60" name="Open File" width="90" x="179" y="165">
        <parameter key="filename" value="C:\Users\james.hickman\Desktop\RTStopWordDictionary.txt"/>
      </operator>
      <operator activated="true" class="text:read_document" compatibility="5.3.000" expanded="true" height="60" name="Read Document" width="90" x="179" y="75">
        <parameter key="file" value="C:\Users\james.hickman\Desktop\testdoc.txt"/>
        <parameter key="encoding" value="UTF-8"/>
      </operator>
      <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.3.000" expanded="true" height="76" name="Filter Stopwords (Dictionary)" width="90" x="380" y="120">
        <parameter key="file" value="C:\Users\james.hickman\Desktop\RTStopWordDictionary.txt"/>
      </operator>
      <connect from_op="Open File" from_port="file" to_op="Filter Stopwords (Dictionary)" to_port="file"/>
      <connect from_op="Read Document" from_port="output" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
      <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Answers

    awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    After you've read the document, it needs to be tokenized into sentences using the Tokenize operator with its mode parameter set to "linguistic sentences". Each "stop sentence" in the dictionary file needs to include its full stop.

    After the Filter Stopwords (Dictionary) operator, use a Process Documents operator to turn the document into an example set.

    Here's a simple document:
    This is a sentence.
    This is another sentence.
    Once upon a time.
    A glass of wine.
    Here's a simple stopword file containing sentences:
    This is a false sentence.
    This is another sentence.
    Here's a process to use them:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.007">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.007" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:read_document" compatibility="5.3.000" expanded="true" height="60" name="Read Document" width="90" x="112" y="75">
            <parameter key="file" value="c:\temp\sample.txt"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize" width="90" x="112" y="165">
            <parameter key="mode" value="linguistic sentences"/>
          </operator>
          <operator activated="true" class="open_file" compatibility="5.3.007" expanded="true" height="60" name="Open File" width="90" x="246" y="255">
            <parameter key="filename" value="C:\temp\stop.txt"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.3.000" expanded="true" height="76" name="Filter Stopwords (Dictionary)" width="90" x="380" y="165">
            <parameter key="file" value="C:\Users\james.hickman\Desktop\RTStopWordDictionary.txt"/>
            <parameter key="case_sensitive" value="true"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="94" name="Process Documents" width="90" x="514" y="210">
            <parameter key="vector_creation" value="Term Occurrences"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize (2)" width="90" x="246" y="30">
                <parameter key="mode" value="linguistic sentences"/>
              </operator>
              <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
              <connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
          <connect from_op="Open File" from_port="file" to_op="Filter Stopwords (Dictionary)" to_port="file"/>
          <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Andrew
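
For anyone who wants to check the expected result of the tokenize-then-filter approach described above without opening RapidMiner, here is a rough Python sketch of the same idea using the sample document and stop sentences from the answer. It is only an illustration, not how the operators are implemented: the function names `split_sentences` and `filter_stop_tokens` are made up for this example, the sentence splitting is a naive regex rather than RapidMiner's linguistic tokenizer, and the "untokenized" case assumes that a freshly read document behaves like one single token.

```python
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def filter_stop_tokens(tokens, stop_entries, case_sensitive=True):
    """Drop every token that exactly matches one of the dictionary entries."""
    if case_sensitive:
        stop = set(stop_entries)
        return [t for t in tokens if t not in stop]
    stop = {s.lower() for s in stop_entries}
    return [t for t in tokens if t.lower() not in stop]


# Sample document and stop sentences taken from the answer above.
document = (
    "This is a sentence.\n"
    "This is another sentence.\n"
    "Once upon a time.\n"
    "A glass of wine.\n"
)
stop_entries = ["This is a false sentence.", "This is another sentence."]

# Without tokenizing, the whole text is treated as one token, so no entry matches.
untokenized = filter_stop_tokens([document.strip()], stop_entries)

# After splitting into sentences, the one exactly matching sentence is removed.
kept = filter_stop_tokens(split_sentences(document), stop_entries)
print(kept)
```

With the sample data, only "This is another sentence." is dropped. Because the comparison is an exact match, a stop entry written without its full stop (for example "This is another sentence") would not remove anything in this sketch.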