"SOLVED: Simple word-count of wordlist from document."

casnl · October 2013

Hi all,

I want to count the number of occurrences of 'positive' words in a set of 200 .txt files in one directory. It seems simple, but I'm not getting the results I expect, and I think it has something to do with the way I'm reading my files.

Using the 'Process Documents from Files' operator, I'm reading 200 .txt files from a single directory. In a second operator I create a word list from one (downloaded) .txt file containing 'positive words'; I want to count the occurrences of these positive words in my files. Basic preprocessing is applied (Transform Cases, Tokenize, Stem (Porter) and Filter Stopwords). However, the output shows 0 occurrences in all 200 texts, which seems impossible. What am I missing here?

Thanks in advance.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
        <list key="text_directories">
          <parameter key="Projects" value="/Users/casvanandel/Dropbox/Thesis/Data/ContentData/Projects"/>
        </list>
        <parameter key="create_word_vector" value="false"/>
        <parameter key="vector_creation" value="Term Frequency"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="60"/>
        <process expanded="true">
          <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases (2)" width="90" x="45" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="30"/>
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="179" y="30"/>
          <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (Porter)" width="90" x="313" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="30"/>
          <connect from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Stem (Porter)" to_port="document"/>
          <connect from_op="Stem (Porter)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files (2)" width="90" x="45" y="210">
        <list key="text_directories">
          <parameter key="Positive" value="/Users/casvanandel/Dropbox/Thesis/Data/opinion-lexicon-English/pos"/>
        </list>
        <parameter key="use_file_extension_as_type" value="false"/>
        <parameter key="vector_creation" value="Binary Term Occurrences"/>
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases (4)" width="90" x="45" y="30"/>
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (3)" width="90" x="179" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (3)" width="90" x="313" y="30"/>
          <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (3)" width="90" x="447" y="30"/>
          <connect from_port="document" to_op="Transform Cases (4)" to_port="document"/>
          <connect from_op="Transform Cases (4)" from_port="document" to_op="Tokenize (3)" to_port="document"/>
          <connect from_op="Tokenize (3)" from_port="document" to_op="Filter Stopwords (3)" to_port="document"/>
          <connect from_op="Filter Stopwords (3)" from_port="document" to_op="Stem (3)" to_port="document"/>
          <connect from_op="Stem (3)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="246" y="120">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <parameter key="keep_text" value="true"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases (3)" width="90" x="45" y="30"/>
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="179" y="30"/>
          <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (2)" width="90" x="313" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="447" y="30"/>
          <connect from_port="document" to_op="Transform Cases (3)" to_port="document"/>
          <connect from_op="Transform Cases (3)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
          <connect from_op="Stem (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Files (2)" from_port="word list" to_op="Process Documents from Data" to_port="word list"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

awchisholm · October 2013

Hello

Set the "create word vector" check box on the first Process Documents from Files operator.

regards

Andrew

casnl · October 2013

Hi Andrew,

Thanks for the feedback. However, the process failed with: "The attribute abund was already present in the example set".

Also, I take it I have to create a Term Occurrences word vector in the first Process Documents from Files operator?

Best,

awchisholm · October 2013

Hello

I had another look and realized what you were doing, so my original suggestion was incorrect. If you use the Set Role operator on the output of the first Process Documents from Files operator and set the role of the text attribute to regular, you should find that the final Process Documents from Data operator will find the attribute and count the occurrences.
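
For anyone reading along, the Set Role step described above might look roughly like this in the process XML. This is a sketch under assumptions, not a tested process: the `name` and `target_role` parameter keys and the `example set input` / `example set output` port names are recalled from RapidMiner 5.x and should be checked in the XML view after adding the operator in the GUI. The attribute name `text` comes from the description above, and the x/y layout values are made up.

```xml
<!-- Fragment only: goes inside the top-level <process>, between the first
     "Process Documents from Files" and "Process Documents from Data". -->
<!-- Assumed parameter keys for Set Role in 5.x; layout values are hypothetical. -->
<operator activated="true" class="set_role" compatibility="5.3.013" expanded="true" height="76" name="Set Role" width="90" x="179" y="30">
  <parameter key="name" value="text"/>
  <parameter key="target_role" value="regular"/>
  <list key="set_additional_roles"/>
</operator>

<!-- These two connections replace the direct connection from
     "Process Documents from Files" to "Process Documents from Data". -->
<connect from_op="Process Documents from Files" from_port="example set" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
```

The word list connection from Process Documents from Files (2) to Process Documents from Data stays as it is in the original process.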

There is an alternative method that needs one fewer Process Documents operator. If you connect the word list output of the second operator to the word list input of the first Process Documents from Files operator, and enable word vector creation with Term Occurrences in that operator, you should get the same answer.
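
The alternative could be sketched as the following changes to the posted process. The parameter keys and port names are the ones already present in the original XML. It assumes that Process Documents from Data and its three connections are removed, and it has not been run.

```xml
<!-- Inside the first operator, "Process Documents from Files":
     switch on vector creation and count raw occurrences. -->
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="Term Occurrences"/>

<!-- Top-level connections: feed the positive-word list into the first
     operator and send its example set straight to the result port. -->
<connect from_op="Process Documents from Files (2)" from_port="word list" to_op="Process Documents from Files" to_port="word list"/>
<connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
```

With a word list connected, the resulting example set should contain one column per word in the list, so each row gives the occurrence counts for one of the 200 files.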

regards

Andrew

casnl · October 2013

Thanks for having another look! Helped me out.
Cheers
