"SOLVED: Simple word-count of wordlist from document."

casnl Member Posts: 5 Contributor II
edited June 2019 in Help
Hi all,

I want to count the number of occurrences of 'positive' words in a set of 200 .txt files in one directory. It seems simple, but I'm not getting the results I expect, and I think it has something to do with the way I'm reading my files.

Using the 'Process Documents from Files' operator, I'm reading 200 .txt files from a single directory. In a second operator I try to create a word list from one (downloaded) .txt file containing 'positive words', because I want to count the occurrences of these positive words in my files. Basic preprocessing is applied (transform cases, tokenize, Porter stemming and stopword filtering). However, the output yields 0 occurrences in all 200 texts, which seems impossible. What am I missing here?

Thanks in advance.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
   <process expanded="true">
     <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
       <list key="text_directories">
         <parameter key="Projects" value="/Users/casvanandel/Dropbox/Thesis/Data/ContentData/Projects"/>
       </list>
       <parameter key="create_word_vector" value="false"/>
       <parameter key="vector_creation" value="Term Frequency"/>
       <parameter key="keep_text" value="true"/>
       <parameter key="prune_below_absolute" value="2"/>
       <parameter key="prune_above_absolute" value="60"/>
       <process expanded="true">
         <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases (2)" width="90" x="45" y="30"/>
         <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="30"/>
         <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="179" y="30"/>
         <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (Porter)" width="90" x="313" y="30"/>
         <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="30"/>
         <connect from_port="document" to_op="Transform Cases (2)" to_port="document"/>
         <connect from_op="Transform Cases (2)" from_port="document" to_op="Transform Cases" to_port="document"/>
         <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
         <connect from_op="Tokenize" from_port="document" to_op="Stem (Porter)" to_port="document"/>
         <connect from_op="Stem (Porter)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
         <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files (2)" width="90" x="45" y="210">
       <list key="text_directories">
         <parameter key="Positive" value="/Users/casvanandel/Dropbox/Thesis/Data/opinion-lexicon-English/pos"/>
       </list>
       <parameter key="use_file_extension_as_type" value="false"/>
       <parameter key="vector_creation" value="Binary Term Occurrences"/>
       <parameter key="keep_text" value="true"/>
       <process expanded="true">
         <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases (4)" width="90" x="45" y="30"/>
         <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (3)" width="90" x="179" y="30"/>
         <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (3)" width="90" x="313" y="30"/>
         <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (3)" width="90" x="447" y="30"/>
         <connect from_port="document" to_op="Transform Cases (4)" to_port="document"/>
         <connect from_op="Transform Cases (4)" from_port="document" to_op="Tokenize (3)" to_port="document"/>
         <connect from_op="Tokenize (3)" from_port="document" to_op="Filter Stopwords (3)" to_port="document"/>
         <connect from_op="Filter Stopwords (3)" from_port="document" to_op="Stem (3)" to_port="document"/>
         <connect from_op="Stem (3)" from_port="document" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="246" y="120">
       <parameter key="vector_creation" value="Term Occurrences"/>
       <parameter key="keep_text" value="true"/>
       <list key="specify_weights"/>
       <process expanded="true">
         <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases (3)" width="90" x="45" y="30"/>
         <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="179" y="30"/>
         <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (2)" width="90" x="313" y="30"/>
         <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="447" y="30"/>
         <connect from_port="document" to_op="Transform Cases (3)" to_port="document"/>
         <connect from_op="Transform Cases (3)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
         <connect from_op="Tokenize (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
         <connect from_op="Stem (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
         <connect from_op="Filter Stopwords (2)" from_port="document" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <connect from_op="Process Documents from Files" from_port="example set" to_op="Process Documents from Data" to_port="example set"/>
     <connect from_op="Process Documents from Files (2)" from_port="word list" to_op="Process Documents from Data" to_port="word list"/>
     <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>

Answers

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    Set the "create word vector" check box on the first Process Documents from Files operator.

    regards

    Andrew
  • casnl Member Posts: 5 Contributor II
    Hi Andrew,

    Thanks for the feedback. However, the process failed with: "The attribute abund was already present in the example set".

    Also, I take it I have to create a term occurrences word vector in the first Process Documents from Files operator?

    Best,
  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    I had another look and realized what you were doing, so my original suggestion was incorrect. If you use the Set Role operator on the output of the first Process Documents from Files operator and set the role of the text attribute to regular, the final Process Documents operator should find the attribute and count the occurrences.
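
    A rough sketch of how the Set Role step might look in the process XML, written as a fragment rather than a complete process. It assumes the kept text attribute is named "text", that Set Role uses the "name" and "target_role" parameter keys and the "example set input" / "example set output" ports as in RapidMiner 5.x, and that the compatibility value matches the process version; none of this comes from the posted process, so check it against your own installation.

    ```xml
    <!-- Sketch only: the attribute name "text" and the parameter keys are assumptions -->
    <operator activated="true" class="set_role" compatibility="5.3.013" expanded="true" name="Set Role">
      <!-- Attribute whose role is changed -->
      <parameter key="name" value="text"/>
      <!-- Make it a regular attribute so the next operator can read it -->
      <parameter key="target_role" value="regular"/>
      <list key="set_additional_roles"/>
    </operator>
    <!-- These two connections would replace the direct connection between
         "Process Documents from Files" and "Process Documents from Data" -->
    <connect from_op="Process Documents from Files" from_port="example set" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    ```

    The existing word list connection into Process Documents from Data would stay as it is.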

    There is an alternative method that needs one fewer Process Documents operator. If you connect the word list output to the first Process Documents operator and, within that operator, enable vector creation with term occurrences, you should get the same answer.
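
    A minimal sketch of the alternative, again as a fragment and not a tested process. It reuses the operator names and parameter keys from the posted XML and assumes the word list input port of Process Documents from Files is called "word list", like the output port.

    ```xml
    <!-- Inside "Process Documents from Files" (the 200 project files):
         switch vector creation on and count raw term occurrences -->
    <parameter key="create_word_vector" value="true"/>
    <parameter key="vector_creation" value="Term Occurrences"/>

    <!-- Top-level connections: feed the positive word list into the first operator
         and send its example set to the result port. The input port name
         "word list" is an assumption. -->
    <connect from_op="Process Documents from Files (2)" from_port="word list" to_op="Process Documents from Files" to_port="word list"/>
    <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
    ```

    With this wiring the Process Documents from Data operator and its connections would no longer be needed.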


    regards

    Andrew
  • casnl Member Posts: 5 Contributor II
    Thanks for having another look! Helped me out.
    Cheers