"[SOLVED] Empty Word List"

beedaan Member Posts: 4 Contributor I
edited June 2019 in Help
Hi All,

I am counting the occurrences of words in a txt document. The text document contains abstracts of other documents, along with each document's title. The general format of the file is as follows:

<document name>
<abstract>
<white space>
...

This continues for roughly 36,00 documents. The total size of the file is 46 MB. I expect to get a word list with word occurrences as a result, but what I actually get is an empty word list. Here is my process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
   <process expanded="true" height="641" width="1024">
     <operator activated="true" class="text:read_document" compatibility="5.2.004" expanded="true" height="60" name="Read Document" width="90" x="179" y="75">
       <parameter key="file" value="C:\Users\Administrator\Desktop\DTIC_RDF\sample.xml"/>
     </operator>
     <operator activated="true" class="text:process_documents" compatibility="5.2.004" expanded="true" height="94" name="Process Documents" width="90" x="447" y="75">
       <parameter key="create_word_vector" value="false"/>
       <parameter key="add_meta_information" value="false"/>
       <parameter key="keep_text" value="true"/>
       <parameter key="prune_method" value="absolute"/>
       <parameter key="prune_below_absolute" value="2"/>
       <parameter key="prune_above_absolute" value="9999"/>
       <process expanded="true" height="645" width="1024">
         <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="125" y="28"/>
         <operator activated="true" class="text:transform_cases" compatibility="5.2.004" expanded="true" height="60" name="Transform Cases" width="90" x="313" y="75"/>
         <connect from_port="document" to_op="Tokenize" to_port="document"/>
         <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
         <connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
     <connect from_op="Process Documents" from_port="word list" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>

I used this YouTube video as a guide: https://www.youtube.com/watch?feature=endscreen&NR=1&v=EjD2M4r4mBM

Please let me know what I am doing wrong.  Thanks.

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Heya,

    it might be helpful if you check the option "create word vector" in the Process Documents operator :)
    Additionally, you are reading only one document, but your pruning settings are configured to ignore words that appear in fewer than two documents, so every word gets pruned. For testing, I suggest disabling pruning.

    Happy mining,
    Marius
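
    As a rough sketch of the two suggestions above, the Process Documents operator in the posted process might be changed as shown below. This assumes that `none` is the value the `prune_method` parameter takes when pruning is switched off, and it leaves the inner Tokenize / Transform Cases subprocess as in the original post.

    ```xml
    <operator activated="true" class="text:process_documents" compatibility="5.2.004" expanded="true" height="94" name="Process Documents" width="90" x="447" y="75">
      <!-- Suggestion 1: enable "create word vector" -->
      <parameter key="create_word_vector" value="true"/>
      <parameter key="add_meta_information" value="false"/>
      <parameter key="keep_text" value="true"/>
      <!-- Suggestion 2: switch pruning off while testing with a single document
           (assumed value; the original used "absolute" with a lower bound of 2) -->
      <parameter key="prune_method" value="none"/>
      <process expanded="true" height="645" width="1024">
        <!-- Tokenize and Transform Cases operators and their connections, unchanged from the original process -->
      </process>
    </operator>
    ```

    With pruning switched off, the `prune_below_absolute` and `prune_above_absolute` entries are left out here; they could be restored once the process reads more than one document.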
  • beedaan Member Posts: 4 Contributor I
    Thanks for the help. This worked for me. I have a question, though: I first got it to work by creating a word vector, and then I got it to work again by not creating a word vector. In my results, I still had a word list. What does "create word vector" actually do?
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    If disabled, it should prevent the creation of the word vector. However, I have never disabled the option, because I see no reason why I would not create a wordlist.

    After changing options, it is generally a good idea to hit "enter" or click somewhere on the process pane to make sure that the changes are actually submitted. Maybe the options were not applied when you hit the run button (yes, this needs improvement  :-\ )

    Best, Marius
  • beedaan Member Posts: 4 Contributor I
    Thanks for the response.  I'm tinkering around with some of the text association features.  I am having issues with the program crashing.  I can tell you what I am doing to get these crashes if you are interested. 
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Of course we are interested in that, but please open a new thread for it. If you get a dialog with a "Submit Bug" button, you can also just click that button and describe everything in the dialog that pops up. That way the bug is submitted directly into our bug tracking system and won't get lost in the depths of the forum. Additionally, the bug report will contain some valuable information about the program state at the moment of the crash, which will greatly help us to fix it.
  • beedaan Member Posts: 4 Contributor I
    Great!  Thanks for the reply