Count wordlist occurrences from data

vincent Member Posts: 4 Contributor I
edited November 2018 in Help
Hi,

I want to use RapidMiner for sentiment analysis. I am currently struggling with what I presume is a very simple problem, but I have been unable to solve it.

I import data from a repository; one of the fields contains text. I also import multiple text files with 'Process Documents from Files', grouped by sentiment, such as positive and negative.

As a result I want to have something like this:
Text                   positive  negative
This is a bad text     0         1
This is a good text    1         0
That is, the number of positive and negative word occurrences for every text entry from the repository.

I currently use this but it does not work:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:process_document_from_file" compatibility="6.1.000" expanded="true" height="76" name="Process Documents from Files (2)" width="90" x="45" y="210">
        <list key="text_directories">
          <parameter key="Positive" value="/Users/vincent/Documents/uni/Jaar 4/Thesis/Test data/wordlist/pos"/>
          <parameter key="Negative" value="/Users/vincent/Documents/uni/Jaar 4/Thesis/Test data/wordlist/neg"/>
        </list>
        <parameter key="use_file_extension_as_type" value="false"/>
        <parameter key="vector_creation" value="Binary Term Occurrences"/>
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <operator activated="true" class="text:transform_cases" compatibility="6.1.000" expanded="true" height="60" name="Transform Cases (4)" width="90" x="45" y="30"/>
          <operator activated="true" class="text:tokenize" compatibility="6.1.000" expanded="true" height="60" name="Tokenize (3)" width="90" x="179" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="6.1.000" expanded="true" height="60" name="Filter Stopwords (3)" width="90" x="313" y="30"/>
          <operator activated="true" class="text:stem_porter" compatibility="6.1.000" expanded="true" height="60" name="Stem (3)" width="90" x="447" y="30"/>
          <connect from_port="document" to_op="Transform Cases (4)" to_port="document"/>
          <connect from_op="Transform Cases (4)" from_port="document" to_op="Tokenize (3)" to_port="document"/>
          <connect from_op="Tokenize (3)" from_port="document" to_op="Filter Stopwords (3)" to_port="document"/>
          <connect from_op="Filter Stopwords (3)" from_port="document" to_op="Stem (3)" to_port="document"/>
          <connect from_op="Stem (3)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="retrieve" compatibility="6.1.000" expanded="true" height="60" name="Retrieve wordcount test" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Template/wordcount test"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="6.1.000" expanded="true" height="76" name="Process Documents from Data" width="90" x="246" y="120">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <parameter key="keep_text" value="true"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:transform_cases" compatibility="6.1.000" expanded="true" height="60" name="Transform Cases (3)" width="90" x="45" y="30"/>
          <operator activated="true" class="text:tokenize" compatibility="6.1.000" expanded="true" height="60" name="Tokenize (2)" width="90" x="179" y="30"/>
          <operator activated="true" class="text:stem_porter" compatibility="6.1.000" expanded="true" height="60" name="Stem (2)" width="90" x="313" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="6.1.000" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="447" y="30"/>
          <connect from_port="document" to_op="Transform Cases (3)" to_port="document"/>
          <connect from_op="Transform Cases (3)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
          <connect from_op="Stem (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Files (2)" from_port="word list" to_op="Process Documents from Data" to_port="word list"/>
      <connect from_op="Retrieve wordcount test" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Sorry for the newbie question.

Thank you in advance for helping.

Vincent

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,412  RM Data Scientist
    Hi,

    You might want to have a look at this tutorial: http://vancouverdata.blogspot.de/2010/11/text-analytics-with-rapidminer-loading.html

    If it does not help, I can of course point you to additional resources.
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • vincent Member Posts: 4 Contributor I
    I watched the videos. They were helpful, but I could not find anything covering my specific problem. Am I missing something?
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,412  RM Data Scientist
    Do you want to do something like this:


    Table 1

    ID    Text
    1    acb
    2    def
    3    geh

    and

    Table 2

    ID    Sentiment
    1    good
    2    bad
    3    good

    and want to have a combined table? If so, try a join on ID.
  • vincent Member Posts: 4 Contributor I
    No, maybe I was unclear. I would like to have this:

    Input files:
    Repository
    text
    This is a good text
    This is a bad text
    The sentiment .txt files (loaded with 'Process Documents from Files'):
    Positive.txt
    good
    great
    awesome

    Negative.txt
    bad
    sad

    Output:
    Text                   Positive  Negative
    This is a good text    1         0
    This is a bad text     0         1
    Hopefully that makes my question a bit clearer.

    Thank you again for the help.
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,412  RM Data Scientist
    So you want to count the number of occurrences of the words from the dictionaries (Positive.txt, Negative.txt) in your file?

    If so, have a look here: http://rapid-i.com/rapidforum/index.php/topic,8638.msg29140.html I do something very similar there.

    This seems to be something quite a few people try to do. I might write a tutorial for it.
  • vincent Member Posts: 4 Contributor I
    Sorry for my late reaction.

    I think you understand what I would like to achieve. However, I do not see how to do it with the post you referred me to.

    Do you have a more specific solution?

    Vincent
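
To pin down the expected output discussed in this thread, here is a minimal sketch of the counting logic written in Python rather than as a RapidMiner process. It is not the solution from the linked forum post, only an illustration of the target table. The function names `tokenize` and `count_wordlist_hits` are hypothetical, the word lists are typed in directly instead of being read from Positive.txt and Negative.txt, and the sketch only lowercases and tokenizes: it does not apply the stopword filter or Porter stemmer used in the process above.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_wordlist_hits(texts, wordlists):
    """Return one row per text with the number of tokens found in each word list."""
    # Normalise every word list to a lowercase set for fast membership tests.
    lookup = {label: {w.lower() for w in words} for label, words in wordlists.items()}
    rows = []
    for text in texts:
        token_counts = Counter(tokenize(text))
        row = {"Text": text}
        for label, words in lookup.items():
            # Sum the occurrences of every token that appears in this word list.
            row[label] = sum(n for tok, n in token_counts.items() if tok in words)
        rows.append(row)
    return rows


# Example data taken from the thread (assumed contents of the two .txt files).
wordlists = {
    "Positive": ["good", "great", "awesome"],
    "Negative": ["bad", "sad"],
}
texts = ["This is a good text", "This is a bad text"]

rows = count_wordlist_hits(texts, wordlists)
for row in rows:
    print(row["Text"], row["Positive"], row["Negative"], sep="\t")
```

With the two example texts this prints a count of 1 and 0 for "This is a good text" and 0 and 1 for "This is a bad text", matching the table requested above. If stemming is needed, both the word lists and the text tokens would have to be stemmed the same way before matching.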