
"sentiment analysis using rapidminer"

sunnyfunghy Member Posts: 19 Contributor II
edited May 2019 in Help
Hi, everyone,
      I am using RapidMiner to do sentiment analysis, following the tutorial at this website:

http://kmandcomputing.blogspot.com/2008/06/opinion-mining-with-rapidminer-quick.html

I am using the latest version of RapidMiner.

Here is the code:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.004">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.004" expanded="true" name="Process">
    <process expanded="true" height="242" width="346">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.1.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="69" y="106">
        <list key="text_directories">
          <parameter key="positive" value="C:\Documents and Settings\sunny\Desktop\mix20_rand700_tokens\tokens\pos"/>
          <parameter key="negative" value="C:\Documents and Settings\sunny\Desktop\mix20_rand700_tokens\tokens\neg"/>
        </list>
        <parameter key="datamanagement" value="short_sparse_array"/>
        <process expanded="true" height="517" width="709">
          <operator activated="true" class="text:tokenize" compatibility="5.1.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.001" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="45" y="165"/>
          <operator activated="true" class="text:stem_porter" compatibility="5.1.001" expanded="true" height="60" name="Stem (Porter)" width="90" x="45" y="300"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.001" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="75"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
          <connect from_op="Stem (Porter)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="x_validation" compatibility="5.1.004" expanded="true" height="112" name="Validation" width="90" x="246" y="75">
        <process expanded="true" height="517" width="329">
          <operator activated="true" class="support_vector_machine_linear" compatibility="5.1.004" expanded="true" height="76" name="SVM (Linear)" width="90" x="89" y="49"/>
          <connect from_port="training" to_op="SVM (Linear)" to_port="training set"/>
          <connect from_op="SVM (Linear)" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true" height="517" width="329">
          <operator activated="true" class="apply_model" compatibility="5.1.004" expanded="true" height="76" name="Apply Model" width="90" x="100" y="45">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="5.1.004" expanded="true" height="76" name="Performance" width="90" x="45" y="165"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="model" to_port="result 1"/>
      <connect from_op="Validation" from_port="training" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>



The data set comes from this website:  http://www.cs.cornell.edu/People/pabo/movie-review-data/
I downloaded polarity dataset v0.9 for the sentiment analysis.

However, the process has been running for more than 2 days and is still going. What is wrong with it? I cannot understand why it needs more than 2 days. Is this normal? I look forward to hearing from you soon.


Many thanks,
Sunny

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi Sunny,

    no, this is actually not normal for a data set containing only 1400 examples with 22748 regular attributes and 4 special ones (including the label). The text preprocessing only needed a couple of seconds, but the training takes ages.

    Well, an SVM is usually pretty fast (compared to a neural net) for text analysis. However, if training is hard and many examples become support vectors, even SVMs tend to become slow sometimes. So the question was: why is the training so hard?

    Finding the answer took me about half an hour (who can I send the bill to?  ;) ): You changed the datamanagement parameter of the text processing from "double_sparse_array" to "short_sparse_array". Why the heck did you do that for TFIDF? As a consequence, all TFIDF values were 0 and hence nothing could be learned at all. This IS pretty hard, don't you think?

    Ok, I changed the datamanagement back to the sparse double representation and voilà: the process finished in less than 5 minutes and delivered a performance of about 76%.
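To see why a short representation can wipe out the features, here is a minimal Python sketch. It rests on two assumptions that are not taken from RapidMiner itself: that an integer-typed column keeps only the integer part of each value, and that the TFIDF weights are length-normalized so that they fall between 0 and 1. The helper names and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        raw = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Guard against an all-zero vector before normalizing.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors


def store_as_short(vector):
    """Hypothetical stand-in for integer-only storage: keep the integer part."""
    return {term: int(weight) for term, weight in vector.items()}


docs = [
    "great movie great cast".split(),
    "boring movie weak plot".split(),
    "great plot boring cast".split(),
]
as_double = tfidf_vectors(docs)
as_short = [store_as_short(v) for v in as_double]

print(as_double[0])  # fractional weights, usable for learning
print(as_short[0])   # the same weights after truncation
```

Under these assumptions every weight is a fraction below 1, so truncation leaves the learner with a table of zeros and identical examples in both classes.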

    By the way, you could try to prune the number of dimensions: activate pruning in the preprocessing and remove every term that occurs in fewer than 1% or in more than 90% of the documents. This reduces the number of dimensions from more than 20000 to about 4000 (most of them were simply crap). This already delivers a performance of about 83% in only 2 minutes, which should be fine for many practical applications, since in this time the complete preprocessing has been done and 11 models have been learned. Model application is faster anyway. If you want to improve further, you could optimize the preprocessing and filter out uncertain cases.
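The pruning idea can be sketched in a few lines of Python. This is an illustration of percentual pruning by document frequency, not RapidMiner's implementation; the function name is hypothetical, and treating both thresholds as inclusive is an assumption.

```python
from collections import Counter


def prune_by_document_frequency(docs, below_percent=1.0, above_percent=90.0):
    """Drop terms that occur in too few or too many documents.

    A term is kept if the share of documents containing it lies between
    below_percent and above_percent (both bounds inclusive, by assumption).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    keep = {
        term
        for term, df in doc_freq.items()
        if below_percent <= 100.0 * df / n_docs <= above_percent
    }
    pruned = [[term for term in doc if term in keep] for doc in docs]
    return pruned, keep


# Toy corpus: "movie" is in every document, "xyzzy" in only one of 200.
docs = []
for i in range(200):
    doc = ["movie", "good" if i % 2 == 0 else "bad"]
    if i == 0:
        doc.append("xyzzy")
    docs.append(doc)

pruned_docs, kept_terms = prune_by_document_frequency(docs)
print(sorted(kept_terms))
```

Terms that appear almost everywhere or almost nowhere carry little information about the class, which is why dropping them shrinks the attribute set without hurting accuracy much.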

    Another important thing: you forgot to connect the performance port ('ave', short for 'averagable') to the process results. It's bad to wait a couple of days, but it's even worse if you don't get the desired results then  ;)

    So, the only two things I have done were: 1) change the data management back to double and 2) connect the output port of the performance with a process result output port. The third thing I recommend is to activate pruning and try different preprocessing schemes.

    Hope that helps,
    Ingo

    Here is the new process:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
        <process expanded="true" height="242" width="413">
          <operator activated="true" class="text:process_document_from_file" compatibility="5.1.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
            <list key="text_directories">
              <parameter key="positive" value="XXX\tokens\pos"/>
              <parameter key="negative" value="XXX\tokens\neg"/>
            </list>
            <parameter key="prune_method" value="percentual"/>
            <parameter key="prunde_below_percent" value="1.0"/>
            <parameter key="prune_above_percent" value="90.0"/>
            <process expanded="true" height="517" width="709">
              <operator activated="true" class="text:tokenize" compatibility="5.1.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
              <operator activated="true" class="text:filter_by_length" compatibility="5.1.001" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="45" y="165"/>
              <operator activated="true" class="text:stem_porter" compatibility="5.1.001" expanded="true" height="60" name="Stem (Porter)" width="90" x="45" y="300"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.001" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="75"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
              <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
              <connect from_op="Stem (Porter)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="x_validation" compatibility="5.1.008" expanded="true" height="112" name="Validation" width="90" x="179" y="30">
            <process expanded="true" height="517" width="329">
              <operator activated="true" class="support_vector_machine_linear" compatibility="5.1.008" expanded="true" height="76" name="SVM (Linear)" width="90" x="45" y="30"/>
              <connect from_port="training" to_op="SVM (Linear)" to_port="training set"/>
              <connect from_op="SVM (Linear)" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true" height="517" width="329">
              <operator activated="true" class="apply_model" compatibility="5.1.008" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance" compatibility="5.1.008" expanded="true" height="76" name="Performance" width="90" x="179" y="30"/>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Process Documents from Files" from_port="example set" to_op="Validation" to_port="training"/>
          <connect from_op="Validation" from_port="model" to_port="result 1"/>
          <connect from_op="Validation" from_port="training" to_port="result 2"/>
          <connect from_op="Validation" from_port="averagable 1" to_port="result 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>
  • sunnyfunghy Member Posts: 19 Contributor II
    Hi, Ingo,
              Many thanks for your help. The problem is solved now.  ;D I would like to ask a further question. When I add more data, going from the original 2 directories (positive and negative) to 5 directories (very positive, positive, neutral, negative and very negative), I get an error message saying that the learner cannot handle a polynominal label. The Quick Fixes function provides several suggestions:
    1) convert label to binominal
    2) convert label to numerical
    3) add operator polynominal by binominal classification to predict a polynominal label using the binominal learner support vector machine (linear).
    4) add operator classification by regression to predict a nominal label using the regression learner support vector machine (linear).

    I chose suggestion 4 in the end. But it seems that the process takes a long time again. Is this the right approach, or is there another solution? I look forward to hearing from you soon. Thank you.

    Cheers,
    Sunny

     
  • sunnyfunghy Member Posts: 19 Contributor II
    Bump. Comments from anyone are welcome. ^^

    By the way, my computer has been running the above process for at least 21 hours since I added the "very positive", "neutral" and "very negative" directories. Is this normal?
  • sunnyfunghy Member Posts: 19 Contributor II
    Hi,
          I have used the above method, but with 20 directories. I used the quick fix "4) add operator classification by regression to predict a nominal label using the regression learner support vector machine (linear)". The run time is 16 hours and the accuracy is 46%. Is this weird, or is it normal? Can the run time be reduced? I look forward to hearing from you soon. Thank you.


    Cheers,
    Sunny
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    no, that's probably not weird. If you use this quick fix, you will end up with about 20 models learned instead of 1, and the learning problem might even get harder than before. In addition, the total number of texts has probably increased.
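A rough Python sketch of where the extra models come from. It assumes that the wrapper builds one numeric 1/0 target per class and trains one regression model on each; that reading of "about 20 models" is an assumption, not a description of RapidMiner internals, and the function name is hypothetical.

```python
def one_target_per_class(labels):
    """Build one numeric 1/0 target column per distinct class label."""
    classes = sorted(set(labels))
    return {
        cls: [1.0 if label == cls else 0.0 for label in labels]
        for cls in classes
    }


labels = ["pos", "neg", "neutral", "pos", "very_pos"]
targets = one_target_per_class(labels)

# One inner learner has to be trained for every entry of `targets`.
print(len(targets), "models needed")
print(targets["pos"])
```

With 20 directories the same decomposition gives 20 targets, so each validation fold trains 20 SVMs instead of one.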

    And 46% accuracy? Not too bad for 20 classes. Let's just assume that all classes are equally distributed; then simply guessing would correspond to 5%. In this sense, you are already much better than just guessing. I am not saying that this cannot be improved, I just mean that 16 hours and 46% for 20 classes are not necessarily a bad thing.
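The comparison with guessing is simple arithmetic, written out here in Python under the same assumption of equally distributed classes. The function name is made up for illustration.

```python
def chance_accuracy(n_classes):
    """Expected accuracy of uniform random guessing with balanced classes."""
    return 1.0 / n_classes


baseline = chance_accuracy(20)   # 20 directories, one class each
observed = 0.46                  # accuracy reported in the thread
lift = observed / baseline       # how many times better than guessing

print(f"baseline: {baseline:.0%}, observed: {observed:.0%}, lift: {lift:.1f}x")
```

If the classes are not balanced, the fairer baseline is the share of the largest class, which can be considerably higher than 1 divided by the number of classes.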

    I would suggest starting with simpler and faster learners, ones that can handle multiple classes themselves, before going to the more sophisticated ones. "Simplicity first" should be your guideline  ;)

    Cheers,
    Ingo
  • sukh Member Posts: 43 Contributor II
    Respected Sir,

    I am using the process for sentiment analysis on the Sentiment140 data, a CSV file (http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip) from the link below:
    http://help.sentiment140.com/for-students/
    The process runs fine, but the sentiment column is filled with zeros in every row instead of the numeric value known as the sentiment score.
    Secondly, I am unable to parse the date in this file. I want to get the date in milliseconds. I tried the Date to Numerical operator, but I get an error saying that it is unable to parse the given date format. The XML code for the process is given below:

    <code>
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.3.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="6.3.000" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
            <parameter key="repository_entry" value="//Local Repository/CSVSentiment"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="6.3.000" expanded="true" height="76" name="Set Role" width="90" x="179" y="30">
            <parameter key="attribute_name" value="Sentiment"/>
            <list key="set_additional_roles">
              <parameter key="SentimentText" value="regular"/>
            </list>
          </operator>
          <operator activated="true" class="wordnet:open_wordnet_dictionary" compatibility="5.3.000" expanded="true" height="60" name="Open WordNet Dictionary" width="90" x="45" y="165">
            <parameter key="directory" value="/home/sukh/Documents/WordNet-3.0/dict"/>
          </operator>
          <operator activated="true" class="remember" compatibility="6.3.000" expanded="true" height="60" name="Remember" width="90" x="45" y="255">
            <parameter key="name" value="wordnet"/>
            <parameter key="io_object" value="WordnetDictionary"/>
          </operator>
          <operator activated="true" class="subprocess" compatibility="6.3.000" expanded="true" height="76" name="Subprocess" width="90" x="45" y="345">
            <process expanded="true">
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="6.1.000" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="165">
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="6.1.000" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
              <operator activated="true" class="recall" compatibility="6.3.000" expanded="true" height="60" name="Recall" width="90" x="45" y="210">
                <parameter key="name" value="wordnet"/>
                <parameter key="io_object" value="WordnetDictionary"/>
                <parameter key="remove_from_store" value="false"/>
              </operator>
              <operator activated="false" class="wordnet:open_wordnet_dictionary" compatibility="5.3.000" expanded="true" height="60" name="Open WordNet Dictionary (2)" width="90" x="45" y="300"/>
              <operator activated="true" class="wordnet:find_sentiment_wordnet" compatibility="5.3.000" expanded="true" height="76" name="Extract Sentiment (English)" width="90" x="246" y="165">
                <parameter key="threshold" value="1.0"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Extract Sentiment (English)" to_port="document"/>
              <connect from_op="Recall" from_port="result" to_op="Extract Sentiment (English)" to_port="dictionary"/>
              <connect from_op="Extract Sentiment (English)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve" from_port="output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Open WordNet Dictionary" from_port="dictionary" to_op="Remember" to_port="store"/>
          <connect from_op="Remember" from_port="stored" to_op="Subprocess" to_port="in 1"/>
          <connect from_op="Subprocess" from_port="out 1" to_op="Process Documents from Data" to_port="word list"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    </code>

    Hoping for a positive response.

    Thanks and Regards:
    Sukh
  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello Sukh

    Nice to see you using the process I created ;)

    You are passing an attribute called Sentiment to the Process Documents step - I am not sure what will happen when this operator tries to add a Sentiment and finds there is already one.

    regards

    Andrew
  • sukh Member Posts: 43 Contributor II
    Sir, I am using the process you created, and I am very thankful to you for that. But now I am stuck, because I am using a different input file, downloaded from:
    https://docs.google.com/file/d/0B04GJPshIjmPRnZManQwWEdTZjg/edit?pli=1

    The results are unsatisfactory. In the previous case we got a sentiment score for each document; here I am getting zero in the column where the sentiment score is expected.
    Kindly help me with that.

    Thanks and Regards:
    Sukh
  • sukh Member Posts: 43 Contributor II
    Thanks a lot to all of you, especially Andrew. I was able to fix it.


    Regards:
    Sukh