RapidMiner

Analyze 77000 tweets

Learner III vittorio_confuo

Dear community,

I have to deal with a dataset of 77,000 tweets with the following attributes: post_id, username, hash_tag, sent_time, text, user_id, source, is_retweet, is_reply, lang, retweet_count, reply_count, latitude, longitude. I must do an analysis using association rules and clustering, but I'm new to RapidMiner (RM) and I hope someone can give me advice on how to proceed.

My first problem is the free license: I can read only 10,000 rows. Is there an operator that draws a representative sample?

Second problem: what kind of association rules can I use? I'm thinking of a "manual" sentiment analysis (I have seen that there is an Aylien extension, but it has limitations and doesn't work with Italian): is there a way to find the most important words in a tweet in order to do a positive/negative classification?

 

Can you suggest some association rule and/or clustering algorithms that I could use? How should I interpret their results?

 

I apologize for all these questions, and I would be very grateful if someone were kind enough to help me!

Regards, 

Vittorio Confuorto

20 REPLIES
Unicorn

Re: Analyze 77000 tweets

Hi Vittorio,

 

Regarding the license, I would suggest requesting a demo. It may be only a temporary solution, but it is the best way to get started and see whether the platform brings value to you. You may also be able to apply for an educational license.

 

https://rapidminer.com/contact-sales-request-demo/

 

https://rapidminer.com/educational-program/

 

Regarding the analysis, I also work with Twitter, and it's a very special case of text analysis. I think that clustering won't give you the results you want, because most of the words are just noise and vary a lot from tweet to tweet. My suggestion would be to train a sentiment model on another dataset and then apply it to the tweets. For that, you need to get your hands on labeled sentiment data in Italian.
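
To illustrate the "train on labeled data, then apply to the tweets" idea, here is a minimal sketch of a bag-of-words Naive Bayes classifier in plain Python. It is only one possible model, not necessarily what was meant above. The function names (`tokenize`, `train`, `predict`) are hypothetical, and the four training sentences are invented toy examples, not a real Italian sentiment corpus.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def train(labeled_texts):
    """Collect per-label word and document counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    doc_counts = Counter()              # label -> number of documents
    for text, label in labeled_texts:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label with the highest Naive Bayes log score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training data; a real model needs a labeled Italian dataset.
training = [
    ("ottimo telefono bellissimo", "positive"),
    ("servizio fantastico grazie", "positive"),
    ("telefono peggiore pessimo", "negative"),
    ("servizio terribile pessimo", "negative"),
]
model = train(training)

# Apply the trained model to (invented) unlabeled tweets.
tweets = ["telefono bellissimo grazie", "pessimo servizio terribile"]
predictions = [predict(model, t) for t in tweets]
print(predictions)
```

With real tweets you would also want to strip URLs, mentions and stop words before counting, since those carry little sentiment.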

 

Regards,

Sebastian

Re: Analyze 77000 tweets

Hi @vittorio_confuo,

 

As mentioned, Aylien has limitations and does not support Italian.

So I propose using a Python script with the "textblob" library.

This script translates the tweet from Italian to English and then extracts the sentiment (negative, neutral, positive):

-1 ≤ sentiment < -0.1 ==> negative

-0.1 ≤ sentiment < 0.1 ==> neutral

0.1 ≤ sentiment ≤ 1 ==> positive
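
The thresholds above can be sketched as a small Python function. The name `label_sentiment` is hypothetical. Values exactly on a boundary (-0.1 or 0.1) go to the upper class here, which is an assumption chosen to match the `if(sentiment<-0.1, ...)` expression used in the Generate Attributes operator of the process.

```python
def label_sentiment(polarity: float) -> str:
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if polarity < -0.1:
        return "negative"
    if polarity < 0.1:
        return "neutral"
    return "positive"


# A few example scores, including one exactly on the 0.1 boundary.
labels = [label_sentiment(p) for p in (-0.7, 0.0, 0.1, 0.45)]
print(labels)
```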

 

Spelling_Correction_5.png

The process:

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.2.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
        <parameter key="generator_type" value="comma_separated_text"/>
        <list key="function_descriptions"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="input_csv_text" value="Id,Text&#10;1,iphone telefono peggiore apple ha fatto ciao messaggio&#10;"/>
      </operator>
      <operator activated="true" class="set_macros" compatibility="8.2.001" expanded="true" height="82" name="Set Macros" width="90" x="246" y="34">
        <list key="macros">
          <parameter key="textAttribute" value="'Text'"/>
        </list>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="380" y="34">
        <parameter key="script" value="import pandas as pd&#10;from textblob import TextBlob&#10;&#10;# Name of the text column, injected by the Set Macros operator.&#10;textAttr = %{textAttribute}&#10;&#10;def translate(text):&#10;  # Translate the text to English; return a plain string so the column stays text.&#10;  blob = TextBlob(str(text))&#10;  return str(blob.translate(to = 'en'))&#10;&#10;def sent(text):&#10;  # Polarity score in [-1, 1] of the (translated) text.&#10;  blob = TextBlob(str(text))&#10;  return blob.sentiment.polarity&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data): &#10;  data['translate'] = data[textAttr].apply(translate)&#10;  data['sentiment'] = data['translate'].apply(sent)&#10;  # the returned DataFrame is delivered at output port 1&#10;  return data"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="8.2.001" expanded="true" height="82" name="Generate Attributes" width="90" x="514" y="34">
        <list key="function_descriptions">
          <parameter key="Sentiment" value="if(sentiment&lt;-0.1,&quot;negative&quot;,if(sentiment&lt;0.1,&quot;neutral&quot;,&quot;positive&quot;))"/>
        </list>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Set Macros" to_port="through 1"/>
      <connect from_op="Set Macros" from_port="through 1" to_op="Execute Python" to_port="input 1"/>
      <connect from_op="Execute Python" from_port="output 1" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

To execute this process, you have to:

 - install Python

 - install textblob (pip install textblob)

 - set your text attribute in the Set Macros parameters.
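
A quick way to check the first two steps is the following sketch, assuming you run it with the same Python interpreter that RapidMiner is configured to use (pip may otherwise have installed textblob into a different interpreter). The helper name `is_installed` is hypothetical.

```python
import importlib.util
import sys


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can find the given module."""
    return importlib.util.find_spec(module_name) is not None


# Show which interpreter is running and whether it can see textblob.
print("Python executable:", sys.executable)
print("textblob installed:", is_installed("textblob"))
```

If this prints `False`, installing with `python -m pip install textblob` using that same executable is one way to make sure the package ends up in the right place.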

 

I hope it helps

 

Regards,

 

Lionel

 

 

Learner III vittorio_confuo

Re: Analyze 77000 tweets

Hi @SGolbert, thank you very much. It is a good idea, but I don't have much time to get my hands on such a dataset. If I manage it (in the next week), I will post the result; maybe it can be useful to someone.

For the association rules, I thought of discretizing sent_time in order to find the most important topic for each time step (for example, every hour).
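
A minimal sketch of this hourly discretization in plain Python, covering only the preparation step and not the association rule mining itself. It uses the most frequent hashtag as a simple stand-in for "most important topic". The function name `top_hashtag_per_hour`, the timestamp format, the assumption of one hashtag per row, and the sample rows are all invented for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime


def top_hashtag_per_hour(rows, time_format="%Y-%m-%d %H:%M:%S"):
    """Bucket rows by the hour of sent_time and return the top hashtag per bucket."""
    counts = defaultdict(Counter)  # hour bucket -> hashtag frequencies
    for row in rows:
        sent = datetime.strptime(row["sent_time"], time_format)
        hour = sent.replace(minute=0, second=0, microsecond=0)
        if row["hash_tag"]:  # skip tweets without a hashtag
            counts[hour][row["hash_tag"].lower()] += 1
    return {hour: c.most_common(1)[0][0] for hour, c in sorted(counts.items())}


# Invented sample rows using the sent_time and hash_tag attributes.
rows = [
    {"sent_time": "2018-06-01 10:05:00", "hash_tag": "iphone"},
    {"sent_time": "2018-06-01 10:40:12", "hash_tag": "iPhone"},
    {"sent_time": "2018-06-01 10:59:59", "hash_tag": "apple"},
    {"sent_time": "2018-06-01 11:03:30", "hash_tag": "apple"},
    {"sent_time": "2018-06-01 11:20:00", "hash_tag": ""},
]
result = top_hashtag_per_hour(rows)
for hour, tag in result.items():
    print(hour, tag)
```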

For the size of the dataset, do you know if the "Sample (Stratified)" operator is a good choice?
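
For reference, the idea behind stratified sampling can be sketched in plain Python as follows. This is not how the RapidMiner operator is implemented; it is a simplified illustration that stratifies on the `lang` attribute. The function name `stratified_sample` and the toy data are hypothetical. Each stratum size is rounded down so the total never exceeds the requested size, which means very small strata can drop out entirely.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, key, sample_size, seed=42):
    """Draw about sample_size rows, keeping the proportions of row[key]."""
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    fraction = sample_size / len(rows)
    sample = []
    for group in strata.values():
        k = int(len(group) * fraction)  # round down: total stays <= sample_size
        sample.extend(rng.sample(group, k))
    return sample


# Toy data: 700 Italian and 70 English rows (invented proportions).
rows = [{"post_id": i, "lang": "it"} for i in range(700)]
rows += [{"post_id": 700 + i, "lang": "en"} for i in range(70)]

sample = stratified_sample(rows, key="lang", sample_size=100)
per_lang = Counter(row["lang"] for row in sample)
print(len(sample), dict(per_lang))
```

The same call with `sample_size=10000` on the full 77,000 rows would keep the sample within the row limit mentioned earlier, assuming the sampling happens before the data reaches the limited process.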

 

Thank you for your help, 

Vittorio Confuorto

Learner III vittorio_confuo

Re: Analyze 77000 tweets

Hi @lionelderkrikor, sorry, I have only just seen your answer.

I have some problems with the textblob installation. Can you tell me how to install it?

 

Thank you for your time, 

Vittorio Confuorto

Unicorn

Re: Analyze 77000 tweets

@lionelderkrikor thank you for that awesome python script! You just gave me so many ideas for application here! I have to work with this textblob library more!

Regards,
Thomas


Re: Analyze 77000 tweets

Hi @vittorio_confuo,

"I have some problem with textblob installation"

So that I can help you, can you be more specific?

 

Regards,

 

Lionel

 

Learner III vittorio_confuo

Re: Analyze 77000 tweets

Hi @lionelderkrikor,

I solved this problem, but I now have another one.

The process returns the following error:

Pic.jpeg

Do you know how I can solve it? 

You are very kind 

Regards, 

Vittorio Confuorto

 

 

Re: Analyze 77000 tweets

Hi @vittorio_confuo,

 

Can you share your dataset and your process so that I can reproduce the bug?

 

Regards,

 

Lionel

Re: Analyze 77000 tweets

Hi @Thomas_Ott,

 

You're welcome,

 

Happy sentiment analysis!

 

Regards,

 

Lionel