
[solved] Processing txt files to create a sentiment analysis from mined Twitter messages

S_Green84 Member Posts: 2 Contributor I
edited December 2019 in Help
Hi,
I am working on a project to create a (simple) sentiment analysis from several large text files (between 2 and 20 GB) of mined Twitter messages.
I have no computer science background and only found RapidMiner the other day. Now I am curious whether it can be used for my purpose.

All tweets are stored in a simple text file in the following format:
T 2009-06-07 02:07:41
U http://twitter.com/cyberplumber
W SPC Severe Thunderstorm Watch 339: WW 339 SEVERE TSTM KS NE 070200Z - 070800Z URGENT - IMMEDIATE BROADCAST REQUE.. http://tinyurl.com/5th9sw

I would like to create a sentiment index (positive / negative) for each individual day.
The basis for the sentiment index should be as simple as possible, so I thought I would just define a few adjectives for each end of the spectrum. Additionally, or as a separate index, I would like to count the positive and negative smileys in each tweet.
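To make the word-list idea concrete, here is a minimal sketch of such an index. The adjective/smiley lists and the sample tweets are invented for illustration; a real run would use curated word lists and the parsed tweet dates.

```python
from collections import defaultdict

# Invented placeholder word lists; a real index needs curated lists.
POSITIVE = {"good", "great", "happy", "awesome", ":)", ":-)", ":D"}
NEGATIVE = {"bad", "sad", "terrible", "awful", ":(", ":-("}

def tweet_score(text):
    """Return (#positive hits, #negative hits) for one tweet."""
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos, neg

def daily_index(tweets):
    """tweets: iterable of (date, text); returns {date: pos - neg}."""
    totals = defaultdict(int)
    for date, text in tweets:
        pos, neg = tweet_score(text)
        totals[date] += pos - neg
    return dict(totals)

sample = [
    ("2009-06-07", "great weather today :)"),
    ("2009-06-07", "terrible storm warning :("),
    ("2009-06-08", "happy monday everyone"),
]
print(daily_index(sample))  # {'2009-06-07': 0, '2009-06-08': 1}
```

Simple whitespace tokenization will miss smileys glued to words ("today:)"); a tokenizer that splits punctuation, like the one in RapidMiner's Tokenize operator, handles that better.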

As the dataset is 70 GB in total, I will probably have to create a (PostgreSQL) database first? I am currently trying to find a way to get the text files into a clean SQL table (a first for me!). Since my source is not a CSV (instead of commas it uses the letters T/U/W plus a tab as separators), I am also not quite sure how to do this.

So my general questions:
Is it possible to use RapidMiner to perform this kind of sentiment analysis?
Is there a viable way to point RapidMiner at these large text files directly and avoid creating an SQL table (which has the added difficulty of parsing the text files first)?
Which tutorials or articles would you recommend reading? (I found the Vancouver Data ones and they seem good.)

If somebody here is willing to "coach" me for a couple of hours to get me on track for my project in return for a small compensation ($20/hr), I would very much appreciate it. Just send me a message to exchange Skype details.

Thank you for reading!


Edit:

Ok, I used the following Python script to import the tweets into PostgreSQL:

#!/usr/bin/env python3

import sys
import psycopg2

db = psycopg2.connect(host="localhost", port=12345, database="db",
                      user="postgres", password="pw")


class Tweet:
    def __init__(self):
        self.date = None
        self.user = None
        self.text = None


def insert_into_db(tweet):
    print("insert", tweet.date, tweet.user, tweet.text)
    try:
        cursor = db.cursor()
        cursor.execute(
            "INSERT INTO tweets (timestamp, userid, tweet) VALUES (%s, %s, %s)",
            (tweet.date, tweet.user, tweet.text))
        db.commit()
    except Exception as e:
        print("ERROR", e)


current = Tweet()


def process_line(line):
    """Collect one T/U/W record; insert it once the W line arrives."""
    global current
    if line.startswith("T\t"):
        current.date = line[2:]
    elif line.startswith("U\t"):
        current.user = line[2 + len("http://twitter.com/"):]
    elif line.startswith("W\t"):
        current.text = line[2:]
        insert_into_db(current)
        current = Tweet()


# Iterating over the file object reads it lazily line by line, so even the
# 20 GB files never have to fit in memory. This also avoids the bug of
# fixed-size chunks splitting a record in the middle of a line.
with open(sys.argv[1]) as f:
    for line in f:
        process_line(line.rstrip("\n"))



And I use the following process structure in RapidMiner (taken from the bicortex example):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.007">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.007" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_database" compatibility="5.3.007" expanded="true" height="60" name="Read Database" width="90" x="45" y="120">
        <parameter key="connection" value="test_data"/>
        <parameter key="query" value="(SELECT &quot;tweet&quot;, &quot;id_sent&quot;, &quot;sentiment_boo&quot;&#10;FROM &quot;public&quot;.&quot;sentiment&quot;&#10;WHERE &quot;sentiment_boo&quot; = 't'&#10;limit 10000)&#10;union&#10;(SELECT &quot;tweet&quot;, &quot;id_sent&quot;, &quot;sentiment_boo&quot;&#10;FROM &quot;public&quot;.&quot;sentiment&quot;&#10;WHERE &quot;sentiment_boo&quot; = 'f'&#10;limit 10000)&#10;"/>
        <parameter key="table_name" value="Sample_Feeds"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.007" expanded="true" height="76" name="Set Role (3)" width="90" x="179" y="120">
        <parameter key="attribute_name" value="id_sent"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.3.007" expanded="true" height="76" name="Nominal to Text" width="90" x="246" y="30">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="tweet"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="120">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_above_percent" value="90.0"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.000" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.000" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="30"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.3.000" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="447" y="30">
            <parameter key="min_chars" value="3"/>
            <parameter key="max_chars" value="999"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.007" expanded="true" height="76" name="Set Role" width="90" x="447" y="120">
        <parameter key="attribute_name" value="sentiment_boo"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="x_validation" compatibility="5.3.007" expanded="true" height="112" name="Validation" width="90" x="581" y="120">
        <parameter key="number_of_validations" value="5"/>
        <process expanded="true">
          <operator activated="true" class="select_attributes" compatibility="5.3.007" expanded="true" height="76" name="Select Attributes" width="90" x="45" y="30">
            <parameter key="attribute_filter_type" value="no_missing_values"/>
            <parameter key="attribute" value="text"/>
          </operator>
          <operator activated="true" class="nominal_to_binominal" compatibility="5.3.007" expanded="true" height="94" name="Nominal to Binominal" width="90" x="179" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="sentiment_boo"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="support_vector_machine_linear" compatibility="5.3.007" expanded="true" height="76" name="SVM (Linear)" width="90" x="179" y="210"/>
          <connect from_port="training" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Binominal" to_port="example set input"/>
          <connect from_op="Nominal to Binominal" from_port="example set output" to_op="SVM (Linear)" to_port="training set"/>
          <connect from_op="SVM (Linear)" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="5.3.007" expanded="true" height="76" name="Apply Model" width="90" x="45" y="75">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="5.3.007" expanded="true" height="76" name="Performance" width="90" x="179" y="120"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="read_database" compatibility="5.3.007" expanded="true" height="60" name="Read Database (2)" width="90" x="45" y="210">
        <parameter key="connection" value="test_data"/>
        <parameter key="query" value="SELECT &quot;feed&quot;, &quot;id_serial&quot;&#10;FROM &quot;public&quot;.&quot;test_data&quot;&#10;limit 100"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.007" expanded="true" height="76" name="Set Role (4)" width="90" x="179" y="210">
        <parameter key="attribute_name" value="id_serial"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.3.007" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="246" y="345">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="feed"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="313" y="210">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_above_percent" value="90.0"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize (2)" width="90" x="112" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.000" expanded="true" height="60" name="Transform Cases (2)" width="90" x="246" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.000" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="380" y="30"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.3.000" expanded="true" height="60" name="Filter Tokens (2)" width="90" x="514" y="30">
            <parameter key="min_chars" value="3"/>
            <parameter key="max_chars" value="999"/>
          </operator>
          <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.007" expanded="true" height="76" name="Set Role (2)" width="90" x="447" y="210">
        <parameter key="attribute_name" value="text"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.3.007" expanded="true" height="76" name="Apply Model (2)" width="90" x="648" y="345">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Read Database" from_port="output" to_op="Set Role (3)" to_port="example set input"/>
      <connect from_op="Set Role (3)" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Validation" from_port="training" to_port="result 1"/>
      <connect from_op="Validation" from_port="averagable 1" to_port="result 2"/>
      <connect from_op="Read Database (2)" from_port="output" to_op="Set Role (4)" to_port="example set input"/>
      <connect from_op="Set Role (4)" from_port="example set output" to_op="Nominal to Text (2)" to_port="example set input"/>
      <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
