Mining Twitter - Data loops

timeitself · December 2012

Hi all.
For my PhD dissertation, I downloaded ~5K tweets in JSON format and stored them in a MongoDB database. From there I extracted the re-tweet graph data for analysis in Gephi/NodeXL, and the tweet text for semantic analysis with RapidMiner.

The tweet texts are in a CSV file (I could export them in other formats as well), one tweet text per row, for a total of ~5K rows.

I need to analyze every tweet to get something close to a semantic value. As a first pass, that could simply be the list of words for each tweet after tokenization, n-gram generation and stopword filtering. I will then derive a semantic value from those words (using a word-based semantic distance).

I'm far from proficient in RapidMiner (my apologies!), and what I get after reading the CSV file is a single list of words for all the tweets together, not one list per tweet.

I would probably need a loop that starts at the first row, processes it, and iterates until the last row.
I couldn't find a way to use the loop operators properly ...
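
To make the intended per-row processing concrete, here is a minimal Python sketch of the loop described above. It is not RapidMiner and not a definitive implementation: the stopword list, the sample rows, and the helper names (`tokenize`, `ngrams`, `process_tweet`, `process_rows`) are all assumptions made for illustration. It also assumes the tweet text sits in the first CSV column, and it filters stopwords before building bigrams, which is one possible ordering.

```python
import csv
import io
import re

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "on", "with"}


def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping a leading # or @."""
    return re.findall(r"[#@]?\w+", text.lower())


def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined with an underscore."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def process_tweet(text, max_n=2):
    """Tokenize one tweet, drop stopwords, then append n-grams up to max_n."""
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    terms = list(tokens)
    for n in range(2, max_n + 1):
        terms.extend(ngrams(tokens, n))
    return terms


def process_rows(lines):
    """Loop over the CSV rows and return one term list per tweet (first column)."""
    return [process_tweet(row[0]) for row in csv.reader(lines) if row]


# Made-up sample standing in for the ~5K-row CSV file.
sample = io.StringIO("Mining the Twitter graph\nRT @carlo: loops in RapidMiner\n")
per_tweet_terms = process_rows(sample)

for terms in per_tweet_terms:
    print(terms)
```

To run this on a real file, `process_rows` can be given an open file object instead of the `io.StringIO` sample, for example `open("tweets.csv", newline="", encoding="utf-8")`, where the file name is a placeholder.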

Your help would be highly appreciated!

Thanks
Carlo

MariusHelf · December 2012

Hi Carlo,

I suppose you are using the Process Documents from Data operator. Like any other Process Documents operator, it provides two outputs: the word list, which indeed delivers global statistics, and an example set, which contains the word counts for every single document. If you switch vector_creation to Term Occurrences, you get absolute counts. For classification, regression and similar tasks, however, you will usually use TF-IDF weighting.
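
To illustrate the difference between the two settings outside RapidMiner, here is a small Python sketch that builds per-document term occurrences and then a TF-IDF weighting from them. It uses the plain textbook formula (relative term frequency times the log of inverse document frequency); RapidMiner's own computation may normalize differently, so treat the numbers as illustrative only. The two sample documents and the `tf_idf` helper are assumptions, not part of the original process.

```python
import math
from collections import Counter

# Made-up token lists, one per document (tweet).
docs = [
    ["mining", "twitter", "graph"],
    ["twitter", "loops", "twitter"],
]

# Term occurrences: absolute counts for every single document.
occurrences = [Counter(doc) for doc in docs]

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))


def tf_idf(counts, n_docs):
    """Textbook TF-IDF: (count / document length) * log(n_docs / document frequency)."""
    total = sum(counts.values())
    return {
        term: (count / total) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


weights = [tf_idf(counts, len(docs)) for counts in occurrences]

for counts, w in zip(occurrences, weights):
    print(dict(counts), w)
```

A term that occurs in every document, such as "twitter" here, gets a weight of zero under this formula, which is why TF-IDF tends to suit classification better than raw counts.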

Best regards,
Marius

Altair RapidMiner Community
