Mining Twitter - Data loops

timeitself Member Posts: 1 Learner III
edited November 2018 in Help
Hi all.
Working on my PhD dissertation, I downloaded ~5K tweets in JSON format, stored them in a MongoDB database, extracted the retweet graph data for analysis in Gephi/NodeXL, and extracted the text for a semantic analysis with RapidMiner.

The tweet texts are in a CSV file (I could export them in other formats as well), one tweet per row, for a total of ~5K rows.

I need to analyze every tweet to get something close to a semantic value. For a first round, that could be a list of the words in each tweet after tokenization, n-gram generation and stopword filtering. I will then derive a semantic value from those words (using word-based semantic distance).

I'm far from proficient in RapidMiner (my apologies!), and what I get when reading the CSV file is a single list of words across all the tweets, not a list per tweet.

I would probably need a loop that starts at the first row, processes it, and iterates through to the last row.
I couldn't find a way to use the loop operators properly ...
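
For illustration only, here is what that per-row loop looks like outside RapidMiner, as a minimal Python sketch. The stopword list, the token pattern, the bigram limit and the sample rows are all assumptions made for the example, and filtering stopwords before building n-grams is just one possible ordering.

```python
import csv
import io
import re

# Tiny illustrative stopword list (an assumption; use a fuller list in practice).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "on", "for", "rt"}


def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping # and @ prefixes."""
    return re.findall(r"[#@]?\w+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def process_tweet(text, max_n=2):
    """Tokenize one tweet, drop stopwords, then append n-grams up to max_n."""
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    terms = list(tokens)
    for n in range(2, max_n + 1):
        terms.extend(ngrams(tokens, n))
    return terms


def process_rows(rows):
    """Process every non-empty CSV row, assuming the tweet text is in column 0."""
    return [process_tweet(row[0]) for row in rows if row]


# Stand-in for the real CSV file: one tweet text per row.
sample_csv = io.StringIO("The cat is on the mat\nMining Twitter data is fun\n")
per_tweet_terms = process_rows(csv.reader(sample_csv))

for terms in per_tweet_terms:
    print(terms)
```

The result is one list of terms per tweet rather than one list for the whole file, which is the shape needed for a later word-based distance computation.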

Your help would be highly appreciated!

Thanks
Carlo

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Carlo,

    I suppose you are using the Process Documents from Data operator. Like any other Process Documents operator, it provides two outputs: the word list, which indeed delivers global statistics, and an example set, which contains the word counts for every single document (one row per tweet). If you switch the vector_creation parameter to Term Occurrences, you get absolute counts. For classification/regression tasks etc., however, you will usually use TF-IDF.

    Best regards,
    Marius
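
To make the difference between the global statistics and the per-document example set concrete, here is a small Python sketch. It is not RapidMiner code, and the TF-IDF formula used (term frequency times log(N/df)) is the textbook variant, chosen as an assumption; RapidMiner's exact weighting and normalization may differ. The two sample documents are invented.

```python
import math
from collections import Counter


def term_occurrences(docs):
    """Return the sorted vocabulary and one row of absolute counts per document."""
    vocab = sorted({term for doc in docs for term in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def tf_idf(docs):
    """Return the vocabulary and one row of textbook TF-IDF weights per document."""
    vocab, counts = term_occurrences(docs)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    weights = []
    for row in counts:
        total = sum(row) or 1  # avoid dividing by zero for an empty document
        weights.append([(c / total) * math.log(n_docs / df[j])
                        for j, c in enumerate(row)])
    return vocab, weights


# Two already-tokenized documents (invented sample data).
docs = [["cat", "mat"], ["cat", "dog", "dog"]]

vocab, occurrences = term_occurrences(docs)
global_counts = Counter(term for doc in docs for term in doc)  # whole-corpus statistics
_, weights = tf_idf(docs)

print(vocab)
print(occurrences)
print(dict(global_counts))
```

Each row of `occurrences` corresponds to one document, which is the per-tweet view asked for, while `global_counts` is the corpus-wide view. A term that appears in every document ("cat" here) gets a TF-IDF weight of zero under this formula.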