How best to analyse tweets? (Also help with rule association problem)

jdvstanton Member Posts: 3 Contributor I
edited December 2018 in Help

A colleague and I are currently carrying out clustering (k-means and DBSCAN) as well as association rule mining on about 30,000 tweets for a project (a rough sketch of the kind of pipeline we are using is below). Unfortunately, after many attempts we still get incoherent results, and despite our best efforts we have drawn few conclusions from the data.
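
For context, the pipeline is roughly of the following shape, sketched here in Python with scikit-learn purely for illustration (the actual work is done in RapidMiner, and the example tweets below are made up):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans, DBSCAN

    # Illustrative tweets; the real data set has ~30,000.
    tweets = [
        "flight delayed again at the airport",
        "great service on my flight today",
        "airport queues are terrible this morning",
        "loving the new lounge at the airport",
    ]

    # Turn tweets into TF-IDF vectors (the RapidMiner equivalent is Process Documents).
    vectors = TfidfVectorizer().fit_transform(tweets)

    # K-Means needs the number of clusters up front; DBSCAN finds clusters by density.
    kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
    dbscan_labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(vectors)

    print("k-means:", kmeans_labels)
    print("DBSCAN: ", dbscan_labels)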

 

Other than sentiment analysis, which I would like to carry out if I have time but which is rather difficult (so I have been told), what else could I do?

 

I am having particular difficulty with association rules: I managed to mine rules from the text, but I would also like to include the time each tweet was sent. Unfortunately, when I run the process, the rules contain only the attribute name "Time_sent" without the actual time values. How can I fix this?
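
For illustration, the behaviour I am after would look something like the sketch below, again outside RapidMiner (Python with pandas and mlxtend are assumptions used only to show the idea): the send time is discretized into coarse bins and one-hot encoded next to the word tokens, so that a concrete time bin such as Time_sent_morning can appear inside a rule.

    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    # Hypothetical example data: word indicators plus the raw send time of each tweet.
    tweets = pd.DataFrame({
        "word_flight": [1, 0, 1, 1, 0],
        "word_delay":  [1, 1, 0, 1, 0],
        "time_sent":   pd.to_datetime([
            "2018-03-01 08:15", "2018-03-01 13:40", "2018-03-01 19:05",
            "2018-03-02 07:55", "2018-03-02 22:30",
        ]),
    })

    # Discretize the send time into coarse bins so it becomes an "item"
    # that can actually appear inside a rule.
    bins = [0, 6, 12, 18, 24]
    labels = ["night", "morning", "afternoon", "evening"]
    tweets["time_bin"] = pd.cut(tweets["time_sent"].dt.hour, bins=bins,
                                labels=labels, right=False)

    # One-hot encode the time bins and join them to the word indicators.
    basket = pd.concat(
        [tweets[["word_flight", "word_delay"]].astype(bool),
         pd.get_dummies(tweets["time_bin"], prefix="Time_sent").astype(bool)],
        axis=1,
    )

    # Mine frequent itemsets and rules; rules can now contain e.g. Time_sent_morning.
    itemsets = apriori(basket, min_support=0.2, use_colnames=True)
    rules = association_rules(itemsets, metric="confidence", min_threshold=0.5)
    print(rules[["antecedents", "consequents", "support", "confidence"]])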

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    I do a lot of Twitter analysis with the Text Mining extension, and I use clustering and association rules quite a bit. A large row count shouldn't scare you away; it's all the tokens you generate that will slow the process down. Do you do a lot of pruning when you process? I spend a lot of time on data prep, and I selectively tokenize hashtags, links, and Twitter handles (see the sketch below).
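
    A rough illustration of what I mean by selective tokenization, sketched in Python with a single regular expression that keeps hashtags, @handles, and links as whole tokens (the pattern is an assumption for illustration, not the exact operator settings):

        import re

        # Keep URLs, @handles, and #hashtags intact; split everything else on word characters.
        TOKEN_PATTERN = re.compile(
            r"https?://\S+"   # links
            r"|@\w+"          # twitter handles
            r"|#\w+"          # hashtags
            r"|\w+"           # ordinary words
        )

        def tokenize(tweet: str) -> list[str]:
            return TOKEN_PATTERN.findall(tweet.lower())

        print(tokenize("Huge delays at @Heathrow again today #travelchaos https://t.co/abc123"))
        # ['huge', 'delays', 'at', '@heathrow', 'again', 'today', '#travelchaos', 'https://t.co/abc123']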

  • jdvstanton Member Posts: 3 Contributor I

    Hi Tom, we did spend a lot of time preparing the data. I am not sure how well we did, but we managed to reduce the number of word-attribute columns from about 7,000 to roughly 900-1,000 for every document we processed.

    I managed to make some sense of the association rules I generated, but unfortunately there does not seem to be much to say about the data.

     

    The hashtag is not a problem: the data we were given contained only one distinct hashtag, so we simply removed that attribute; luckily they were all related already. I use percentual pruning in the document-processing step (below percent = 0.09-0.1, above percent = 100); a rough equivalent is sketched at the end of this post.

     

    I feel, though, that I have made more progress than my colleague, who cannot make sense of the cluster data. I have tried to help him, but the data is quite strange and I am not sure what to suggest.

     

    Should I conduct sentiment analysis? Or is it not necessary?
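
    For reference, here is the rough equivalent of the pruning settings mentioned above, sketched with scikit-learn's CountVectorizer purely for illustration (min_df and max_df are fractions of documents rather than percentages, and the example tweets are made up):

        from sklearn.feature_extraction.text import CountVectorizer

        # Illustrative tweets; in the real process this would be ~30,000 documents.
        tweets = [
            "flight delayed again at the airport",
            "great service on my flight today",
            "airport queues are terrible this morning",
        ]

        # Roughly mirrors "prune below percent = 0.1, above percent = 100":
        # drop tokens appearing in fewer than 0.1% of documents, keep everything up to 100%.
        vectorizer = CountVectorizer(min_df=0.001, max_df=1.0, binary=True)
        word_vector = vectorizer.fit_transform(tweets)

        print(len(vectorizer.get_feature_names_out()), "word attributes kept")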

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    I guess the question is: what's the ultimate goal of this analysis? That will help determine which direction to take.
