Basic Text Mining From an Excel File

monamahfouz Member Posts: 4 Contributor I
Hi everyone,

I would really appreciate some help / direction on how to tackle a basic text mining task. I have a spreadsheet with one column that I am interested in, titled "Hashtags". Using RapidMiner, I would like to count the occurrences of each unique hashtag and output those counts.

A single row might have several hashtags in one cell. For example, row #1's value is "12YearsASlave Oscars2014 AmericanHustle AcademyAwards2014", which contains FOUR hashtags, and each of them should add one to the count for that hashtag. Hence, I will need to tokenize every row's value.

If the tokenization is complex, I can skip it for now and treat each row as one hashtag. My dataset is very large, so I can afford to ignore the rows that have multiple hashtags in one cell just to get something working.

I tried using Select Attributes, Tokenize and Data to Documents, but I am hitting a wall.

Any help / direction is appreciated, and hope this isn't too basic. Thanks for your help!
Mona

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869   Unicorn
    Hi Mona,

    You don't need any Text Processing operators (in the RapidMiner sense) at all. First, let's ignore the multi-tag rows:
    Load your data and add a Filter Examples operator with the attribute_value_filter "Hashtags != .* .*" (without the quotes). The regular expression .* .* matches any value that contains a space, so this keeps only the rows with a single hashtag.
    Then add an Aggregate operator. Group by Hashtags and add the aggregation function count for Hashtags. That's it :)

    Best regards,
    Marius
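
For readers who want to check the logic outside RapidMiner, here is a minimal Python sketch of the same idea. It is not RapidMiner code and not part of Marius's answer. The sample rows are made up for illustration (only the first value comes from the question), and the function names are hypothetical. The first function mirrors the filter-then-aggregate approach above; the second shows one possible way to handle multi-tag cells, assuming hashtags within a cell are separated by whitespace.

```python
import re
from collections import Counter

# Hypothetical sample of the "Hashtags" column.
# Only the first value is taken from the question; the rest are invented.
rows = [
    "12YearsASlave Oscars2014 AmericanHustle AcademyAwards2014",
    "Oscars2014",
    "Oscars2014",
    "AmericanHustle",
]

# Same pattern as the filter in the answer: any value containing a space.
MULTI_TAG = re.compile(r".* .*")


def count_single_tag_rows(values):
    """Drop rows that contain a space, then count each remaining value."""
    return Counter(v for v in values if not MULTI_TAG.fullmatch(v))


def count_all_tags(values):
    """Split every cell on whitespace so each hashtag is counted once per occurrence."""
    return Counter(tag for v in values for tag in v.split())


if __name__ == "__main__":
    print(count_single_tag_rows(rows))
    print(count_all_tags(rows))
```

The second function answers the original tokenization question directly: the four-tag row contributes one count to each of its four hashtags instead of being discarded.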