Basic Text Mining From an Excel File
I would really appreciate some help / direction on how to tackle a basic text mining task. I have a spreadsheet with one column that I am interested in, titled "Hashtags". I would like to count the occurrences of each unique hashtag across that column and output the count for each one, using RapidMiner.
A single row might have several hashtags in one cell, separated by spaces. For example, row #1's value is: "12YearsASlave Oscars2014 AmericanHustle AcademyAwards2014". That cell contains FOUR hashtags, and each one should add 1 to its own hashtag's count. Hence, I will need to tokenize every row's value.
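To make the result I am after concrete, here is a minimal sketch of the counting logic in Python rather than RapidMiner. It is only an illustration of the desired output, not a RapidMiner process. It assumes hashtags within a cell are separated by whitespace; the function name `count_hashtags` and the sample rows (other than row #1) are made up for the example.

```python
from collections import Counter


def count_hashtags(cells):
    """Count each whitespace-separated hashtag across all cells.

    Assumes hashtags in a cell are separated by whitespace;
    empty or missing cells are skipped.
    """
    counts = Counter()
    for cell in cells:
        if not cell:
            continue  # skip empty strings and None
        counts.update(str(cell).split())
    return counts


# Hypothetical stand-in for the "Hashtags" column.
rows = [
    "12YearsASlave Oscars2014 AmericanHustle AcademyAwards2014",
    "Oscars2014",
    "",
    None,
]

counts = count_hashtags(rows)
for tag, n in counts.most_common():
    print(tag, n)
```

The output I want from RapidMiner is the equivalent of `counts`: one row per unique hashtag with its number of occurrences.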
If the tokenization is complex, I can skip it for now and treat each row as a single hashtag. My dataset is very large, so I can afford to drop the rows that have multiple hashtags in one cell just to get something working.
I tried using SelectAttributes, Tokenize and DataToDocument but I am hitting a wall.
Any help / direction is appreciated, and I hope this isn't too basic. Thanks for your help!