Cleaning Twitter data
I'm new to RapidMiner and am struggling to understand how the Filter operators can be used to clean up Twitter feeds. I am importing the tweets from a CSV file and trying to create sub-processes inside the Process Documents operator to remove Twitter handles (@), RT markers and hashtags. For example, I have tried Filter Tokens by Content with the condition set to "contains" and the string @. The process runs without errors, but in the results the Twitter handles have not been removed. Can anybody please advise on how to go about cleaning up the data?
Answers
When you load the tweets from a CSV file, they come in as a Nominal data type. To use Filter Tokens by Content, you first need to convert the tweets to the Text data type with a Nominal to Text operator.
Here's a sample using the Search Twitter operator that does some cleaning.
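Separately from the RapidMiner process, it may help to see the kind of patterns such cleaning relies on. The following is only a rough Python sketch, not the RapidMiner sample itself: the regular expressions and the `clean_tweet` helper are assumptions made for illustration, and similar expressions would have to be adapted to whichever operator you use.

```python
import re

# Assumed patterns for illustration; adjust them to your own data.
RETWEET = re.compile(r"\bRT\b")     # the standalone retweet marker
HANDLE = re.compile(r"@\w+:?")      # @user, plus a trailing colon if present
HASHTAG = re.compile(r"#\w+")       # #topic
WHITESPACE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Remove the RT marker, handles and hashtags, then tidy whitespace."""
    for pattern in (RETWEET, HANDLE, HASHTAG):
        text = pattern.sub(" ", text)
    # Collapse the gaps left behind by the removals.
    return WHITESPACE.sub(" ", text).strip()


print(clean_tweet("RT @user: loving #RapidMiner for text mining"))
```

Note that this sketch drops hashtags entirely. If the hashtag word itself carries meaning for your analysis, you may prefer to strip only the # character instead.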