Calculating word frequency using RapidMiner
This article walks through a sample process for finding word frequencies in unstructured text.
The basic operators you need for building a process like this are:
- A data source (this example uses Twitter; click here to see details about how to use Twitter)
- "Nominal to Text". This changes the data type so that the "Process Documents..." operator can work on it. Please note that only columns of the "Text" data type are processed by the text mining extension.
- One of the "Process Documents..." operators, depending on what your data source is.
- Tokenize (splits documents into a sequence of tokens)
Please see the "basic word frequency.rmp" file attached to this article for a working example.
Your process would look like this:
The inside of "Process Documents from Data" will look like this:
The output will look something like this. (Please note that your words may differ for the exact same process, since it retrieves live Twitter data.) The word frequency, or WordList, output is delivered via the "Wor" port of the "Process Documents from Data" operator.
Total Occurences - Tells you how many times the word appeared across all the examples.
Document Occurences - Tells you the number of individual documents the word appeared in.
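To make the difference between the two columns concrete, here is a minimal Python sketch that computes both counts outside of RapidMiner. It is not what the operator runs internally; the function names `tokenize` and `word_list`, the letters-only splitting rule, and the sample documents are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: split on anything that is not an ASCII letter.
    return [token for token in re.split(r"[^A-Za-z]+", text) if token]


def word_list(documents):
    total = Counter()    # how many times a word appears overall
    in_docs = Counter()  # how many documents contain the word
    for doc in documents:
        tokens = tokenize(doc)
        total.update(tokens)
        in_docs.update(set(tokens))  # count each word once per document
    return {word: (total[word], in_docs[word]) for word in total}


# Hypothetical sample documents standing in for tweets.
docs = ["data mining with data", "Data tools", "text mining"]
counts = word_list(docs)

for word, (total_count, doc_count) in sorted(counts.items()):
    print(word, total_count, doc_count)
```

In this sample, "data" occurs twice in total but in only one document, while "Data" is counted as a separate word because no case handling has been applied yet.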
As you will notice in the output, there are several unwanted words: the same word treated as two different words because of a difference in case, common English words that you do not care about, and specific words that you may not be interested in. All of these cases can be handled by enriching the steps taken inside "Process Documents from Data". Your improved "Process Documents from Data" subprocess may look something like the one below.
Here are the reasons for using these operators:
- Filter Stopwords (English): This operator removes common English words like a, and, then.
- Transform Cases: Converts everything to one case, i.e. lower or upper.
- Filter Tokens (by Length): Removes words that are shorter or longer than the configured number of characters.
- Filter Stopwords (Dictionary): This operator provides the ability to drop specific words. The list can be provided as a simple text file with each word to ignore on its own line. See the attached sample file.
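The effect of these four steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the extension's implementation: `ENGLISH_STOPWORDS` is a tiny hand-picked subset, the helper names and the `min_chars`/`max_chars` values are hypothetical, and the sketch lowercases before filtering so that stopword matching ignores case.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative subset, not a full English stopword list.
ENGLISH_STOPWORDS = {"a", "and", "then", "the", "for", "is", "of"}


def load_dictionary(lines):
    # One word to ignore per line; blank lines are skipped.
    return {line.strip().lower() for line in lines if line.strip()}


def process_document(text, custom_stopwords, min_chars=3, max_chars=25):
    # Tokenize: split on anything that is not an ASCII letter.
    tokens = [t for t in re.split(r"[^A-Za-z]+", text) if t]
    # Transform cases: convert everything to lower case.
    tokens = [t.lower() for t in tokens]
    # Filter common English stopwords.
    tokens = [t for t in tokens if t not in ENGLISH_STOPWORDS]
    # Filter tokens by length (hypothetical limits).
    tokens = [t for t in tokens if min_chars <= len(t) <= max_chars]
    # Filter words listed in the user-supplied dictionary.
    return [t for t in tokens if t not in custom_stopwords]


# Hypothetical inputs: the list stands in for the lines of a dictionary file.
custom = load_dictionary(["tool", "", "amp"])
docs = ["The Data and the data", "Then a tool for RT data mining &amp; more"]

totals = Counter()
for doc in docs:
    totals.update(process_document(doc, custom))

print(totals.most_common())
```

With these inputs, "Data" and "data" are merged into a single entry, and the short token "RT" and the dictionary words "tool" and "amp" no longer appear in the counts.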