RapidMiner

Calculating word frequency using Rapidminer

by RMStaff on ‎06-29-2016 10:08 AM

 

This article talks about a sample process to find word frequency in unstructured text mining.

 

The basic operators you need for building a process like this are 

  • Some datasource (In the example we are using Twitter, click here to see details about how to use twitter)
  • Nominal to Text" . This is to change data type for the process document operator to work on. Please note that only the "Text" data type columsn are processed by the text mining extension.
  • One of the "Process Documents..." operator depending on what your data source is.
  • Tokenize (Splits documents into sequence of tokens)

 Please see the "basic word frequency.rmp" file attached with this article to see a working example

Your process would look like

basicprocess setup.png

Inside the Process Documents from Data will look like

wordfrequency.png

The output of this will look something like this. (Please notice that your words may appear different for the exact same process since it is actually getting the twitter data.The word frequency or the WordList output is delivered via the "Wor" port of the "Process Documents from Data" operator.

Total Occurences -  Tell you how many times the word appeared across all the examples. 

Document Occurences -  Tells you the number of individual documents the word appeared in.

 

basicwordfrequency.png

 

As you will notice in the output there are several unwanted words, or same words handled as two different words because of difference in cases, or there are commmon english words that you do not care about or some specific words that you may not  be interested in. All of these cases can then be handled by enriching the steps taken in "Process Documents from Data". Your improved "Process Documents from Data" sub process may look somehting like below 

improved  word frequency.png

 

 

Here are the reasons for using these operators 

  • Filter Stopwords(English) : This operator removes common english words like a, and, then.. 
  • Transform cases :basically converting everything to one case i.e lower of upper
  • Filter Tokens(By Length) : Removes word shorter than and longer than configured number of characters
  • Filter Stopwords(Dictionary) : This operator provides the ability to drop certain words. The list can be provided by a simple text file with each words to ignore on a new line. See sample attached file
Comments
jason
Contributor

This method has been very helpful. Can you please advise as to how to Filter the total number of occurances. 

 

For example: I want to elimnate words and phrases that only occur once in the document. (or twice, or ten times) So I only get high frequency words in my list. This would assit greatly in having a more managable list of examples. 

 

Thanks!

RMStaff
RMStaff

Dear Jason,

 

this is called pruning. If you have  a look on the options of the Process Documents operator you can see some ways to do it.

 

Best,

Martin

jesus_martinez_
Contributor

Very helpful and well explained.

I wonder whether I can also obtain multi-word occurences. That is, if the word "Super" is always followed by "Bowl" I would also like to obtain in my list the occurences of the term "Super Bowl". Same for other common expressions that are always repeated in my data, such as "nice job" or "well done".

 

Thanks in advance.

RMStaff
RMStaff

Dear Jesus,

 

it's called n-gram. If you add a n-gram operator after transform cases you well get exaclty these combinations as well. Combinations of length two are called 2-grams and are seperated by a _.

 

Best,

Martin