In statistics-based text analysis, as opposed to natural language processing, it is easy to miss that words can carry a different meaning when grouped together. For example, strategy by itself is just a noun; there is no context involved. On the other hand, if it is paired with military, economic, or political, it has a far different meaning. Strategy, military strategy, economic strategy, and political strategy are all different ideas. Statistics-based text processing will not extract the context of these words, but it will tell you how many times strategy, military, economic, and political show up in your documents or data. This gives you information but lacks context. So the question is, "How do you extract this context via statistics-based text processing in RapidMiner?" The answer: Generate n-Grams (terms).
The operator’s algorithm is quite simple. Generate n-Grams (terms) will check for words that frequently follow one another. Following the example above, RapidMiner will pick out strategy and military as new attributes, each of which is a single word. Next it will say, often military is followed by strategy. Thus, it makes a new attribute, military_strategy. This selection can also be improved via pruning, which is covered in a note at the end of this guide. The result is now three attributes: strategy, military, and military_strategy. Without the machine understanding the context, it was still able to identify grouped words, from which the data scientist can now understand the context in which military strategy is used. The key here is that it bypasses the machine’s need to understand the language and pushes that task onto the user, while still maintaining groups of associated words within our data set.
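The idea of joining neighboring words into a new term can be sketched in a few lines of Python. This is only a rough illustration, not RapidMiner's implementation: the function name generate_ngrams is hypothetical, and the sketch assumes that every run of adjacent tokens up to the maximum length becomes a term, joined with an underscore.

```python
def generate_ngrams(tokens, max_length=2):
    """Return the original tokens plus every run of 2..max_length
    consecutive tokens, joined with underscores (hypothetical helper)."""
    terms = list(tokens)
    for n in range(2, max_length + 1):
        # Slide a window of size n over the token list.
        for i in range(len(tokens) - n + 1):
            terms.append("_".join(tokens[i:i + n]))
    return terms


tokens = ["a", "true", "military", "strategy"]
terms = generate_ngrams(tokens)
print(terms)
```

With a maximum length of two, the single words are kept and each adjacent pair, such as military_strategy, is added alongside them.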
You will need the Text Processing extension in order to use both the process documents operator and the generate n-grams operator. There are a few other operators you will need from this extension as well.
A guide to installing extensions including the text processing extension can be found here.
A text document. You can either create one or read in a text file. This guide covers the first option.
Step 1: Generating the Document
All that is needed here is to place a create document operator. Then go to the parameters window and click the 'Edit Text' button. This should pull up a window that will allow you to type or paste text to be pushed through the process documents operator for text processing. Any amount of text can be added, but these two sentences should do the trick:
"The distinction between our strategy and theirs is that ours is a true military strategy. Theirs is a poor excuse of military strategy that can be summed up by hit and run tactics with a side of cowardice."
This sample text will utilize the example we talked about above. This will allow RapidMiner to find the connection between military and strategy.
Step 2: Process Documents
The next step is to add the process documents operator. Once it's in place, there is a sub-process icon in the bottom right, which denotes that there is another level to the operator. Double-clicking will bring you into that level. Once there, a tokenize operator is needed. Then the "Any non-letters" option in the parameters tab needs to be selected. The operator and this setting together say, "generate a word whenever letter characters sit between two non-letters." For example, in " Space.", the empty space and the period are the two non-letters that generate the word Space. Next, a transform cases operator set to lowercase is also needed. This will ensure that "The" and "the" are pulled out as the same word.
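For readers who think in code, the tokenize and transform cases steps can be approximated as follows. This is a minimal sketch rather than what RapidMiner runs internally: the function names are hypothetical, and the pattern assumes plain ASCII letters, which is enough for the sample text in this guide.

```python
import re


def tokenize_non_letters(text):
    """Split on any run of non-letter characters and drop empty pieces."""
    # Assumption: only A-Z and a-z count as letters in this sketch.
    return [piece for piece in re.split(r"[^A-Za-z]+", text) if piece]


def to_lowercase(tokens):
    """Lowercase every token so "The" and "the" become the same word."""
    return [token.lower() for token in tokens]


tokens = to_lowercase(tokenize_non_letters("The strategy. the Space."))
print(tokens)
```

Splitting first and lowercasing second mirrors the order of the two operators inside the sub-process.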
Step 3: Generate n-Grams
The last operator needed is generate n-grams (terms). Here, two is a sufficient setting for the max length parameter. There is not a large amount of text to use, so there is no need for a length larger than two. The last task required is to connect the word list and the example set outputs to the result nodes on the right. Once this is hooked up, the process can be run by either pressing the run button or F11.
Notice that strategy, military, and military_strategy were all pulled out as separate attributes. This is the desired result. There are also term frequencies associated with each attribute.
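To see where these three attributes come from, the whole process can be imitated on the sample text in Python. Treat this as a sketch under stated assumptions: it counts raw occurrences, whereas the values RapidMiner reports depend on the vector creation setting of the process documents operator, and it assumes ASCII-only tokenization and a max length of two.

```python
import re
from collections import Counter

TEXT = (
    "The distinction between our strategy and theirs is that ours is a "
    "true military strategy. Theirs is a poor excuse of military strategy "
    "that can be summed up by hit and run tactics with a side of cowardice."
)

# Tokenize on non-letters, then lowercase (assumption: ASCII letters only).
tokens = [t.lower() for t in re.split(r"[^A-Za-z]+", TEXT) if t]

# Build two-word terms from each pair of neighboring tokens.
bigrams = ["_".join(pair) for pair in zip(tokens, tokens[1:])]

# Raw occurrence counts for single words and two-word terms together.
counts = Counter(tokens + bigrams)

for term in ("strategy", "military", "military_strategy"):
    print(term, counts[term])
```

In this sketch strategy is counted three times, while military and military_strategy are each counted twice, because one occurrence of strategy follows "our" rather than "military".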
On the process documents operator, there is a parameter for pruning. If this is set to absolute pruning with a threshold of two, then RapidMiner will only keep words that show up two times or more. This cuts down the final result and shows only frequent n-grams. For this case, a low pruning threshold generates the desired result, but pruning can be an extremely tedious task once there are thousands of words being processed.
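The effect of an absolute threshold can be pictured with a small Python sketch. It follows the description above (keep terms that appear at least a given number of times); the function name prune_below is hypothetical, the counts are raw occurrences from the sample text, and how RapidMiner itself counts occurrences for pruning may differ.

```python
def prune_below(counts, min_count):
    """Keep only the terms that occur at least min_count times."""
    return {term: n for term, n in counts.items() if n >= min_count}


# Raw occurrence counts for a few terms from the sample text.
counts = {
    "strategy": 3,
    "military": 2,
    "military_strategy": 2,
    "true_military": 1,
    "cowardice": 1,
}

kept = prune_below(counts, 2)
print(kept)
```

With a threshold of two, one-off terms such as true_military and cowardice are dropped, while strategy, military, and military_strategy remain.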