Sentiment Analysis as a supervised learning problem
RapidMiner provides multiple ways to do sentiment analysis. A commonly used and powerful approach is to treat it as a supervised learning problem: train a predictive model on a training set of historical content whose sentiment is already known, then apply that model to new content. Historical information may be available if content was manually coded with sentiment values in the past. If not, you will need a preparation step in which a good sample is manually classified as positive or negative. This is a one-time effort, and a good training set leads to better models and better predictions.
Please use this example along with the provided sample process (attached as a zip file with this article).
An example of the training set we will use today is shown below. (It is also included in the zip file attached to this article.)
The process to build a model from this data involves at least the following operators:
- Read Excel (to read the sample data)
- Nominal to Text (to specify which column is a text column, since RapidMiner's "Process Documents..." operators work only on text data)
- Process Documents from Data (the meta process for most text processing capabilities)
- Tokenize (to tokenize the content into words, n-grams, etc. as needed)
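To make the Tokenize step more concrete, here is a minimal Python sketch of splitting text into words and then into n-grams. It is only an illustration, not RapidMiner's implementation: the function names `tokenize` and `ngrams` are hypothetical, and splitting on non-letter characters and joining n-grams with an underscore are assumptions chosen for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it on runs of non-letter characters."""
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]

def ngrams(tokens, n):
    """Join each run of n consecutive tokens with an underscore."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The service was not good.")
print(tokens)
print(ngrams(tokens, 2))
```

Bigrams such as "not_good" are one reason n-grams can help with sentiment: the single word "good" alone would point in the wrong direction.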
The actual process for the training text will look like this.
Inside the "Process Documents from Data" operator we will have one step for the basic process, i.e. Tokenize.
We will later work on improving this subprocess if needed.
The output of "Process Documents from Data" will be your tokenized example set as well as a word list.
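The two outputs can be pictured with a small Python sketch: a word list built from all documents, and one numeric vector per document with a column per word. This is a simplified illustration under stated assumptions. It uses raw term counts, whereas the operator offers several vector types, and `build_wordlist` and `vectorize` are hypothetical helper names.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def build_wordlist(documents):
    """Sorted list of every distinct token seen in the documents."""
    return sorted({tok for doc in documents for tok in tokenize(doc)})

def vectorize(text, wordlist):
    """Term counts for one document, in word list order."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in wordlist]

docs = ["great phone", "bad battery bad screen"]  # made-up sample texts
wordlist = build_wordlist(docs)
vectors = [vectorize(doc, wordlist) for doc in docs]
print(wordlist)
print(vectors)
```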
Now we can build a cross validation step using our tokenized example set. We will also need to add the "Set Role" operator to specify our label (i.e. target) variable.
The process should look something like this.
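For readers new to cross validation, the idea can be sketched in a few lines of Python: the examples are divided into k folds, and each fold is held out once for testing while the model is trained on the rest. The validation operator does this splitting for you; `k_fold_indices` is a hypothetical helper, and the simple round-robin assignment of examples to folds is an assumption made for brevity.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_idx, test_idx) pairs; each example is tested exactly once."""
    # Round-robin assignment: example i goes to fold i % k.
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for held_out, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(6, 3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

The performance reported by cross validation is the average over all k held-out folds, which gives a more reliable estimate than a single train/test split.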
To know more about validation, please look at these links
Inside the validation operator we can use any learner. For text mining use cases, Naive Bayes is often a good and fast choice. You can also try SVM or neural nets, but these increase the computational cost of the solution.
The validation step provides the model as well as information about its performance: the "mod" port delivers the model and the "ave" port delivers the performance.
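To show why Naive Bayes is cheap to train on text, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing in plain Python. It is an assumption-laden illustration and not necessarily what RapidMiner's Naive Bayes operator computes internally; the function names and the tiny training set are made up.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (tokens, label). Training is just counting."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict(model, tokens):
    """Return the label with the highest log posterior score."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for tok in tokens:
            if tok in vocab:  # words unseen in training are ignored
                # Add-one (Laplace) smoothing avoids log(0).
                score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train_naive_bayes([
    (["good", "great"], "positive"),
    (["bad", "awful"], "negative"),
])
print(predict(model, ["great", "service"]))
```

Training needs only one pass over the data to collect counts, which is why it stays fast even with thousands of word attributes.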
In our case, for the basic example, the accuracy and confusion matrix when using Naive Bayes look like this:
When using SVM, our confusion matrix looks like this:
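If you want to check how the numbers in a confusion matrix relate to accuracy, the following Python sketch computes both from lists of actual and predicted labels. The labels here are made-up illustration data, not the results of this tutorial, and the helper names are hypothetical.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Share of examples whose predicted label equals the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Made-up labels for illustration only.
actual = ["positive", "positive", "negative", "negative"]
predicted = ["positive", "negative", "negative", "negative"]

matrix = confusion_matrix(actual, predicted)
for (a, p), count in sorted(matrix.items()):
    print(f"actual={a:8s} predicted={p:8s} count={count}")
print("accuracy:", accuracy(actual, predicted))
```

Accuracy is the sum of the diagonal cells (actual equals predicted) divided by the total number of examples.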
We will explore how to improve the text processing in a later article. For now, let's assume this is a good model.
To use this predictive model, we run a similar process on the actual data set and then apply the model to the tokenized data set.
One additional step is needed: pass the word list from the training "Process Documents from Data" operator to the scoring "Process Documents from Data" operator, so that the scoring data gets the same attributes the model was trained on.
Your process will look something like this.
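The reason for passing the word list along can be seen in a short Python sketch: the scoring text is turned into a vector using only the words known from training, so the columns line up with what the model expects. The word list and sentence below are made up, and `vectorize` is a hypothetical helper using raw counts.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def vectorize(text, wordlist):
    """Counts for the words in the given word list only, in word list order."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in wordlist]

# Hypothetical word list produced on the training side.
training_wordlist = ["bad", "good", "service"]

new_doc = "Good food, good service, terrible parking"
vector = vectorize(new_doc, training_wordlist)
print(vector)
```

Words that never appeared in training ("food", "terrible", "parking") are dropped, and the vector always has exactly one entry per training word. Building a fresh word list from the scoring data instead would produce different columns, and the model could not be applied reliably.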
The output from Apply Model will have three special columns, as seen in the screenshot below:
Prediction(Sentiment) - the predicted class
The other two are the confidence columns, one for each sentiment class.
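As a rough picture of what such a scored row contains, the Python sketch below turns per-class scores into a predicted class plus one confidence per class. The column names for the confidences, the helper names, and the normalization used here are assumptions for illustration; how confidences are actually computed depends on the learner.

```python
import math

def to_confidences(log_scores):
    """Turn per-class log scores into confidences that sum to 1."""
    top = max(log_scores.values())
    # Subtracting the maximum before exponentiating keeps the numbers stable.
    exp = {label: math.exp(score - top) for label, score in log_scores.items()}
    total = sum(exp.values())
    return {label: value / total for label, value in exp.items()}

def score_row(log_scores):
    """Build one output row: the predicted class plus a confidence per class."""
    conf = to_confidences(log_scores)
    row = {"Prediction(Sentiment)": max(conf, key=conf.get)}
    for label, value in conf.items():
        row[f"confidence({label})"] = value
    return row

# Made-up scores for a single document.
row = score_row({"positive": math.log(0.3), "negative": math.log(0.1)})
print(row)
```

The predicted class is simply the class with the highest confidence, so the confidence columns are useful for spotting borderline predictions.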
You can then add additional text processing operators as needed for your use case to improve your model.
A sample detailed "Process Documents from Data" with more preprocessing will look something like the one below.
Please ensure that you do the same steps on the scoring side to get correct results. Using Building Blocks is helpful here.
Thanks for the tutorial. Very helpful (y)
thank you so much for this post!
Sadly, some of the links you mention in the text are not included. I am fairly new to RapidMiner Studio and am trying to follow the process as described by you, with your test data. I reached the point "We will also need to add the "Set Role" operator to specify our Label (i.e target) variable." but sadly, RapidMiner Studio tells me that I need to choose an attribute name and a label, and that's not possible, as my data (after putting it through Process Documents from Data) looks like the picture below.
Have you got any solution to this issue? And is there a version of your article where the links are included?
I wanna thank you so much for your input! Your post is awesome!
Thanks for your response! The issue is that the attributes available are the ones you see in my picture above. When doing "Set Role" as described in the tutorial, the data I do it with is the one shown in the picture. From my understanding, though, it doesn't make sense to choose any attribute if the data you choose it from is the one in the picture.
That's why I don't get how to "Set Role", as it doesn't make sense with the data output.
So I'm asking myself where my mistake lies. I have worked with the data as described in the tutorial. No idea what the issue is.
Welcome to the RapidMiner Community!
The purpose of using Set Role for an attribute as a "label" is so that the algorithm later knows how to classify the data. Usually your Label attribute might be named "Sentiment" and have two values "Positive" and "Negative".
And, usually you have this attribute in your dataset BEFORE you pass the data into Process Documents from Data to extract all the new attributes from the text.
Maybe the trick here for you is to put your Set Role operator just before your Process Documents from Data operator and see if you can select an attribute for your label this time.
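The ordering idea can be sketched in Python: the label is marked and set aside first, only the text is turned into word attributes, and the label stays attached to each example afterwards. This is just an illustration of the data flow, not RapidMiner code; the column name "Text" and the sample rows are made up.

```python
import re
from collections import Counter

# Made-up rows; "Sentiment" is the label column, "Text" the text column.
rows = [
    {"Sentiment": "positive", "Text": "great phone"},
    {"Sentiment": "negative", "Text": "bad battery"},
]

# Step 1: mark the label before any text processing happens.
labels = [row["Sentiment"] for row in rows]
texts = [row["Text"] for row in rows]

# Step 2: turn only the text into word attributes.
wordlist = sorted({tok for text in texts for tok in re.findall(r"[a-z]+", text.lower())})
examples = []
for text, label in zip(texts, labels):
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    examples.append(({word: counts[word] for word in wordlist}, label))

print(examples[0])
```

Each example ends up with its word attributes and its label side by side, which is what the learner needs.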
Hope that helps!