Sentiment Analysis as a supervised learning problem
RapidMiner provides multiple ways to do sentiment analysis. A commonly used and powerful approach is to train a predictive model on a training set of historical data. Such data may be available if content was manually coded with sentiment values in the past. If not, a preparation step is needed in which a good sample is manually classified as positive or negative sentiment. This is a one-time effort, and a good training set leads to better models and better predictions.
Please use this example along with the provided sample process (attached as a zip file with this article).
The training set we will use today is shown below. (It is also included in the zip file attached to this article.)
Building a model from this data involves at least the following operators:
- Read Excel (To read the sample data)
- Nominal to Text (To specify which column is a text column, since the RapidMiner "Process Documents..." operators work only on text data)
- Process Documents from Data (This is the meta process for most text processing capabilities)
- Tokenize (This will be used to tokenize the content into words, n-grams, etc., as needed)
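To make the Tokenize step concrete outside of RapidMiner, here is a minimal Python sketch. It assumes tokens are split on non-letter characters and lowercased, and that n-grams are joined with underscores; the helper names `tokenize` and `ngrams` are made up for this illustration and are not RapidMiner's implementation.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens on runs of non-letter characters."""
    return [t for t in re.split(r"[^a-zA-Z]+", text.lower()) if t]

def ngrams(tokens, n):
    """Return all n-grams of a token list, each joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example sentence, not taken from the sample data.
tokens = tokenize("The service was not good at all!")
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams)
```

Bigrams such as "not_good" can carry sentiment that the single words "not" and "good" lose when treated separately, which is one reason to go beyond plain word tokens.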
For the training text, the actual process will look like this.
Inside the "Process Documents from Data" operator we will have a single step for the basic process, i.e. Tokenize.
We will improve this subprocess later if needed.
The output of "Process Documents from Data" is the tokenized example set as well as a wordlist.
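Conceptually, these two outputs can be pictured as a list of distinct words plus one numeric row per document. The Python sketch below uses plain term counts for simplicity, which is an assumption made here; the operator also offers other weighting schemes such as TF-IDF. The function names and the two toy documents are hypothetical.

```python
from collections import Counter

def build_wordlist(token_lists):
    """Collect every distinct token across all documents, sorted alphabetically."""
    return sorted({tok for tokens in token_lists for tok in tokens})

def vectorize(tokens, wordlist):
    """Count how often each wordlist entry occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in wordlist]

# Two already-tokenized toy documents.
docs = [["good", "service"], ["bad", "service", "bad"]]

wordlist = build_wordlist(docs)                    # one column per word
vectors = [vectorize(d, wordlist) for d in docs]   # one row per document
print(wordlist)
print(vectors)
```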
Now we can build a cross validation step using our tokenized example set. We will also need to add the "Set Role" operator to specify our label (i.e. target) variable.
The process should look something like this.
To know more about validation, please look at these links
Inside the validation operator we can use any of the learners. For text mining use cases, Naive Bayes often works well and is fast. You can also try SVM or neural nets, but these increase the computational cost of the solution.
The validation step provides the model as well as information about its performance. The "mod" output provides the model and "ave" provides the performance.
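For readers who want to see the idea behind validating a Naive Bayes text classifier, here is a small self-contained Python sketch. It is an illustration under stated assumptions, not what the RapidMiner operators compute internally: it uses a multinomial Naive Bayes variant with add-one smoothing, a simple interleaved k-fold split without shuffling or stratification, and made-up toy documents. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model (counts only; smoothing is applied at prediction)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log-posterior for one tokenized document."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                # add-one (Laplace) smoothing
                score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def cross_validate(docs, labels, k):
    """Average accuracy over k folds; fold f tests on examples whose index i has i % k == f."""
    accuracies = []
    for fold in range(k):
        train_idx = [i for i in range(len(docs)) if i % k != fold]
        test_idx = [i for i in range(len(docs)) if i % k == fold]
        model = train_nb([docs[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict_nb(model, docs[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k

# Toy, already-tokenized documents (hypothetical, not the article's sample data).
docs = [
    ["good", "great", "service"], ["bad", "awful", "service"],
    ["great", "good", "food"],    ["awful", "bad", "food"],
    ["good", "great", "staff"],   ["bad", "awful", "staff"],
]
labels = ["positive", "negative"] * 3

mean_accuracy = cross_validate(docs, labels, k=3)
print(mean_accuracy)
```

The key point is that each fold's model is trained only on the examples outside that fold, so the averaged accuracy estimates how the model behaves on text it has not seen.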
For the basic example with Naive Bayes, the accuracy and confusion matrix look like this.
With SVM, the confusion matrix looks like this.
A later article will explore how to improve the text processing. For now, let's assume this is a good model.
To use this predictive model, we run a similar process on the actual data set and then apply the model to the tokenized data.
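As a reminder of how accuracy and a confusion matrix are derived from predictions, here is a short Python sketch. The label lists are invented for illustration and are not the results of the processes in this article; the function names are hypothetical.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Fraction of examples whose predicted label equals the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Made-up labels for five documents.
actual = ["positive", "positive", "negative", "negative", "negative"]
predicted = ["positive", "negative", "negative", "negative", "positive"]

matrix = confusion_matrix(actual, predicted)
acc = accuracy(actual, predicted)
for (a, p), n in sorted(matrix.items()):
    print(f"actual={a:8s} predicted={p:8s} count={n}")
print("accuracy:", acc)
```

The diagonal entries (actual equals predicted) are the correct predictions; the off-diagonal entries show which class is being confused with which.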
One additional step is needed: pass the wordlist from the training "Process Documents from Data" operator to the scoring "Process Documents from Data" operator.
Your process will look something like this.
The output of Apply Model will have three special columns, as seen in the screenshot below.
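The reason for passing the wordlist along is that the model expects exactly the columns it was trained on. The Python sketch below illustrates this with term counts; the wordlist, the new document, and the function name `to_vector` are all hypothetical.

```python
from collections import Counter

def to_vector(tokens, wordlist):
    """Count occurrences of each wordlist entry; words not in the list are dropped."""
    counts = Counter(tokens)
    return [counts[word] for word in wordlist]

# Wordlist produced while processing the (hypothetical) training documents.
training_wordlist = ["awful", "bad", "good", "great"]

# A new document to score; "food" and "price" were never seen in training.
new_tokens = ["great", "food", "great", "price"]

# Reusing the training wordlist keeps the columns aligned with the model.
scoring_vector = to_vector(new_tokens, training_wordlist)

# Building a fresh wordlist from the new data would yield different columns,
# which the trained model could not interpret.
fresh_wordlist = sorted(set(new_tokens))

print(scoring_vector)
print(fresh_wordlist)
```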
Prediction(Sentiment) - Actual class
You can then add additional text processing operators as needed for your use case to improve your model.
A more detailed "Process Documents from Data" subprocess with additional preprocessing will look something like the one below.
Please ensure that you do the same steps on the scoring side to get correct results. Using Building Blocks is helpful here.