RapidMiner

Sentiment Analysis as a supervised learning problem

by RMStaff on ‎07-02-2016 06:50 AM - edited on ‎09-21-2016 10:44 AM by Community Manager

 

 

RapidMiner provides multiple ways to do sentiment analysis. A commonly used and powerful approach is to treat it as a supervised learning problem: train a predictive model on a training set of historical content whose sentiment is already known. Such historical information may be available if content was manually coded with sentiment values in the past. If not, a preparation step is needed in which a good sample is manually classified as positive or negative sentiment. This is a one-time effort, and a good training set leads to better models and better predictions.

 

Please use this example along with the provided sample process (attached to this article as a zip file).

 

The training set we will use today is shown below. (It is also included in the zip file attached to this article.)

textmining training set.png

 

Building a model from this training set involves at least the following operators:

  • Read Excel (to read the sample data)
  • Nominal to Text (to specify which column is a text column, since RapidMiner's "Process Documents..." operators work only on text data)
  • Process Documents from Data (the container operator for most text processing capabilities)
  • Tokenize (to split the content into words, n-grams, etc., as needed)
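
To make the tokenization step concrete, here is a minimal Python sketch of the idea, not RapidMiner's implementation. It assumes tokens are runs of letters and that n-grams are joined with an underscore. The function names `tokenize` and `ngrams` are hypothetical and chosen for illustration only.

```python
import re

def tokenize(text):
    """Split the text into tokens, treating every non-letter as a separator."""
    return re.findall(r"[A-Za-z]+", text)

def ngrams(tokens, n):
    """Join every run of n consecutive tokens with an underscore (assumed convention)."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The service was not good")
print(tokens)
print(ngrams(tokens, 2))
```

Bigrams such as "not_good" keep a negation attached to the word it modifies, which is one reason n-grams can help in sentiment analysis.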

For processing the training text, the actual process will look like this:

supervised sentiment analysis.png

 

Inside the "Process Documents from Data" operator we will have a single step for the basic process, i.e., Tokenize.

 

supervised sentimenet basic.png

We will work on improving this subprocess later if needed.

The output of "Process Documents from Data" will be your tokenized example set as well as a wordlist.
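
As a rough illustration of these two outputs, the sketch below builds a wordlist from a few made-up documents and turns each document into a vector of term counts. It is a simplification under stated assumptions: the helper names (`build_wordlist`, `to_vector`) are hypothetical, the documents are invented, and plain counts are used instead of whatever weighting your process is configured for.

```python
import re
from collections import Counter

def tokenize(text):
    """Split the text into letter-only tokens."""
    return re.findall(r"[A-Za-z]+", text)

def build_wordlist(documents):
    """Return the sorted list of distinct tokens seen across all documents."""
    return sorted({tok for doc in documents for tok in tokenize(doc)})

def to_vector(text, wordlist):
    """Count how often each wordlist entry occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in wordlist]

# Made-up documents, for illustration only.
docs = ["good food good service", "bad service"]
wordlist = build_wordlist(docs)
vectors = [to_vector(doc, wordlist) for doc in docs]
print(wordlist)
print(vectors)
```

Each row of `vectors` corresponds to one example, and each position corresponds to one wordlist entry, which is roughly the shape a learner expects.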

 

Now we can build a cross-validation step using our tokenized example set. We will also need to add the "Set Role" operator to specify our label (i.e., target) variable.

The process should look something like this.

basic supervised add validation.png

 

To know more about validation, please look at these links

Inside the validation operator we can use any of the learners. For text mining use cases, Naive Bayes is often a good and fast choice. You can also try SVM or neural nets, but these increase the computational cost of the solution.
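
For readers who want to see what cross-validating a Naive Bayes learner amounts to, here is a small, self-contained Python sketch. It is not RapidMiner's implementation: it assumes a multinomial Naive Bayes with add-one smoothing, a simple interleaved fold split without stratification, and a made-up six-row training set. All function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into letter-only tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train_nb(rows):
    """rows is a list of (text, label) pairs; returns the counts the model needs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in rows:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def cross_validate(rows, k):
    """Hold out every k-th row in turn, train on the rest, return overall accuracy."""
    correct = 0
    for fold in range(k):
        test = rows[fold::k]
        train = [row for i, row in enumerate(rows) if i % k != fold]
        model = train_nb(train)
        correct += sum(predict_nb(model, text) == label for text, label in test)
    return correct / len(rows)

# Made-up training rows, for illustration only.
rows = [
    ("great product love it", "positive"),
    ("terrible service very bad", "negative"),
    ("love the great quality", "positive"),
    ("bad quality terrible product", "negative"),
    ("great service love it", "positive"),
    ("very bad terrible experience", "negative"),
]
print(cross_validate(rows, 3))
```

The important point is that each fold is scored by a model that never saw it during training, so the reported accuracy estimates performance on new text.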

 

The validation step delivers both the model and information about its performance: the "mod" port provides the model and the "ave" port provides the performance.

In our basic example, the confusion matrix and accuracy with Naive Bayes look like this:

basic performance.png

 

With SVM, our confusion matrix looks like this:

svm performance.png
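
If you want to check how the numbers in such a confusion matrix are derived, the following Python sketch computes one from lists of actual and predicted labels. The labels are made up and do not reproduce the results in the screenshots. The helper names are hypothetical.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(matrix):
    """Share of examples whose predicted label equals the actual label."""
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / sum(matrix.values())

def recall(matrix, label):
    """Share of examples of the given actual class that were predicted correctly."""
    actual_total = sum(n for (a, _), n in matrix.items() if a == label)
    return matrix[(label, label)] / actual_total

def precision(matrix, label):
    """Share of predictions of the given class that were correct."""
    predicted_total = sum(n for (_, p), n in matrix.items() if p == label)
    return matrix[(label, label)] / predicted_total

# Made-up labels, for illustration only.
actual    = ["positive", "positive", "negative", "negative", "positive"]
predicted = ["positive", "negative", "negative", "negative", "positive"]
matrix = confusion_matrix(actual, predicted)
print(matrix)
print(accuracy(matrix), recall(matrix, "positive"), precision(matrix, "negative"))
```

Comparing per-class recall and precision, not just overall accuracy, helps when deciding between learners such as Naive Bayes and SVM.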

 

We will explore how to improve the text processing in a later article. For now, let's assume this is a good model.

 

To use this predictive model, we run a similar process on the actual data set and then apply the model to the tokenized data set.

One additional step is needed: pass the wordlist from the training "Process Documents from Data" operator to the scoring "Process Documents from Data" operator, so that the scoring data is converted into the same attributes the model was trained on.

Your process will look something like this.

apply basic model.png
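
The reason for passing the wordlist along can be shown with a short Python sketch. It assumes a hypothetical three-word training wordlist and a made-up new document. The point is only that the new text is vectorized against the training vocabulary, so the columns line up with what the model expects.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into letter-only tokens."""
    return re.findall(r"[a-z]+", text.lower())

def to_vector(text, wordlist):
    """Count occurrences of each training wordlist entry; other words are dropped."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in wordlist]

# Hypothetical wordlist produced on the training side.
training_wordlist = ["bad", "good", "service"]

# Made-up new document to score.
new_text = "Good service, amazing staff"
vector = to_vector(new_text, training_wordlist)
print(vector)
```

Words such as "amazing" and "staff" that never appeared in training are ignored, because the model has no attribute for them.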

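The output of the Apply Model operator will have three special columns, as seen in the screenshot below:

  • prediction(Sentiment): the predicted class
  • confidence(negative): the model's confidence that the text is negative
  • confidence(positive): the model's confidence that the text is positive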

final output.png
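
To show how a prediction relates to the confidence columns, here is a small Python sketch. It assumes the model yields one log-score per class and that confidences are those scores normalized to sum to 1. The scores are invented, and the function names are hypothetical, so treat this as an illustration and not a description of RapidMiner internals.

```python
import math

def confidences(log_scores):
    """Turn per-class log-scores into confidences that sum to 1."""
    top = max(log_scores.values())
    # Subtracting the maximum before exponentiating avoids numeric underflow.
    weights = {label: math.exp(score - top) for label, score in log_scores.items()}
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}

def apply_row(log_scores):
    """Build one output row with the prediction and one confidence per class."""
    conf = confidences(log_scores)
    row = {"prediction(Sentiment)": max(conf, key=conf.get)}
    for label, value in conf.items():
        row[f"confidence({label})"] = value
    return row

# Invented log-scores for a single document.
row = apply_row({"negative": math.log(0.02), "positive": math.log(0.06)})
print(row)
```

The predicted class is simply the class with the highest confidence, so rows where the two confidences are close deserve a second look.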

You can then add additional text processing operators as your use case requires to improve your model.

 

A sample detailed "Process Documents from Data" subprocess with more preprocessing will look something like the one below.

Please ensure that you do the same steps on the scoring side to get correct results. Using Building Blocks is helpful here. 

 

 

detailed process mining.png