Review_Classification

ArmMiner · September 2012

Hi

I want to classify reviews from online shops. To start, I copied the reviews into an Excel sheet so that every review is in a separate row. I want to have three labels (fast delivery, Package is fine, Will use again). The problem I face is that customers write reviews of several sentences, so I can't figure out which operators could be helpful in this case.
I hope that clarifies the problem.

Thanks in advance!

MariusHelf · September 2012

Hi,

you will have to use the Text and the Web Mining Extension. Please find some introductory tutorials on our tutorials page: http://rapid-i.com/content/view/189/198/

Best,
Marius

ArmMiner · September 2012

Thanks for the reply!
Actually, I have watched many videos on this topic and I've used the Text Mining extension. I would just like to know how I can use these really cool operators for my problem. At the moment I'm working with an Excel sheet in which I've put the customers' reviews (50 reviews) with the corresponding labels, and I treat it as the training set. Then I created another sheet with the test data (around 10 reviews), but without labels. Then I used the Default Model and Apply Model operators. The problem is that customers write reviews in different forms and lengths and use different symbols.
Any idea?
Thanks!

Skirzynski · September 2012

Hi

The solution for your problem is transforming your data into a word vector, e.g. with the "Process Documents"-operator. This means that you create a new attribute for every term in every text you have.

For every example (in your case a review) you can use the binary occurrence of a term (is this term contained in the review?) to set the value of the corresponding attribute. A more sophisticated way would be to use "term frequency" or the more commonly used "TFIDF" method to determine the attribute values. With this method you don't have to care about reviews consisting of several sentences of different lengths.
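To make the three weighting schemes above concrete, here is a minimal sketch in plain Python (not RapidMiner). The function names `tokenize` and `word_vectors` are made up for illustration, and the TF-IDF variant shown (relative term frequency times the log of inverse document frequency) is one common definition; the exact formula and normalization used by "Process Documents" may differ.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-zäöüß]+", text.lower())

def word_vectors(docs, mode="tfidf"):
    """Return (vocabulary, rows): one column per term, one row per document.

    mode is "binary", "tf" or "tfidf" (illustrative names, not RapidMiner's).
    """
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for toks in tokenized for t in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = []
        for term in vocab:
            tf = counts[term] / len(toks) if toks else 0.0
            if mode == "binary":
                row.append(1.0 if counts[term] else 0.0)
            elif mode == "tf":
                row.append(tf)
            else:
                row.append(tf * math.log(n_docs / df[term]))
        rows.append(row)
    return vocab, rows

docs = [
    "Everything was super. The delivery is very fast",
    "Fast delivery, will use again",
]
vocab, binary_rows = word_vectors(docs, mode="binary")
_, tfidf_rows = word_vectors(docs, mode="tfidf")
```

Every review ends up as a row of the same width no matter how many sentences it contains. Note that with TF-IDF a term occurring in every document (here "fast") gets weight zero, because it does not help to tell the documents apart.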

After this transformation you can apply a learner to learn from your labels. A good choice in text mining is the SVM.

I hope this helps!

ArmMiner · September 2012

Hi

Thanks for the help!
I tried to use "Process Documents", but the result is not what I expected. So I would like to describe the sequence of operators that I used.

ReadExcel ---> NominalToText ---> DataToDocuments ---> ProcessDocuments (inside I used Tokenize ---> FilterStopWords(Dictionary)).

Sorry, for so many questions!
Thanks!

Armen

Skirzynski · September 2012

The sequence of operators you describe seems correct. What do you mean by "not expected"?

ArmMiner · September 2012

I mean: once I have the table with the binary occurrences of each term as a result, what do I have to do next? For example, say one row of the sheet contains this review:
"Everything was super. The Delivery is very fast"
After the operators mentioned above, the result is a table in which every word of this review is in a separate column, together with the binary occurrences. That's fine, I think. Now, should I use the SVM in the next step? I'm not sure about that, because my classification has to have 3 labels.
In this example it would be "Fast delivery".
Sorry, I'm just a beginner in RapidMiner.

Thanks!

Skirzynski · September 2012

Yes, your next step is to use a learner. But you're not forced to use the SVM. After the transformation you have a numerical dataset on which you can use any learner you like, e.g. NaiveBayes. But there also exists a MultiClass-SVM operator (LibSVM).

After you have learned the model from your training data, you can transform your test data into word vectors in the same way and apply the model to them to predict the label.

Two more points:

Use TFIDF instead of binary occurrence
Don't forget to use stop-word filter. In your example these are words like "was" or "is" which are not helpful for a classification.
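As a rough illustration of "transform your test data in the same way" together with stop-word filtering, the sketch below (plain Python, not RapidMiner) builds the vocabulary from the training reviews only and reuses it for the test reviews, so both tables have identical columns. The stop-word set and the function names are assumptions made up for this example. In RapidMiner the equivalent is, I believe, passing the word list produced by the training "Process Documents" operator to the one that processes the test data.

```python
import re

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"was", "is", "the", "very"}

def tokens(text, stop_words=STOP_WORDS):
    """Lowercase, split into letter runs, and drop stop words."""
    return [t for t in re.findall(r"[a-zäöüß]+", text.lower())
            if t not in stop_words]

def fit_vocabulary(train_texts):
    """Collect the terms seen in the training data only."""
    return sorted({t for text in train_texts for t in tokens(text)})

def transform(texts, vocabulary):
    """Count term occurrences using a fixed vocabulary.

    Terms that never appeared in training are ignored, so training and
    test rows always have the same columns in the same order.
    """
    index = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for t in tokens(text):
            if t in index:
                row[index[t]] += 1
        rows.append(row)
    return rows

train = ["Everything was super. The delivery is very fast"]
test = ["Fast delivery, thanks"]

vocabulary = fit_vocabulary(train)
train_rows = transform(train, vocabulary)
test_rows = transform(test, vocabulary)
```

If the test data were vectorized on its own, its columns would not line up with what the model was trained on, and the predictions would be meaningless.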

ArmMiner · September 2012

Thanks a lot!
I used LibSVM, but as input it wants an example set that has the special attribute 'label'. My training set has this column, but the problem is that when I use the SetRole and SelectAttributes operators, the 'name' dropdown list of SetRole is empty. And if I just type label there and run the process, it takes an infinite amount of time.
Thanks!

Skirzynski · September 2012

You're welcome!

The name parameter is the name of the attribute whose role has to be changed. You have to enter the name of the label attribute, not the role (the role goes into the parameter "target role", where you have to select "label"). If you connect your ExampleSet to the input port of the "Set Role" operator, the meta data mechanism should offer you the available attribute names, but sometimes this does not work, for various reasons.

In your case I think there is no infinite loop; the learner is just still computing. You can check which operator is currently running in the status bar at the bottom of RapidMiner or by the symbol on the operator. If you see a green play button, this operator is currently crunching numbers.

ArmMiner · September 2012

Currently the SelectAttributes operator is running, but it seems this process is never going to finish. The message is the following:
"Cannot check whether input example set has special attribute 'label'".
Any idea how I can fix this problem? Maybe it's because my training data is in an Excel sheet that I can't see the names in the dropdown list.
Thanks!

Skirzynski · September 2012

What do you do with the "Select Attributes" operator? You can post your process's XML (please use the forum's code tags) and I can take a look at your process design.

ArmMiner · September 2012

I just watched videos from Neil Mcguigan, and he uses those operators there.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <parameter key="parallelize_main_process" value="true"/>
    <process expanded="true" height="415" width="762">
      <operator activated="true" class="read_excel" compatibility="5.2.008" expanded="true" height="60" name="Read Excel" width="90" x="45" y="75">
        <parameter key="excel_file" value="C:\Users\MP-TEST\Desktop\Training Data.xls"/>
        <parameter key="imported_cell_range" value="A1:C51"/>
        <list key="annotations"/>
        <parameter key="locale" value="German (Germany)"/>
        <list key="data_set_meta_data_information"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.2.008" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="75"/>
      <operator activated="true" class="text:data_to_documents" compatibility="5.2.004" expanded="true" height="60" name="Data to Documents" width="90" x="313" y="75">
        <list key="specify_weights"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.2.004" expanded="true" height="94" name="Process Documents" width="90" x="447" y="30">
        <process expanded="true" height="414" width="762">
          <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="112" y="30"/>
          <operator activated="true" class="text:filter_stopwords_german" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (German)" width="90" x="380" y="165">
            <parameter key="stop_word_list" value="Sentiment"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
          <connect from_op="Filter Stopwords (German)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.2.008" expanded="true" height="76" name="Select Attributes" width="90" x="581" y="30">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.2.008" expanded="true" height="76" name="Set Role" width="90" x="313" y="210">
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="support_vector_machine_libsvm" compatibility="5.2.008" expanded="true" height="76" name="SVM" width="90" x="447" y="210">
        <list key="class_weights"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.2.008" expanded="true" height="76" name="Apply Model" width="90" x="581" y="210">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="original" to_op="SVM" to_port="training set"/>
      <connect from_op="SVM" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="SVM" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Skirzynski · September 2012

I don't know the video, but he probably used the operator to filter out some unnecessary attributes. If your dataset just contains the label and the attribute with the text, you do not need this operator. Just use the "Set Role" operator, enter the name of your label attribute, and set it to the role "label".

ArmMiner · September 2012

Yeah, I'm trying, but there is a process failure:

The attribute 'Label' does not exist.
The example set does not contain an attribute with the given name.

Maybe I have to use a .csv file?

ArmMiner · September 2012

With the .csv file I get the same problem. If I type in the attribute name and run the process, it shows that the process is running, but it takes an infinite amount of time, and the messages also say that the attribute hasn't been found.
Any idea?
Thanks!

Skirzynski · September 2012

My only idea is to listen to the process failure message. If it says there is no attribute with the name "Label", there is probably no such attribute in your data. Set a breakpoint before the operator and look at the created example set. And keep in mind that there is a difference between 'normal' and 'special' attributes, and that you have to include special attributes explicitly.

ArmMiner · September 2012

Hi Marcin!
Thanks for the help, now the setRole operator works. I just want to know what my test data has to look like. I mean, in the training data I have the column of labels, but when I want to classify the test data, what should I write in the 'Name' field of setRole?
Thanks in advance!

Skirzynski · September 2012

If you do not want to measure the performance (i.e. how good your classification is), you do not need a label. So your test data has to look like the training data, just without the label column.

ArmMiner · September 2012

Ok, but what should be written in the fields of the setRole operator?

Skirzynski · September 2012

If you do not have an attribute whose role needs to be set, then you do not need this operator for the test data at all.

ArmMiner · September 2012

Ok, good.
I'm not sure how to go further. Actually, the confusing thing is that some reviews have parts that can be classified into all three classes.
For example, one of the reviews:
Everything is ok. Very fast delivery, Will use again!

So, this can be classified with all three classes (Everything is ok, fast delivery, will use again). And most of the reviews have this kind of structure. I was thinking of using the Cut Document operator, but I couldn't figure out how to use it for this purpose. Finally, when I just run my model, the predictions are all the same (e.g., all are Fast Delivery). The model looks at which label is used most frequently and predicts that. So, which operator would help me make the model look into the review and classify it correctly?
Maybe it's very simple, but I'm not familiar with these tools, which is why I'm confused about using the corresponding operators.

Thanks!

Skirzynski · September 2012

The easy way:
What you want to do is predict three different binary labels. I would suggest that you begin with a simple approach and take one label (e.g. 'Very fast delivery') and predict only whether it is true or not. That turns your multi-class problem into a traditional two-class problem. If this works, you can repeat it for the other two labels. In the end you will have three different models, each of which you have to apply and each of which yields one predicted label. Together these indicate whether everything is ok, whether it was a fast delivery, and whether the customer will use it again.
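The "one binary model per label" idea can be sketched outside RapidMiner as follows. This is only an illustration under assumptions: the four training reviews are invented, the helper names are hypothetical, and a small multinomial Naive Bayes with add-one smoothing stands in for whichever learner you pick (SVM, NaiveBayes, ...).

```python
import math
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-zäöüß]+", text.lower())

def train_binary_nb(texts, flags):
    """Multinomial Naive Bayes for a True/False target (add-one smoothing)."""
    counts = {True: Counter(), False: Counter()}
    for text, flag in zip(texts, flags):
        counts[flag].update(tokens(text))
    vocab = set(counts[True]) | set(counts[False])
    return counts, Counter(flags), vocab

def predict(model, text):
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    scores = {}
    for flag in (True, False):
        total_terms = sum(counts[flag].values())
        # Smoothed log prior, then smoothed log likelihood of each known term.
        score = math.log((docs[flag] + 1) / (total_docs + 2))
        for t in tokens(text):
            if t in vocab:
                score += math.log((counts[flag][t] + 1) / (total_terms + len(vocab)))
        scores[flag] = score
    return scores[True] > scores[False]

# Hypothetical training data: a review may carry several labels at once.
reviews = [
    ("Very fast delivery, thanks", {"fast delivery"}),
    ("Package is fine, delivery was fast", {"fast delivery", "package is fine"}),
    ("Good packaging, will use again", {"package is fine", "will use again"}),
    ("Will order again", {"will use again"}),
]
LABELS = ["fast delivery", "package is fine", "will use again"]

texts = [text for text, _ in reviews]
# One independent yes/no model per label.
models = {
    label: train_binary_nb(texts, [label in tags for _, tags in reviews])
    for label in LABELS
}

def classify(text):
    """Apply all three models; any combination of labels can be True."""
    return {label: predict(models[label], text) for label in LABELS}

result = classify("fast delivery")
```

Because each label gets its own yes/no decision, a review such as "Everything is ok. Very fast delivery, Will use again!" no longer has to be squeezed into a single class.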

ArmMiner · September 2012

Nice idea!
Thanks a lot for your patience!
I will try it and give feedback!

ArmMiner · September 2012

Hi
I've tried it with two labels, but the classification results are not satisfactory. As I mentioned in previous posts, my training data consists of 50 example reviews and my test data consists of 10 reviews. In the training data I've used the label 'Fast delivery' and the other label 'xxx', just to see which reviews belong to the class 'Fast delivery'. The problem is the following: in the training data, there are more reviews labeled 'Fast delivery' than reviews labeled 'xxx'. The model takes only this criterion into account, and as a result the whole test data is classified as 'Fast delivery', with a lot of wrong classifications.
You can take a look at the XML.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <parameter key="parallelize_main_process" value="true"/>
    <process expanded="true" height="415" width="762">
      <operator activated="true" class="read_excel" compatibility="5.2.008" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
        <parameter key="excel_file" value="C:\Users\MP-TEST\Desktop\Training Data - Schnell.xls"/>
        <parameter key="imported_cell_range" value="A1:C51"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations"/>
        <parameter key="locale" value="German (Germany)"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="ID.true.integer.id"/>
          <parameter key="1" value="Bewertung.true.text.attribute"/>
          <parameter key="2" value="Label.true.text.label"/>
        </list>
      </operator>
      <operator activated="true" class="read_excel" compatibility="5.2.008" expanded="true" height="60" name="Read Excel (2)" width="90" x="45" y="210">
        <parameter key="excel_file" value="C:\Users\MP-TEST\Desktop\Training Data - Schnell.xls"/>
        <parameter key="sheet_number" value="2"/>
        <parameter key="imported_cell_range" value="A2:B12"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations"/>
        <parameter key="locale" value="German (Germany)"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="ID.true.integer.id"/>
          <parameter key="1" value="Bewertung.true.text.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.2.008" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="120"/>
      <operator activated="true" class="text:data_to_documents" compatibility="5.2.004" expanded="true" height="60" name="Data to Documents" width="90" x="246" y="30">
        <list key="specify_weights"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.2.004" expanded="true" height="94" name="Process Documents" width="90" x="380" y="30">
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <process expanded="true" height="414" width="762">
          <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="112" y="40">
            <parameter key="language" value="German"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_german" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (German)" width="90" x="447" y="75">
            <parameter key="stop_word_list" value="Sentiment"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
          <connect from_op="Filter Stopwords (German)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.2.008" expanded="true" height="76" name="Set Role" width="90" x="514" y="120">
        <parameter key="name" value="Label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="default_model" compatibility="5.2.008" expanded="true" height="76" name="Default Model" width="90" x="581" y="30"/>
      <operator activated="true" class="apply_model" compatibility="5.2.008" expanded="true" height="76" name="Apply Model" width="90" x="581" y="210">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Read Excel (2)" from_port="output" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Default Model" to_port="training set"/>
      <connect from_op="Default Model" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
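One possible reading of the behavior described above: the process trains a "Default Model" operator rather than a learner such as LibSVM or NaiveBayes. Assuming that operator acts as a majority-class baseline for a nominal label (an assumption that matches the symptom, not something confirmed in this thread), it never looks at the word vector at all. The sketch below shows such a baseline in plain Python; the 30/20 split of the 50 training labels is invented for illustration.

```python
from collections import Counter

def train_majority_baseline(labels):
    """Return the most frequent training label; the features are ignored."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical split of the 50 training reviews.
train_labels = ["Fast delivery"] * 30 + ["xxx"] * 20

baseline = train_majority_baseline(train_labels)

# Every one of the 10 test reviews gets the same prediction,
# whatever its text says.
predictions = [baseline for _ in range(10)]
```

A baseline like this is useful as a reference point when measuring performance, but a learner that actually uses the term attributes is needed to get different predictions for different reviews.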
