Clustering of the Text

gunjanamit Member Posts: 28 Contributor II
edited November 2018 in Help
I want to cluster survey comments into categories, for example:

Comment                       Category

Restrooms Stinks              FMG
Food was costly               Restaurant
Poor service in restaurant    Restaurant

I want to read the comments from Excel and write them back to Excel with the category.

Can anyone please suggest how to do this?


  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn

    If you already know which categories you are looking for, you should label your training data manually with these categories and then train a classification algorithm on it. A good choice for text processing could be an SVM.
    If you can't or don't want to label your data, just run a clustering algorithm like k-Means on your preprocessed documents, and have a look at the clusters afterwards to see whether they make sense for you.

    Best, Marius
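
To make the first suggestion (label the data, then train a classifier) concrete, here is a minimal sketch in plain Python rather than RapidMiner. It is not an SVM: as a stand-in it uses a much simpler approach that builds one word-count profile per category and picks the category whose profile has the highest cosine similarity. The training rows are the three labeled comments from the question; the stop-word list, the function names, and the two test comments are assumptions made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical minimal stop-word list; a real one would be much longer.
STOPWORDS = {"the", "was", "were", "in", "is", "a"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(labeled_comments):
    """Sum the word counts of all comments per category (one profile each)."""
    profiles = defaultdict(Counter)
    for comment, category in labeled_comments:
        profiles[category].update(tokenize(comment))
    return dict(profiles)

def classify(profiles, comment):
    """Return the category whose profile is most similar to the comment."""
    vector = Counter(tokenize(comment))
    return max(profiles, key=lambda category: cosine(vector, profiles[category]))

# Labeled comments taken from the question.
training = [
    ("Restrooms Stinks", "FMG"),
    ("Food was costly", "Restaurant"),
    ("Poor service in restaurant", "Restaurant"),
]
profiles = train(training)

# Hypothetical unseen comments.
print(classify(profiles, "The restaurant food was cold"))
print(classify(profiles, "Restrooms were dirty"))
```

With only three training rows, a new comment that shares no word with any profile cannot be classified meaningfully, which is why a larger manually labeled set is needed in practice.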
  • gunjanamit Member Posts: 28 Contributor II
    I have followed the process below:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.006">
      <operator activated="true" class="process" compatibility="5.2.006" expanded="true" name="Process">
        <process expanded="true" height="252" width="681">
          <operator activated="true" class="read_excel" compatibility="5.2.006" expanded="true" height="60" name="Read Excel" width="90" x="45" y="75">
            <parameter key="excel_file" value="C:\Users\guagg\Desktop\All\RapidMiner\read.xls"/>
            <parameter key="imported_cell_range" value="A1:A6"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information"/>
          </operator>
          <operator activated="true" class="k_means" compatibility="5.2.006" expanded="true" height="76" name="Clustering" width="90" x="313" y="75">
            <parameter key="add_as_label" value="true"/>
            <parameter key="remove_unlabeled" value="true"/>
            <parameter key="k" value="3"/>
            <parameter key="measure_types" value="NominalMeasures"/>
            <parameter key="nominal_measure" value="RussellRaoSimilarity"/>
            <parameter key="divergence" value="GeneralizedIDivergence"/>
          </operator>
          <operator activated="true" class="numerical_to_binominal" compatibility="5.2.006" expanded="true" height="76" name="Numerical to Binominal" width="90" x="514" y="120"/>
          <connect from_op="Read Excel" from_port="output" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
          <connect from_op="Clustering" from_port="clustered set" to_op="Numerical to Binominal" to_port="example set input"/>
          <connect from_op="Numerical to Binominal" from_port="example set output" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>

    But it's not giving me correct results.

    cluster_0 I love food
    cluster_1 washroom stinks
    cluster_2 service is poor
    cluster_0 food is great
    cluster_0 not great service

    The last one should be cluster_2, not cluster_0.

    Please suggest!!!

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    You are processing texts, so you should have a close look at the Text Extension. You'll find links to tutorials in the post linked in my signature.

    Best, Marius
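
The process above appears to pass each whole comment to k-Means as a single nominal value, so two different comments have nothing in common as far as the similarity measure is concerned. Text preprocessing changes that by splitting every comment into words, so that "service is poor" and "not great service" share the term "service". The sketch below illustrates the idea in plain Python, not with RapidMiner operators: it builds length-normalized word-count vectors and runs a basic k-means on them. The stop-word list, the function names, and the choice of the first three comments as starting centroids are assumptions for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical minimal stop-word list; a real one would be much longer.
STOPWORDS = {"i", "is"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def vectorize(comments):
    """Turn each comment into a unit-length word-count vector."""
    token_lists = [tokenize(c) for c in comments]
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vector = [float(counts[term]) for term in vocabulary]
        length = math.sqrt(sum(v * v for v in vector)) or 1.0
        vectors.append([v / length for v in vector])
    return vocabulary, vectors

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, initial_indices, max_iterations=20):
    """Basic k-means; the starting centroids are the vectors at initial_indices."""
    centroids = [list(vectors[i]) for i in initial_indices]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster became empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# The five comments from the post above.
comments = [
    "I love food",
    "washroom stinks",
    "service is poor",
    "food is great",
    "not great service",
]
vocabulary, vectors = vectorize(comments)
labels = kmeans(vectors, initial_indices=[0, 1, 2])
for label, comment in zip(labels, comments):
    print(f"cluster_{label} {comment}")
```

With these starting centroids, "not great service" ends up in the same cluster as "service is poor". That comment also shares "great" with "food is great", so on a data set this small the outcome depends on the starting points and the stop-word list, and the clusters still need to be inspected by hand.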
  • gunjanamit Member Posts: 28 Contributor II

    I can't find the link. Please post it again.

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Just click my signature where it says in big red letters "click here" and read the first item in the linked post.