Clustering of the Text
gunjanamit
Member Posts: 28 Contributor II
I want to cluster survey comments into different categories, for example:
Comment                       Category
Restrooms Stinks              FMG
Food was costly               Restaurant
Poor service in restaurant    Restaurant
I want to read the comments from Excel and write them back to Excel together with the category.
Can anyone please suggest how to do this?
Answers
If you already know which categories you are looking for, you should label your training data manually with these categories and then train a classification algorithm on it. An SVM could be a good choice for text data.
If you can't or don't want to label your data, just run a clustering algorithm like k-Means on your preprocessed documents, and have a look at the clusters afterwards to see if they make sense for you.
Best, Marius
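To make "preprocessed documents" a bit more concrete: a clustering algorithm like k-Means works on numeric vectors, so each comment first has to be split into words and turned into a vector of term counts. The following is only a minimal sketch of that idea in plain Python, not a RapidMiner process; the function names `tokenize` and `bag_of_words` are made up for illustration, and the comments are the sample rows from the question.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a comment and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(comments):
    """Return (vocabulary, vectors) with one term-count vector per comment.

    Hypothetical helper for illustration only; it does no stop word
    removal or stemming.
    """
    tokenized = [tokenize(comment) for comment in comments]
    # The vocabulary is the sorted set of all tokens seen in any comment.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


comments = ["Restrooms Stinks", "Food was costly", "Poor service in restaurant"]
vocabulary, vectors = bag_of_words(comments)
for comment, vector in zip(comments, vectors):
    print(comment, vector)
```

Vectors like these (often reweighted, for example with TF-IDF) are what a clustering or classification algorithm would actually receive as input.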
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.006">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.006" expanded="true" name="Process">
<process expanded="true" height="252" width="681">
<operator activated="true" class="read_excel" compatibility="5.2.006" expanded="true" height="60" name="Read Excel" width="90" x="45" y="75">
<parameter key="excel_file" value="C:\Users\guagg\Desktop\All\RapidMiner\read.xls"/>
<parameter key="imported_cell_range" value="A1:A6"/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
<operator activated="true" class="k_means" compatibility="5.2.006" expanded="true" height="76" name="Clustering" width="90" x="313" y="75">
<parameter key="add_as_label" value="true"/>
<parameter key="remove_unlabeled" value="true"/>
<parameter key="k" value="3"/>
<parameter key="measure_types" value="NominalMeasures"/>
<parameter key="nominal_measure" value="RussellRaoSimilarity"/>
<parameter key="divergence" value="GeneralizedIDivergence"/>
</operator>
<operator activated="true" class="numerical_to_binominal" compatibility="5.2.006" expanded="true" height="76" name="Numerical to Binominal" width="90" x="514" y="120"/>
<connect from_op="Read Excel" from_port="output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Numerical to Binominal" to_port="example set input"/>
<connect from_op="Numerical to Binominal" from_port="example set output" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
But it's not giving me correct results.
Results
cluster_0 I love food
cluster_1 washroom stinks
cluster_2 service is poor
cluster_0 food is great
cluster_0 not great service
The last one should be in cluster_2, not cluster_0.
Please suggest!!!
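Two things may be worth checking here. First, the process above connects Read Excel directly to the clustering operator, so each comment appears to be compared as one whole nominal value and not word by word. Second, even with word-level vectors this particular comment is ambiguous. The sketch below is plain Python, not RapidMiner, and assumes a simple binary bag-of-words model with cosine similarity (an assumption for illustration, not what the posted process computes). It compares "not great service" with the other four comments.

```python
import math
import re


def tokens(text):
    """Lowercase a comment and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two comments seen as binary bag-of-words vectors."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    # For binary vectors the dot product is the number of shared tokens.
    return len(ta & tb) / math.sqrt(len(ta) * len(tb))


target = "not great service"
others = ["I love food", "washroom stinks", "service is poor", "food is great"]
similarities = {other: cosine(target, other) for other in others}
for other, sim in similarities.items():
    print(f"{other!r}: {sim:.3f}")
```

Under these assumptions "not great service" shares exactly one word with "service is poor" ("service") and one with "food is great" ("great"), so both similarities come out equal. A purely word-based clustering has no strong reason to prefer the service cluster, and with only five very short comments the result can easily flip.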
Best, Marius
I can't find the link. Please post it again.
Regards
gunjan