Text clustering in RapidMiner Studio

dsackin Member Posts: 5 Contributor I
edited October 2019 in Help

I'm trying to do unsupervised clustering of text in RapidMiner (RM). The data is in a CSV file. One attribute is a free-text field that I would like to cluster. I have configured the file as a data source in my repository, marked the text field as type text, and marked the id field as type id. I believe I need to create a word vector for each example in my set, and I think I do this using "Process Documents from Data". I have it set to create word vectors using TF-IDF.

Inside Process Documents, I have a tokenizer, case transformer, stopword filter, stemmer, and n-gram builder in sequence. I wired the output of Process Documents to the input of k-means clustering. Everything runs for a while and then halts with an error that the example set contains non-numeric values in a column. Is there a way to focus the clustering on only the attributes of interest (i.e. the terms found by Process Documents)? Or do I have to filter out the other attributes first?

I also tried switching the k-means measure type to mixed, but then I get an error that I have missing values.

All of the articles I read on clustering text describe the process I'm using, but it doesn't work for me. Please help.

Answers

  • dsackin Member Posts: 5 Contributor I

    Filtering the attributes did the trick. I stripped out everything except the id and text. The resulting term vectors came through and clustered. Progress...

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,249 RM Data Scientist

    Dear dsackin,

    Could you provide an example process? In general, I would recommend clustering only on the values returned by Process Documents, i.e. the TF-IDF values.

     

    ~martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • dsackin Member Posts: 5 Contributor I

    I got this working. I had to use the Select Attributes operator to filter the input to the Process Documents operator down to just an id and the text field. The output of Process Documents was then just the id plus all of the term attributes, holding that document's TF-IDF score for each term.
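
As a rough illustration of why trimming the input to id and text helps, here is a minimal Python sketch (not RapidMiner code). The column names `id`, `text`, and `city`, the toy stopword list, and the TF-IDF formula (raw term count times log(N/df)) are all assumptions made for the example; the Process Documents operator may tokenize, weight, and normalize differently.

```python
import math
import re
from collections import Counter

# Hypothetical rows standing in for the CSV; "city" plays the role of a
# non-numeric attribute that would otherwise reach the clustering step.
rows = [
    {"id": 1, "text": "The boat and the car", "city": "Boston"},
    {"id": 2, "text": "A car and a train", "city": "Salem"},
    {"id": 3, "text": "The plane", "city": "Lynn"},
]

STOPWORDS = {"the", "a", "and"}  # toy list, for illustration only


def tokenize(text):
    """Lower-case, split on non-letters, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(rows):
    """Return {id: {term: tf-idf}} using only the id and text fields."""
    docs = {r["id"]: Counter(tokenize(r["text"])) for r in rows}
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for counts in docs.values() for term in counts)
    vocab = sorted(df)
    return {
        doc_id: {
            term: counts[term] * math.log(n_docs / df[term]) for term in vocab
        }
        for doc_id, counts in docs.items()
    }


vectors = tfidf_vectors(rows)
```

Because only `id` and `text` are read, every value in `vectors` is numeric, which is what a distance-based clusterer such as k-means expects.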

     

    Now I'm trying to figure out how to assign a "top terms" summary to each cluster. I used Extract Cluster Prototypes on the Cluster Model output. I get a new example set with one example per cluster. Each example has a cluster label plus a prototype score for each term. What I would like to do is pivot that so I get a list of terms and scores, which I can then sort and threshold to the top N for each cluster.

     

    Going from this:

    CLUSTER,BOAT,CAR,PLANE,TRAIN
    cluster_0,0.02,0.31,0.23,0.00
    cluster_1,0.22,0.01,0.0,0.0

    To this:

    CLUSTER,TERM,SCORE
    cluster_0,boat,0.02
    cluster_0,car,0.31
    cluster_0,plane,0.23
    cluster_0,train,0.00
    cluster_1,boat,0.22
    ...

     

    Then group by cluster label, sort by score, and output the top N scoring terms.
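
The reshaping described here (wide to long, then top N per cluster) can be sketched outside RapidMiner in plain Python. This is only an illustration of the target transformation, not a RapidMiner process; the function names `wide_to_long` and `top_terms` and the `min_score` threshold parameter are made up for the example, and the input is the small table from this post.

```python
import csv
import io

# Wide prototype table from the post (values copied from the example).
WIDE = """CLUSTER,BOAT,CAR,PLANE,TRAIN
cluster_0,0.02,0.31,0.23,0.00
cluster_1,0.22,0.01,0.0,0.0
"""


def wide_to_long(wide_csv):
    """Turn one row per cluster into one (cluster, term, score) row per cell."""
    reader = csv.DictReader(io.StringIO(wide_csv))
    long_rows = []
    for row in reader:
        cluster = row.pop("CLUSTER")
        for term, score in row.items():
            long_rows.append((cluster, term.lower(), float(score)))
    return long_rows


def top_terms(long_rows, n, min_score=0.0):
    """Group by cluster, sort by score descending, keep the top n above min_score."""
    by_cluster = {}
    for cluster, term, score in long_rows:
        if score > min_score:
            by_cluster.setdefault(cluster, []).append((term, score))
    return {
        cluster: sorted(pairs, key=lambda p: p[1], reverse=True)[:n]
        for cluster, pairs in by_cluster.items()
    }


long_rows = wide_to_long(WIDE)
summary = top_terms(long_rows, n=2)
```

With `n=2`, `summary` maps `cluster_0` to car and plane and `cluster_1` to boat and car. A cluster whose scores all fall at or below `min_score` would not appear in the result at all.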

     

    I tried using both Transpose and Pivot on the Extract Cluster Prototypes results, but can't seem to get to what I think I need. I need help with that, or with some other way to generate descriptive labels for the resulting clusters.

     

    Thanks

     

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,249 RM Data Scientist

    Hi,

     

    Attached is my process to do a similar thing. I usually apply a feature selection technique in a one-vs-all fashion. This answers the question "what are the most distinguishing attributes for this cluster?". I use the top 3 features (= words) as a new name for the cluster.

     

    Taking the cluster centroid is a bit problematic. Just because the centroid has a high value for some attribute does not make that attribute important for the cluster.
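
For readers without RapidMiner at hand, the one-vs-all idea can be sketched in plain Python. This is not the attached process and not the Weight by Correlation operator's actual implementation: the helper names `pearson` and `name_clusters`, the toy data, and the choice to rank by signed correlation and keep only positive weights are assumptions made for the illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either side has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def name_clusters(examples, clusters, top_k=3):
    """For each cluster, correlate every attribute with a this-cluster-vs-rest
    indicator and join the top_k positively correlated attribute names."""
    attributes = list(examples[0])
    names = {}
    for cluster in sorted(set(clusters)):
        indicator = [1.0 if c == cluster else 0.0 for c in clusters]
        weights = {
            a: pearson([ex[a] for ex in examples], indicator) for a in attributes
        }
        # Assumption: only attributes that are higher inside the cluster count.
        ranked = sorted(
            (a for a in attributes if weights[a] > 0),
            key=lambda a: weights[a],
            reverse=True,
        )
        names[cluster] = "|".join(ranked[:top_k])
    return names


# Hypothetical TF-IDF-like rows and cluster assignments.
examples = [
    {"boat": 0.9, "car": 0.5, "plane": 0.0},
    {"boat": 0.8, "car": 0.5, "plane": 0.3},
    {"boat": 0.0, "car": 0.5, "plane": 0.9},
    {"boat": 0.1, "car": 0.5, "plane": 0.8},
]
clusters = ["cluster_0", "cluster_0", "cluster_1", "cluster_1"]
names = name_clusters(examples, clusters, top_k=2)
```

In this toy data `car` has a centroid value of 0.5 in both clusters, yet it never shows up in a cluster name because it does not separate one cluster from the other, which is the concern about centroids raised above.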

     

    Best,

    Martin

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.3.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.3.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.3.000" expanded="true" height="68" name="Retrieve Sonar" width="90" x="45" y="85">
    <parameter key="repository_entry" value="//Samples/data/Sonar"/>
    </operator>
    <operator activated="true" class="k_means" compatibility="7.3.000" expanded="true" height="82" name="Clustering (3)" width="90" x="313" y="34">
    <parameter key="k" value="20"/>
    </operator>
    <operator activated="false" class="optimize_parameters_grid" compatibility="7.3.000" expanded="true" height="103" name="Optimize Parameters (Grid)" width="90" x="313" y="238">
    <list key="parameters">
    <parameter key="Clustering (2).k" value="[2.0;100.0;5;linear]"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="k_means" compatibility="7.3.000" expanded="true" height="82" name="Clustering (2)" width="90" x="112" y="34">
    <parameter key="k" value="100"/>
    </operator>
    <operator activated="true" class="cluster_distance_performance" compatibility="7.3.000" expanded="true" height="103" name="Performance" width="90" x="246" y="34"/>
    <operator activated="true" class="log" compatibility="7.3.000" expanded="true" height="82" name="Log" width="90" x="514" y="34">
    <list key="log">
    <parameter key="k" value="operator.Clustering (2).parameter.k"/>
    <parameter key="Performance" value="operator.Performance.value.DaviesBouldin"/>
    <parameter key="Distance" value="operator.Performance.value.avg_within_distance"/>
    </list>
    </operator>
    <connect from_port="input 1" to_op="Clustering (2)" to_port="example set"/>
    <connect from_op="Clustering (2)" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
    <connect from_op="Clustering (2)" from_port="clustered set" to_op="Performance" to_port="example set"/>
    <connect from_op="Performance" from_port="performance" to_op="Log" to_port="through 1"/>
    <connect from_op="Log" from_port="through 1" to_port="performance"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">&amp;quot;L-Bow Plot&amp;quot; to find k</description>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.3.000" expanded="true" height="82" name="Set Role" width="90" x="447" y="85">
    <parameter key="attribute_name" value="class"/>
    <parameter key="target_role" value="xxx"/>
    <list key="set_additional_roles">
    <parameter key="cluster" value="label"/>
    </list>
    </operator>
    <operator activated="true" class="loop_values" compatibility="7.3.000" expanded="true" height="82" name="Loop Values" width="90" x="581" y="34">
    <parameter key="attribute" value="cluster"/>
    <process expanded="true">
    <operator activated="true" class="replace" compatibility="7.3.000" expanded="true" height="82" name="Replace" width="90" x="45" y="187">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="cluster"/>
    <parameter key="include_special_attributes" value="true"/>
    <parameter key="replace_what" value="%{loop_value}"/>
    <parameter key="replace_by" value="ThisCluster"/>
    </operator>
    <operator activated="true" class="replace" compatibility="7.3.000" expanded="true" height="82" name="Replace (2)" width="90" x="246" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="cluster"/>
    <parameter key="include_special_attributes" value="true"/>
    <parameter key="replace_what" value="cluster_.*"/>
    <parameter key="replace_by" value="OtherCluster"/>
    </operator>
    <operator activated="true" class="weight_by_correlation" compatibility="7.3.000" expanded="true" height="82" name="Weight by Correlation" width="90" x="380" y="34"/>
    <operator activated="false" class="optimize_selection_forward" compatibility="7.3.000" expanded="true" height="103" name="Forward Selection" width="90" x="380" y="85">
    <process expanded="true">
    <operator activated="true" class="x_validation" compatibility="5.1.002" expanded="true" height="124" name="Validation (2)" width="90" x="45" y="30">
    <parameter key="sampling_type" value="2"/>
    <process expanded="true">
    <operator activated="false" class="parallel_decision_tree" compatibility="7.3.000" expanded="true" height="82" name="Decision Tree (3)" width="90" x="45" y="238">
    <parameter key="criterion" value="gini_index"/>
    </operator>
    <operator activated="true" class="k_nn" compatibility="7.3.000" expanded="true" height="82" name="k-NN" width="90" x="112" y="34">
    <parameter key="k" value="10"/>
    </operator>
    <connect from_port="training" to_op="k-NN" to_port="training set"/>
    <connect from_op="k-NN" from_port="model" to_port="model"/>
    <portSpacing port="source_training" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model (5)" width="90" x="45" y="30">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance" compatibility="7.3.000" expanded="true" height="82" name="Performance (5)" width="90" x="179" y="30"/>
    <connect from_port="model" to_op="Apply Model (5)" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model (5)" to_port="unlabelled data"/>
    <connect from_op="Apply Model (5)" from_port="labelled data" to_op="Performance (5)" to_port="labelled data"/>
    <connect from_op="Performance (5)" from_port="performance" to_port="averagable 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_averagable 1" spacing="0"/>
    <portSpacing port="sink_averagable 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">A cross-validation evaluating a decision tree model.</description>
    </operator>
    <connect from_port="example set" to_op="Validation (2)" to_port="training"/>
    <connect from_op="Validation (2)" from_port="averagable 1" to_port="performance"/>
    <portSpacing port="source_example set" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="weights_to_data" compatibility="7.3.000" expanded="true" height="68" name="Weights to Data" width="90" x="514" y="34"/>
    <operator activated="true" class="filter_example_range" compatibility="7.3.000" expanded="true" height="82" name="Filter Example Range" width="90" x="648" y="34">
    <parameter key="first_example" value="1"/>
    <parameter key="last_example" value="3"/>
    </operator>
    <operator activated="true" class="aggregate" compatibility="7.3.000" expanded="true" height="82" name="Aggregate" width="90" x="782" y="34">
    <list key="aggregation_attributes">
    <parameter key="Attribute" value="concatenation"/>
    </list>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.3.000" expanded="true" height="82" name="Generate Attributes" width="90" x="983" y="34">
    <list key="function_descriptions">
    <parameter key="Clustername" value="%{loop_value}"/>
    </list>
    </operator>
    <connect from_port="example set" to_op="Replace" to_port="example set input"/>
    <connect from_op="Replace" from_port="example set output" to_op="Replace (2)" to_port="example set input"/>
    <connect from_op="Replace (2)" from_port="example set output" to_op="Weight by Correlation" to_port="example set"/>
    <connect from_op="Weight by Correlation" from_port="weights" to_op="Weights to Data" to_port="attribute weights"/>
    <connect from_op="Weights to Data" from_port="example set" to_op="Filter Example Range" to_port="example set input"/>
    <connect from_op="Filter Example Range" from_port="example set output" to_op="Aggregate" to_port="example set input"/>
    <connect from_op="Aggregate" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_example set" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="append" compatibility="7.3.000" expanded="true" height="82" name="Append" width="90" x="715" y="34"/>
    <operator activated="true" class="replace_dictionary" compatibility="7.3.000" expanded="true" height="103" name="Replace (Dictionary)" width="90" x="849" y="136">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="cluster"/>
    <parameter key="include_special_attributes" value="true"/>
    <parameter key="from_attribute" value="Clustername"/>
    <parameter key="to_attribute" value="concat(Attribute)"/>
    </operator>
    <connect from_op="Retrieve Sonar" from_port="output" to_op="Clustering (3)" to_port="example set"/>
    <connect from_op="Clustering (3)" from_port="clustered set" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Loop Values" to_port="example set"/>
    <connect from_op="Set Role" from_port="original" to_op="Replace (Dictionary)" to_port="example set input"/>
    <connect from_op="Loop Values" from_port="out 1" to_op="Append" to_port="example set 1"/>
    <connect from_op="Append" from_port="merged set" to_op="Replace (Dictionary)" to_port="dictionary"/>
    <connect from_op="Replace (Dictionary)" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
  • dsackin Member Posts: 5 Contributor I

    Martin,

     

    Thanks for the guidance and sample process. I added your term weighting, concatenation, filtering, and dictionary lookup to my process, but it fails to run. I get an error dialog that says something like "Process failed. There are no obvious errors but you should run in debug mode or check the log".

     

    Here is the log:

    Dec 7, 2016 12:06:38 PM INFO: Loading initial data.
    Dec 7, 2016 12:06:38 PM INFO: Process //Local Repository/processes/datarole/clustering starts
    Dec 7, 2016 12:06:51 PM SEVERE: Process failed: operator cannot be executed. Check the log messages...
    Dec 7, 2016 12:06:51 PM SEVERE: Here:
    Dec 7, 2016 12:06:51 PM SEVERE: Process[1] (Process)
    Dec 7, 2016 12:06:51 PM SEVERE: subprocess 'Main Process'
    Dec 7, 2016 12:06:51 PM SEVERE: +- Retrieve MABostonPlumbing[1] (Retrieve)
    Dec 7, 2016 12:06:51 PM SEVERE: +- Sample[1] (Sample)
    Dec 7, 2016 12:06:51 PM SEVERE: +- Select Attributes[1] (Select Attributes)
    Dec 7, 2016 12:06:51 PM SEVERE: +- Process Documents from Data[1] (Process Documents from Data)
    Dec 7, 2016 12:06:51 PM SEVERE: subprocess 'Vector Creation'
    Dec 7, 2016 12:06:51 PM SEVERE: | +- Transform Cases (3)[200] (Transform Cases)
    Dec 7, 2016 12:06:51 PM SEVERE: | +- Tokenize (3)[200] (Tokenize)
    Dec 7, 2016 12:06:51 PM SEVERE: | +- Filter Stopwords (English)[200] (Filter Stopwords (English))
    Dec 7, 2016 12:06:51 PM SEVERE: | +- Stem (Snowball)[200] (Stem (Snowball))
    Dec 7, 2016 12:06:51 PM SEVERE: | +- Generate n-Grams (Terms)[200] (Generate n-Grams (Terms))
    Dec 7, 2016 12:06:51 PM SEVERE: ==> +- X-Means[1] (X-Means)
    Dec 7, 2016 12:06:51 PM SEVERE: +- Set Role[0] (Set Role)
    Dec 7, 2016 12:06:51 PM SEVERE: +- Loop Values[0] (Loop Values)
    Dec 7, 2016 12:06:51 PM SEVERE: subprocess 'Iteration'
    Dec 7, 2016 12:06:51 PM SEVERE: | +- Replace[0] (Replace)
    Dec 7, 2016 12:06:51 PM SEVERE: | +- Replace (2)[0] (Replace)
    Dec 7, 2016 12:06:51 PM SEVERE: | +- Weight by Correlation[0] (Weight by Correlation)
    Dec 7, 2016 12:06:51 PM SEVERE: | +- Weights to Data[0] (Weights to Data)
    Dec 7, 2016 12:06:51 PM SEVERE: | +- Filter Example Range[0] (Filter Example Range)
    Dec 7, 2016 12:06:51 PM SEVERE: | +- Aggregate[0] (Aggregate)
    Dec 7, 2016 12:06:51 PM SEVERE: | +- Generate Attributes[0] (Generate Attributes)
    Dec 7, 2016 12:06:51 PM SEVERE: +- Append[0] (Append)
    Dec 7, 2016 12:06:51 PM SEVERE: +- Replace (Dictionary)[0] (Replace (Dictionary))
    Dec 7, 2016 12:06:51 PM SEVERE: java.lang.ArrayIndexOutOfBoundsException

     

    I also notice that I have an error on Weight by Correlation within Loop Values. It says "metadata.error.missing_role". However, the input data does have an attribute (cluster) whose role is "label" (applied using the Set Role operator in the parent process). I can verify this on the input connector.

     

    I'm attaching the current process XML. I need to see if I can also post sample data.

  • dsackin Member Posts: 5 Contributor I

    Also, the same (I think) missing-role error is propagated into the parent process on the Loop Values operator input, where it says "The attribute 'cluster' is missing in the input data set", but you can see in the screenshot that it is present.
