[SOLVED] Assigning "Topics" to Text Clusters

dynera Member Posts: 14 Contributor II
Hi All,

I recently used the k-Means operator in a process to cluster several thousand message board posts.  Now that I have the clusters, I'd like to somehow "classify" them based on the type of content/key words they contain.

Any recommendations on how best to do this?  I read a paper on the ROCK algorithm, which somehow assigns topics to documents based on key word frequency, but it doesn't appear that this algorithm is available in RapidMiner.

Also, how do I know if I am producing a reasonable number of clusters with k-Means for my content?



    Skirzynski Member Posts: 164 Maven

    There is no direct way to derive a topic from the documents, but there are several ways to determine characteristic words for a cluster, which can serve as a description or topic for that cluster.

    For instance, if you used TF-IDF to create the word vectors, the value of a particular attribute represents the relevance of that term in the document. The "Extract Cluster Prototypes" operator creates one representative for every cluster. This example set can be transposed, so every cluster column can be sorted to get the top 5 relevant words, which roughly describe the cluster.

    Another approach would be to use the cluster as a label and apply a classification learner that returns a weight vector. This weight vector can also be used to determine the relevance of an attribute (i.e. a term).

    Regarding your second question: choosing the best number of clusters means deciding whether a given clustering is better than another. In contrast to classification, we do not know the ground truth, so this decision is not easy at all. Since a reasonable definition of a good clustering is to group objects that are similar and separate objects that are dissimilar, RapidMiner offers an operator called "Cluster Density Performance" that reflects this definition. So, use the "Optimize Parameters" operator and place this performance operator inside it (to get the similarity IOObject, try the "Data to Similarity" operator).
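
    The optimize-k loop can be sketched in plain Python. RapidMiner's "Cluster Density Performance" operator is not available here, so the silhouette score stands in as an analogous cohesion-versus-separation measure, and synthetic blobs stand in for the word vectors:

```python
# Try several values of k and keep the one with the best cluster quality.
# Silhouette score substitutes for RapidMiner's density-based performance
# measure; make_blobs provides toy numeric data in place of word vectors.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # cohesion vs. separation

best_k = max(scores, key=scores.get)
print("silhouette per k:", scores)
print("best k:", best_k)
```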

    dynera Member Posts: 14 Contributor II
    Hi Marcin,

    Thanks for replying to my post!

    I used your suggestion and added "Extract Cluster Prototypes" to my process.  I simply connected the operator to the end of my k-Means operator, but unfortunately I got an error message: "Process failed. Duplicate attribute name: cluster."

    It appears that the k-Means operator generates an attribute called "cluster", which the Extract Cluster Prototypes operator doesn't like.  Do you know a way around this?

    Thanks again!  ;D

    Skirzynski Member Posts: 164 Maven
    Works for me, even with a cluster attribute. If you post a minimal example of your failing process (please use the code-tags), I can take a look.
    dynera Member Posts: 14 Contributor II
    Here's my process.  Thanks again for taking a look at it.  Let me know if you need additional information, Marcin.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <process expanded="true" height="685" width="1016">
          <operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Clarity_XOG_GEL_WSDL" width="90" x="45" y="300">
            <parameter key="repository_entry" value="../Data/Clarity_XOG_GEL_WSDL"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="179" y="210">
            <parameter key="add_meta_information" value="false"/>
            <parameter key="keep_text" value="true"/>
            <parameter key="prune_method" value="absolute"/>
            <parameter key="prune_below_absolute" value="2"/>
            <parameter key="prune_above_absolute" value="9999"/>
            <list key="specify_weights"/>
            <process expanded="true" height="550" width="728">
              <operator activated="true" class="web:extract_html_text_content" compatibility="5.2.003" expanded="true" height="60" name="Extract Content" width="90" x="112" y="30">
                <parameter key="minimum_text_block_length" value="3"/>
              </operator>
              <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="45" y="120"/>
              <operator activated="true" class="text:transform_cases" compatibility="5.2.004" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="210"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="45" y="300"/>
              <operator activated="true" class="text:replace_tokens" compatibility="5.2.004" expanded="true" height="60" name="Replace Tokens" width="90" x="246" y="300">
                <list key="replace_dictionary">
                  <parameter key="chris" value=" "/>
                  <parameter key="clarity" value=" "/>
                  <parameter key="ca" value=" "/>
                  <parameter key="com" value=" "/>
                  <parameter key="hi" value=" "/>
                  <parameter key=" munity" value=" "/>
                  <parameter key=" munities" value=" "/>
                  <parameter key="use" value=" "/>
                  <parameter key="user" value=" "/>
                </list>
              </operator>
              <operator activated="false" class="text:stem_snowball" compatibility="5.2.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="581" y="480"/>
              <operator activated="true" class="text:filter_by_length" compatibility="5.2.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="380" y="300">
                <parameter key="min_chars" value="2"/>
              </operator>
              <operator activated="false" class="text:generate_n_grams_terms" compatibility="5.2.004" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="581" y="390">
                <parameter key="max_length" value="3"/>
              </operator>
              <connect from_port="document" to_op="Extract Content" to_port="document"/>
              <connect from_op="Extract Content" from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Replace Tokens" to_port="document"/>
              <connect from_op="Replace Tokens" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
              <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="k_means" compatibility="5.2.008" expanded="true" height="76" name="Clustering" width="90" x="313" y="165">
            <parameter key="k" value="5"/>
          </operator>
          <operator activated="true" class="extract_prototypes" compatibility="5.2.008" expanded="true" height="76" name="Extract Cluster Prototypes" width="90" x="581" y="345"/>
          <connect from_op="Clarity_XOG_GEL_WSDL" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Clustering" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="word list" to_port="result 4"/>
          <connect from_op="Clustering" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
          <connect from_op="Clustering" from_port="clustered set" to_port="result 3"/>
          <connect from_op="Extract Cluster Prototypes" from_port="example set" to_port="result 1"/>
          <connect from_op="Extract Cluster Prototypes" from_port="model" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
          <portSpacing port="sink_result 5" spacing="0"/>
        </process>
      </operator>
    </process>

    Skirzynski Member Posts: 164 Maven

    Since I do not have access to your data I cannot reproduce your error, but the parameters look correct, and with my own input it works. So you would have to post a self-running process that shows this error, or just take a look at the data yourself, especially at attributes named "cluster". ;)

    Best regards
    dynera Member Posts: 14 Contributor II
    Hi Marcin,

    I found the culprit.  My text set was so large that it just so happened to contain a token called "cluster."  Thanks for the advice!

    I successfully ran the process and generated the example set from the cluster prototype operator.

    Do you know a way to filter or select the top words in a cluster from the example set?  I tried writing the output to Excel and then transposing rows to columns so I could filter the columns, but there were too many columns for Excel to handle.


    dynera Member Posts: 14 Contributor II
    Never mind, Marcin.  I found the "Transpose" operator  :)
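
    For anyone following along, the transpose-then-filter step looks like this in pandas terms. The prototype table, cluster names, and term values below are made up to illustrate the shape of the data:

```python
# Transpose a cluster-prototype table so each cluster becomes a sortable
# column, then keep the top terms per cluster. Toy values for illustration.
import pandas as pd

# One row per cluster prototype, one column per term (e.g. averaged TF-IDF).
prototypes = pd.DataFrame(
    {"xml": [0.6, 0.0], "export": [0.5, 0.1],
     "report": [0.0, 0.7], "schedule": [0.1, 0.6]},
    index=["cluster_0", "cluster_1"],
)

# After transposing, sorting a cluster's column yields its top words.
by_term = prototypes.T
for cluster in by_term.columns:
    print(cluster, "->", by_term[cluster].nlargest(2).index.tolist())
```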

    MarcosRL Member Posts: 53 Contributor II
    Hello, I am trying to extract the characteristic words of the clusters. It works with small data sets, but with a large data set I get the following error: "Duplicate attribute name: cluster".
    Is the problem in the data set? How could I fix it?
    MarcosRL Member Posts: 53 Contributor II
    I solved it with the "Replace Tokens" operator on the data set: I replaced the token "cluster" with an equivalent name to avoid the duplicate-attribute conflict  8)
    Regards  :D