X means always uses minimum cluster amount

lizzie_a_martin Member Posts: 2 Contributor I
edited December 2018 in Help

Hi, I am pretty new to RapidMiner, so I apologize if this question is trivial.  I am trying to use X-Means to cluster some text files.  At first I was using k-Means, but I didn't know how many clusters to use, so I decided to try X-Means instead.  However, the X-Means operator always picks the minimum number of clusters in the given range.  This doesn't seem correct to me, so I'm wondering whether some of my settings are wrong. Here are the settings I am using:

 

add cluster attribute is checked

k min: 2

k max: 60

measure types: NumericalMeasures

numerical measure: CosineSimilarity

clustering algorithm: KMeans

max runs: 100

max optimization steps: 100

 

I have 150 text files that I am trying to cluster; maybe that is not enough? Any thoughts and tips would be greatly appreciated!

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @lizzie_a_martin - welcome to the community.  Can you please post your XML in this thread so we can see what you are doing?  Instructions are on the right (see "Read Before Posting" #2).

     

    Scott

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Assuming there isn't a problem with your process, it's probably because you don't have very many examples for clustering, or because they are simply too similar to one another, so X-Means always falls back to the simplest clustering scheme. You should also make sure that you've normalized the data beforehand, because clustering is sensitive to the absolute ranges of the attributes. If you have any attributes other than the word vector created by TF-IDF, differences in scale could be distorting the algorithm as well.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
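To make the point about scale concrete, here is a minimal Python sketch (not RapidMiner code). The three rows and the extra "age" column are made up for illustration; they are not taken from the poster's data. It shows how one large-valued attribute can make very different documents look almost identical under cosine similarity, which is the measure used in this thread.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical rows: [age, tfidf_word_1, tfidf_word_2]
doc_a = [25.0, 0.9, 0.1]
doc_b = [26.0, 0.1, 0.9]  # similar age, very different words

# Similarity with the unscaled age column included
full_ab = cosine_similarity(doc_a, doc_b)

# Similarity using only the word columns
words_ab = cosine_similarity(doc_a[1:], doc_b[1:])

print(f"with age column:   {full_ab:.4f}")
print(f"word columns only: {words_ab:.4f}")
```

With the age column included, the two rows come out nearly identical even though their word weights are almost opposite. If every pair of examples looks that similar, a clustering algorithm has little reason to split the data further.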
  • lizzie_a_martin Member Posts: 2 Contributor I

    Thank you for the response! I'll try with more text files now.  Here is my XML as requested. I'm not sure I'm doing what you describe about normalizing.  I assume that's another operator that I need?

     

    [code]

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="text:process_document_from_file" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Files (2)" width="90" x="112" y="34">
    <list key="text_directories">
    <parameter key="SampleSet" value="C:\Users\Lizzi\Desktop\Sample Data"/>
    </list>
    <parameter key="keep_text" value="true"/>
    <parameter key="prune_method" value="absolute"/>
    <parameter key="prune_below_absolute" value="2"/>
    <parameter key="prune_above_absolute" value="999"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (3)" width="90" x="179" y="85"/>
    <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases (3)" width="90" x="179" y="187"/>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="179" y="289"/>
    <operator activated="true" class="text:stem_porter" compatibility="7.5.000" expanded="true" height="68" name="Stem (Porter)" width="90" x="179" y="391"/>
    <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="447" y="85">
    <parameter key="min_chars" value="2"/>
    <parameter key="max_chars" value="999"/>
    </operator>
    <connect from_port="document" to_op="Tokenize (3)" to_port="document"/>
    <connect from_op="Tokenize (3)" from_port="document" to_op="Transform Cases (3)" to_port="document"/>
    <connect from_op="Transform Cases (3)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
    <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
    <connect from_op="Stem (Porter)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
    <connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="103" name="Multiply" width="90" x="112" y="136"/>
    <operator activated="true" class="x_means" compatibility="7.6.001" expanded="true" height="82" name="X-Means" width="90" x="313" y="187">
    <parameter key="numerical_measure" value="CosineSimilarity"/>
    <parameter key="max_runs" value="100"/>
    </operator>
    <operator activated="true" class="data_to_similarity" compatibility="7.6.001" expanded="true" height="82" name="Data to Similarity" width="90" x="313" y="85">
    <parameter key="measure_types" value="NumericalMeasures"/>
    <parameter key="numerical_measure" value="CosineSimilarity"/>
    </operator>
    <connect from_op="Process Documents from Files (2)" from_port="example set" to_op="Multiply" to_port="input"/>
    <connect from_op="Process Documents from Files (2)" from_port="word list" to_port="result 2"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Data to Similarity" to_port="example set"/>
    <connect from_op="Multiply" from_port="output 2" to_op="X-Means" to_port="example set"/>
    <connect from_op="X-Means" from_port="cluster model" to_port="result 3"/>
    <connect from_op="X-Means" from_port="clustered set" to_port="result 4"/>
    <connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    <portSpacing port="sink_result 5" spacing="0"/>
    </process>
    </operator>
    </process>

    [/code]

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hello @lizzie_a_martin - thanks for posting your XML.  It's hard to see exactly what you have without your test data file ("Sample Data") but I get the general idea.

     

    What @Telcontar120 is saying is that, in order to treat all your attributes equally, you need to make sure each attribute is on the same scale.  If you had ages (say a range from 10-99) and word vectors (range 0-1), the ages would carry far more weight than the words.  If you normalize the ages, they end up on a scale comparable to the others: a z-transformation gives them mean 0 and standard deviation 1, while a range transformation maps them onto 0-1.  The operator in RapidMiner is called "Normalize".
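For anyone unsure how the two normalization methods differ, here is a small Python sketch of the arithmetic. It is not RapidMiner code, the age values are invented, and the use of the population standard deviation is an assumption made for the example (the Normalize operator may compute it differently).

```python
import statistics

# Hypothetical ages spanning the 10-99 range mentioned above
ages = [10.0, 25.0, 40.0, 62.0, 99.0]

def z_scores(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def range_transform(values, lo=0.0, hi=1.0):
    """Linearly map the smallest value to lo and the largest to hi."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

z = z_scores(ages)
r = range_transform(ages)

print("z-scores:       ", [round(v, 3) for v in z])
print("range transform:", [round(v, 3) for v in r])
```

The z-scores are centered on 0 and include negative values, while the range-transformed values sit between 0 and 1, so the range transformation is the one that matches the 0-1 scale of the word vectors.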

     

    As for your concern about k=2 being chosen as optimal, it does not shock me at all.

     

    Scott

     

     
