How to reuse preprocessing results across a range of k-means clustering runs

albertoarenal Member Posts: 10 Contributor II
edited December 2018 in Help

Hi all,

 

I am running a k-means clustering analysis on several groups of documents, and I would like to evaluate the clustering performance for different values of k (k = 4 to 20) by comparing their respective Davies-Bouldin indexes.

 

Before the clustering algorithm, I apply several preprocessing steps (transform cases, tokenize, filter stopwords, stemming, and so on, creating a TF-IDF vector). The output of these preprocessing steps is always the same for a given group of texts (a general view of the process is attached).

 

Right now I run the whole process once for each value of k. I would like to avoid repeating the preprocessing, which is the same for each group of texts, every time I run the clustering and calculate the Davies-Bouldin index. That would save a lot of time.

 

Thank you very much in advance

Alberto

Best Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Solution Accepted

    Just add a loop after the preprocessing steps that runs k-means and saves the output you want, and cycle through the k values you are interested in using a loop macro.

     

    An alternative would be to Store the results after preprocessing and then create a separate process that starts by Retrieving that dataset before each run of the clustering (also within a loop). Either approach should work.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • nmahesh Member Posts: 3 Contributor I
    Solution Accepted

    Hi Alberto,

     

    Have you tried using the Store operator to save the preprocessing output? I would then create separate processes to try out different parameter settings for your clustering and performance evaluation.

     

    Best,

    Nithin Mahesh
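
For readers who think in code, the Store/Retrieve idea from the two answers above can be sketched outside RapidMiner in Python. This is only an illustration under stated assumptions: `preprocess`, `load_or_preprocess`, the tiny stopword list, and the cache file name are all hypothetical, and the preprocessing here produces plain term counts (no stemming, no TF-IDF weighting) to keep the example short.

```python
import pickle
import re
import tempfile
from collections import Counter
from pathlib import Path

# Tiny illustrative stopword list (an assumption, not RapidMiner's list).
STOPWORDS = {"the", "a", "of", "and", "to", "is", "on"}


def preprocess(docs):
    """Lowercase, tokenize, and drop stopwords; return one term-count Counter per document."""
    vectors = []
    for text in docs:
        tokens = re.findall(r"[a-z]+", text.lower())
        vectors.append(Counter(t for t in tokens if t not in STOPWORDS))
    return vectors


def load_or_preprocess(docs, cache_path):
    """Return the stored vectors if the cache file exists; otherwise preprocess once and store them."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as fh:   # the "Retrieve" step
            return pickle.load(fh)
    vectors = preprocess(docs)              # the expensive step, done only once
    with cache_path.open("wb") as fh:       # the "Store" step
        pickle.dump(vectors, fh)
    return vectors


docs = ["The cat sat on the mat", "A dog chased the cat", "Stocks fell and bonds rose"]
cache_file = Path(tempfile.mkdtemp()) / "vectors.pkl"

first = load_or_preprocess(docs, cache_file)   # computes and stores
second = load_or_preprocess(docs, cache_file)  # reads the stored copy, no preprocessing
```

Every later clustering run, for any k, would start from the stored vectors rather than from the raw texts.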

Answers

  • albertoarenal Member Posts: 10 Contributor II

    Thank you Brian,

    I'm a beginner with RapidMiner and I had not considered the option of storing and retrieving the output of the preprocessing steps. It is a very good option and I'm sure it will save me a lot of time.

     

    I don't want to take up much of your time, but I had already considered using a loop to try different values of k and could not find the right way to implement it. Could you provide an example? I tried placing the cluster loop operator between the Retrieve operator and the clustering operator, but I don't know how to change k.

     

    Thanks again
    alberto

     

  • albertoarenal Member Posts: 10 Contributor II

    Thank you Nithin. Both Brian's proposal and yours about storing and retrieving the output of the preprocessing steps have been very useful.

    Alberto

     

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Sure, here's a sample process with k-means clustering and the Loop Parameters operator.

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="7.5.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="7.5.003" expanded="true" height="68" name="Retrieve Sonar" width="90" x="45" y="85">
            <parameter key="repository_entry" value="//Samples/data/Sonar"/>
          </operator>
          <operator activated="true" class="loop_parameters" compatibility="7.5.003" expanded="true" height="103" name="Loop Parameters" width="90" x="246" y="85">
            <list key="parameters">
              <parameter key="Clustering.k" value="[2.0;10;8;linear]"/>
            </list>
            <process expanded="true">
              <operator activated="true" class="k_means" compatibility="7.5.003" expanded="true" height="82" name="Clustering" width="90" x="313" y="85">
                <parameter key="k" value="10"/>
              </operator>
              <connect from_port="input 1" to_op="Clustering" to_port="example set"/>
              <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
              <connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
              <portSpacing port="sink_result 1" spacing="0"/>
              <portSpacing port="sink_result 2" spacing="0"/>
              <portSpacing port="sink_result 3" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve Sonar" from_port="output" to_op="Loop Parameters" to_port="input 1"/>
          <connect from_op="Loop Parameters" from_port="result 1" to_port="result 1"/>
          <connect from_op="Loop Parameters" from_port="result 2" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>

  • albertoarenal Member Posts: 10 Contributor II

    Thank you Telcontar120, I will try this. It is very useful, and I really appreciate your help!
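
The overall pattern discussed in this thread (preprocess once, then loop over k and compare Davies-Bouldin indexes) can also be sketched in plain Python. This is a minimal sketch, not RapidMiner's implementation: `kmeans` and `davies_bouldin` are hypothetical helper names, the 2-D points are made-up data standing in for the stored TF-IDF vectors, the initial centroids are picked deterministically for illustration (real tools typically use random or k-means++ initialization), and the loop covers k = 2 to 6 for brevity rather than the 4 to 20 from the question. It uses the standard definition of the index, where lower values indicate better-separated clusters.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=100):
    """Plain Lloyd's algorithm with evenly spaced data points as initial centroids."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        # Recompute each centroid as the mean of its members (keep it if the cluster is empty).
        new = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new.append(centroids[j])
        if new == centroids:
            break
        centroids = new
    return labels, centroids


def davies_bouldin(points, labels, centroids):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid-distance ratio."""
    k = len(centroids)
    scatter = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        scatter.append(sum(dist(p, centroids[j]) for p in members) / len(members) if members else 0.0)
    total = 0.0
    for i in range(k):
        worst = 0.0
        for j in range(k):
            d = dist(centroids[i], centroids[j])
            if j != i and d > 0:
                worst = max(worst, (scatter[i] + scatter[j]) / d)
        total += worst
    return total / k


# Made-up data: three tight 3x3 grids of points around (0, 0), (10, 0) and (0, 10).
offsets = (-0.5, 0.0, 0.5)
vectors = [(cx + dx, cy + dy)
           for cx, cy in [(0, 0), (10, 0), (0, 10)]
           for dx in offsets for dy in offsets]

# The "preprocessing output" (vectors) is computed once, above; only clustering is repeated.
scores = {}
for k in range(2, 7):
    labels, centroids = kmeans(vectors, k)
    scores[k] = davies_bouldin(vectors, labels, centroids)

best_k = min(scores, key=scores.get)  # lowest Davies-Bouldin index
for k, score in scores.items():
    print(f"k={k}: Davies-Bouldin = {score:.4f}")
print("best k:", best_k)
```

The key point is the placement of the loop: everything expensive happens before it, and only the clustering and the index calculation sit inside it.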
