"Dynamically determine number of clusters k-means"

Farnoush_r Member Posts: 5 Contributor II
edited June 2019 in Help
Hi,
I want to build a process in RapidMiner that determines the number of clusters automatically and then passes it on to the k-means algorithm. The post below has some great ideas, but it relies on a log table. Is there a way to do this dynamically, for example by creating a macro that holds the calculated number of clusters and gives it to k-means?
http://rapid-i.com/rapidforum/index.php?topic=3447.0

Answers

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    It is possible to convert a log to an example set; use the Log to Data operator. For Davies-Bouldin, you could look for a minimum by sorting this example set by the validity measure and then simply using the value of k in the row that holds that minimum.

    If you are confident that the data is well behaved in all cases then you could try that.

    Regards

    Andrew
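
A minimal sketch of the "sort by the validity measure and take k" step, written in Python rather than as a RapidMiner process. The rows stand in for what Log to Data would produce; the column names `k` and `davies_bouldin` and all the numbers are made up for illustration, not taken from a real run.

```python
# Hypothetical log rows: one entry per tested k with its validity measure.
# Both the field names and the values are illustrative assumptions.
log_rows = [
    {"k": 2, "davies_bouldin": 0.91},
    {"k": 3, "davies_bouldin": 0.64},
    {"k": 4, "davies_bouldin": 0.52},
    {"k": 5, "davies_bouldin": 0.70},
]

# Sort ascending by the validity measure so the minimum comes first.
ranked = sorted(log_rows, key=lambda row: row["davies_bouldin"])

# The k of the first row is the one associated with the minimum.
best_k = ranked[0]["k"]
print(best_k)
```

After this step `best_k` is an ordinary value that a later clustering step can take as its k, instead of a number typed in by hand.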
  • Farnoush_r Member Posts: 5 Contributor II
    Thank you for your helpful response, but I have two more questions. First, I followed your suggestion and now have an example set that identifies the best number of clusters. Is it possible to feed this into a clustering operator so that it reads the number of clusters from the data? I think k has to be set in the clustering operator itself and cannot be read from an outside source.

    Second, I did not understand your concern about my data. As far as I can tell, I am determining k each time based on the imported data, so for any data set the process determines the best k. What is the problem?
  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    For the first question, use the Extract Macro operator to get the data value of a particular attribute and example within an example set. Use that macro later however you want.

    For the second question, the Davies-Bouldin validity measure is a mathematical criterion that favours clusters which are individually compact (relatively little scatter) and maximally separated from one another. Who is to say whether this criterion matches what truly is the best clustering for your data?
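
To make concrete what the measure rewards, here is a small from-scratch sketch of the Davies-Bouldin index in plain Python. It assumes Euclidean distance, takes each cluster's scatter as the mean distance of its members to the centroid, and needs at least two clusters with distinct centroids. The function name, the toy points and both labelings are invented for illustration, and RapidMiner's own implementation may differ in details.

```python
from math import dist

def davies_bouldin(points, labels):
    """Mean over clusters of the worst (largest) ratio
    (scatter_i + scatter_j) / distance(centroid_i, centroid_j)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    # Centroid: coordinate-wise mean of the cluster members.
    centroids = {
        label: tuple(sum(coords) / len(members) for coords in zip(*members))
        for label, members in clusters.items()
    }
    # Scatter: mean distance of the members to their centroid.
    scatter = {
        label: sum(dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }

    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)

# Toy data (assumed): two visually obvious groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
compact = [0, 0, 0, 1, 1, 1]   # labels that follow the two groups
mixed = [0, 1, 0, 1, 0, 1]     # labels that cut across the groups

db_compact = davies_bouldin(points, compact)
db_mixed = davies_bouldin(points, mixed)
print(db_compact < db_mixed)
```

The labeling that follows the two groups scores lower because its clusters are tight and far apart. That is all the index checks, so on data without compact, well-separated groups the k it prefers need not be the most meaningful one.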