I am also interested in how to apply the "elbow criterion" to a set of clusters (text clusters in this case).

## Answers

Wikipedia describes the "elbow criterion" as:

...a method that looks at the percentage of variance explained as a function of the number of clusters: You should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data. More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion".

This link shows a visual example: http://upload.wikimedia.org/wikipedia/commons/c/cd/DataClustering_ElbowCriterion.JPG

Any suggestions are appreciated!

Paul

In the "Samples" repository delivered together with RapidMiner, you can find an example for creating the desired plot:

//Samples/processes/07_Clustering/09_KMeansWithPlot

It uses a parameter iteration over the number of clusters (k) and a Log operator to collect the values of the Davies-Bouldin index (DB) and the average within-cluster distance (W). The process log can then be inspected as a table or plotted immediately. I recommend the plot type "Scatter Multiple" with "k" on the x-axis and both "DB" and "W" on the y-axis. In the settings at the bottom you can even activate lines between the points, which makes the elbow easier to spot.

The result looks like this:

I leave it to you to determine if 3, 4, or 5 clusters should be used in this case ;-)
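
For readers who want to try the same idea outside RapidMiner, here is a minimal Python sketch of such a sweep over k that logs DB and W for each run. It is not the sample process itself: the tiny k-means, the deterministic farthest-first seeding, the use of plain (unsquared) Euclidean distance for W, and the synthetic three-blob data are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def farthest_first(points, k):
    """Deterministic seeding (an assumption of this sketch, not RapidMiner's)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def assign(points, centroids):
    """Put every point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, max_iter=100):
    """Very small k-means; returns (centroids, clusters)."""
    centroids = farthest_first(points, k)
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        # Keep the old centroid if a cluster happens to be empty.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


def avg_within_distance(centroids, clusters):
    """W: average distance of each point to its own centroid."""
    total = sum(math.dist(p, c) for c, pts in zip(centroids, clusters) for p in pts)
    return total / sum(len(pts) for pts in clusters)


def davies_bouldin(centroids, clusters):
    """DB: mean over clusters of the worst (scatter_i + scatter_j) / separation."""
    pairs = [(c, pts) for c, pts in zip(centroids, clusters) if pts]
    scatter = [sum(math.dist(p, c) for p in pts) / len(pts) for c, pts in pairs]
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(ci, cj)
            for j, (cj, _) in enumerate(pairs)
            if j != i
        )
        for i, (ci, _) in enumerate(pairs)
    ]
    return sum(worst) / len(worst)


def log_k_sweep(points, k_values):
    """Collect one log row per k, like a parameter iteration with a log table."""
    rows = []
    for k in k_values:
        centroids, clusters = kmeans(points, k)
        rows.append({
            "k": k,
            "DB": davies_bouldin(centroids, clusters),
            "W": avg_within_distance(centroids, clusters),
        })
    return rows


# Synthetic demo data (made up): three well-separated 2-D blobs of 30 points each.
rng = random.Random(42)
points = [
    (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
    for cx, cy in [(0, 0), (10, 0), (5, 9)]
    for _ in range(30)
]

rows = log_k_sweep(points, range(2, 7))  # DB is undefined for k = 1
for row in rows:
    print(f"k={row['k']}  DB={row['DB']:.3f}  W={row['W']:.3f}")
```

Plotting "W" and "DB" against "k" from these rows gives the same kind of chart as described above: W drops steeply until the natural number of clusters is reached and flattens afterwards, while a low DB value points to compact, well-separated clusters.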

More advanced users could even transform the log table into an example set and try to extract the desired number of clusters automatically, based on the change of angle between consecutive line segments.
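
One possible way to do such an automatic extraction, sketched in plain Python rather than as a RapidMiner process: rescale both axes to the range 0 to 1 (otherwise the angles depend on the units), compute the direction of each line segment, and pick the k where the direction changes most. The function name `find_elbow` and the W values below are hypothetical and only serve as an illustration.

```python
import math


def find_elbow(ks, values):
    """Return the k at which the curve of values over ks bends most sharply.

    Assumes ks is strictly increasing and has at least three entries.
    """
    if len(ks) < 3 or len(ks) != len(values):
        raise ValueError("need at least three (k, value) pairs of equal length")

    # Rescale both axes to [0, 1] so the angles do not depend on the units.
    k_span = ks[-1] - ks[0]
    v_min = min(values)
    v_span = (max(values) - v_min) or 1.0  # avoid dividing by zero for a flat curve
    xs = [(k - ks[0]) / k_span for k in ks]
    ys = [(v - v_min) / v_span for v in values]

    best_k, best_turn = None, -1.0
    for i in range(1, len(ks) - 1):
        # Direction of the segment arriving at point i and of the one leaving it.
        angle_in = math.atan2(ys[i] - ys[i - 1], xs[i] - xs[i - 1])
        angle_out = math.atan2(ys[i + 1] - ys[i], xs[i + 1] - xs[i])
        turn = abs(angle_out - angle_in)
        if turn > best_turn:
            best_k, best_turn = ks[i], turn
    return best_k


# Made-up log values: W falls quickly up to k = 3 and only slowly afterwards.
ks = [1, 2, 3, 4, 5, 6]
w_values = [10.0, 6.0, 2.0, 1.6, 1.3, 1.1]
elbow_k = find_elbow(ks, w_values)
print("elbow at k =", elbow_k)
```

The largest change of direction is a simple heuristic. On noisy curves it can pick a spurious kink, so it is worth checking the plot before trusting the number.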

Cheers,

Ingo

My sincere thanks for taking the time to answer this question! ;D

Paul