"unsupervised cluster evaluation"

nguyenxuanhau Member Posts: 22 Contributor II
edited June 2019 in Help
Hi!
Can I compare unsupervised cluster evaluations (cluster model evaluations) with each other on unlabeled data in RapidMiner?
If so, what do I have to do to compare these cluster evaluations with each other on unlabeled data in RapidMiner?
Best regards

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    if you want to compare how closely two clustering outcomes match each other, you can simply rename the first cluster attribute and assign it the role "label" before actually running the second clustering. If you then set the role of the second cluster attribute to "prediction", you can use the standard accuracy measure to quantify how well the two agree.
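    For readers working outside RapidMiner, here is a minimal Python sketch of the same idea (NumPy, SciPy and scikit-learn assumed; the toy data is made up). Since cluster IDs from two runs are arbitrary, the sketch matches them with the Hungarian algorithm before computing accuracy:

        # Minimal sketch: treat one clustering as "label", the other as "prediction",
        # match the arbitrary cluster IDs, then measure accuracy.
        import numpy as np
        from scipy.optimize import linear_sum_assignment
        from sklearn.metrics import confusion_matrix, adjusted_rand_score

        def clustering_accuracy(labels_a, labels_b):
            """Accuracy between two clusterings after optimally matching cluster IDs."""
            cm = confusion_matrix(labels_a, labels_b)
            rows, cols = linear_sum_assignment(-cm)   # maximise the matched counts
            return cm[rows, cols].sum() / cm.sum()

        # toy example: two clusterings of the same six instances
        a = np.array([0, 0, 1, 1, 2, 2])
        b = np.array([2, 2, 0, 0, 1, 1])              # same grouping, renamed IDs
        print(clustering_accuracy(a, b))              # 1.0
        print(adjusted_rand_score(a, b))              # 1.0, permutation-invariant alternative

    The adjusted Rand index needs no ID matching at all, which often makes it the more convenient agreement measure.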

    Greetings,
    Sebastian
  • nguyenxuanhau Member Posts: 22 Contributor II
    Hi,
    could you please explain in more detail how to do that? And how would this method choose the best clustering for my data (my data is large but unlabeled)?
    Best regards
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    it doesn't. How could you know which clustering is the best? By guessing?
    There are some cluster evaluation heuristics available, but as the name says: they are just heuristics.
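    To make that kind of heuristic concrete, here is a minimal scikit-learn sketch (not part of the original reply; the data and cluster counts are purely illustrative) that scores clusterings with two common internal indices, the silhouette coefficient and the Davies-Bouldin index:

        # Internal validity indices: they score a clustering from the data alone,
        # without any ground-truth labels, so they remain heuristics.
        from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs
        from sklearn.metrics import silhouette_score, davies_bouldin_score

        X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

        for k in (2, 3, 4, 5, 6):
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
            print(k,
                  round(silhouette_score(X, labels), 3),      # higher is better
                  round(davies_bouldin_score(X, labels), 3))  # lower is better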

    Greetings,
      Sebastian
  • nguyenxuanhau Member Posts: 22 Contributor II
    So, how do I compare two clustering algorithms, in order to find out whether one algorithm is better than the other (on unlabeled data)?
    Best regards
  • dragoljub Member Posts: 241 Contributor II
    Clustering depends on the similarity measure you choose. If you know what it means for two samples to be similar, almost any clustering method will give you good results. It's more about the similarity metric than the algorithm. Algorithms just have different ways of handling special cases, for example when you don't know the number or size of the clusters. Either way, there is no easy answer to comparing two clustering methods.  ;D
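    As a small illustration of that point (a sketch with made-up data, assuming scikit-learn): running the same algorithm with different distance measures typically produces different cluster assignments.

        # Same algorithm, different similarity/distance measures, different clusters.
        # The 'metric' parameter requires scikit-learn >= 1.2 (formerly 'affinity').
        import numpy as np
        from sklearn.cluster import AgglomerativeClustering

        rng = np.random.default_rng(0)
        X = rng.normal(size=(20, 5))                  # made-up data

        for metric in ("euclidean", "cosine", "manhattan"):
            labels = AgglomerativeClustering(n_clusters=3, metric=metric,
                                             linkage="average").fit_predict(X)
            print(metric, labels)                     # the label vectors usually differ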
  • dan_agape Member Posts: 106 Maven
    Particular solutions may be found on a case-by-case basis.

    For instance, if the data is numeric and tends to form centre-based clusters (data visualisation may give you an indication), then solutions based on the same number of clusters can be compared using the so-called squared error, i.e. the sum of squared distances from the data instances to their corresponding cluster centre (each centre being computed by averaging the column values of the instances in that cluster). A smaller squared error means a better clustering. This method is used even when the same algorithm can produce more than one solution (as with the K-Means algorithm). The method may be partly extended to mixed (numeric and non-numeric) data, in which case specific metrics replace the Euclidean distance, as for instance in the K-Medoids algorithm, which extends K-Means.
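    A minimal NumPy sketch of that squared-error comparison (the clusterings are assumed to be given as integer label arrays over the same data; the commented-out names at the end are only placeholders):

        import numpy as np

        def squared_error(X, labels):
            """Sum of squared distances from each instance to its cluster centre."""
            sse = 0.0
            for c in np.unique(labels):
                members = X[labels == c]
                centre = members.mean(axis=0)         # centre = column averages
                sse += ((members - centre) ** 2).sum()
            return sse                                # smaller = tighter clustering

        # Compare two solutions with the same number of clusters on the same data:
        # sse_a = squared_error(X, labels_from_algorithm_a)
        # sse_b = squared_error(X, labels_from_algorithm_b)
        # (scikit-learn's KMeans exposes the same quantity as .inertia_)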

    Another solution may be based on the idea of evaluating the result of an unsupervised clustering via a supervised learning evaluation. You can cluster your data, obtaining a new column - let us call it clusterNo. Then you learn a decision tree (or another model produced by supervised learning) using clusterNo as your label/output attribute, and then you evaluate this model/tree. The accuracy of the model may give an indication of the quality of the clustering. Obviously, no method based on heuristics is perfect, but it may be quite useful in practice.
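    A sketch of that cluster-then-classify evaluation, assuming scikit-learn (the data, the choice of K-Means and the number of clusters are only illustrative):

        from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs
        from sklearn.model_selection import cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

        # cluster the data, obtaining the new "clusterNo" column
        cluster_no = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

        # learn a decision tree with clusterNo as the label and evaluate it;
        # a higher cross-validated accuracy suggests better-separated clusters
        scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, cluster_no, cv=5)
        print(scores.mean())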

    Dan
         
  • dan_agape Member Posts: 106 Maven
    Just to add that k-Nearest Neighbours may often be a good choice here as the supervised learning technique. The accuracy of its model intuitively indicates the likelihood that close instances were placed in the same cluster (which is what you would expect from a good clustering).
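    The same sketch as in the previous post, with k-NN swapped in (again assuming scikit-learn; the data, k and the clustering algorithm are only illustrative):

        from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs
        from sklearn.model_selection import cross_val_score
        from sklearn.neighbors import KNeighborsClassifier

        X, _ = make_blobs(n_samples=500, centers=4, random_state=1)
        cluster_no = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

        # high k-NN accuracy on the cluster column means close instances
        # tend to share a cluster, as described above
        print(cross_val_score(KNeighborsClassifier(n_neighbors=5), X, cluster_no, cv=5).mean())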

    Dan