Number of Clusters for Support Vector Clustering (SVC)

Muhammed_Fatih_ Member Posts: 93 Maven
edited June 2020 in Help
Dear community, 

I applied the SVC approach to high-dimensional data with the default settings (kernel type: radial) and got only a single cluster as the result. This surprised me a lot.

How can I set the number of clusters for SVC? Related to this, is it possible to evaluate and validate the number of clusters found by SVC with a performance operator in RapidMiner? 

Thanks in advance for your answers! 

Best regards!

Answers

  • Muhammed_Fatih_ Member Posts: 93 Maven
    Can anybody here help with the issue described above? 

    Best regards! 
  • sara20 Member Posts: 110 Unicorn
    edited July 2020
    @Muhammed_Fatih_,

    Hello

    Please post a screenshot of the cluster result. Did you try Auto Model for this?

    Thank you
    Sara
  • Muhammed_Fatih_ Member Posts: 93 Maven
    Hello @sara20

    As far as I know, Auto Model does not offer SVC. 

    I applied the SVC operator to my high-dimensional data set with the following parameter settings: minpts=10, gamma=0.005 and p=0.01. I got the attached clustering result: 


    So which parameter combination is needed, or rather which would you propose, for high-dimensional data? I think this is the fundamental question here. What do you think? 

    I thank you in advance for your feedback! 

    Best regards!
  • sara20 Member Posts: 110 Unicorn
    edited July 2020
    @Muhammed_Fatih_

    Hello

    From my understanding you have 2 clusters, which shows that parts of your data are very similar to each other. When you got 1 cluster, as in your first post, all examples were treated as very similar to each other; with 2 clusters, as in your screenshot, RM was able to divide your data into 2 parts. I think 2 clusters are better than 1. It is also possible to compare the 2 clusters with each other. 
    In the end it depends on your work and your data.

    I hope this helps
    Sara
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    There is no way to explicitly set the number of clusters in advance with SVC.  The point is to let the algorithm detect the appropriate number of clusters from the underlying data.  You can play with the other ML parameters to see whether that changes the number of clusters found (it usually does).  As Sara noted, your results show two clusters now (Java counting starts at zero, so you have cluster 0 and cluster 1).
    If you need to specify the number of clusters in advance, you should try k-means.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Muhammed_Fatih_ Member Posts: 93 Maven
    Hello @sara20
    hello @Telcontar120

    thank you for your interesting feedback! Yes, it is correct that SVC detects two clusters with the default operator parameters.  

    The statement by @Telcontar120 is the one I am especially interested in: 

    "You can play with the other ML parameters to see whether that changes the number of clusters found (it usually does)."

    According to which criteria should these parameter settings be changed? Does the number of input examples considered in the clustering process matter? Which parameters should be changed, and to what extent? So the question is really about what the parameters do in detail. 

    I hope this clarification helped to underline the focus of my question. I thank you in advance for your answers! 

    Best regards & Stay healthy! 
  • sara20 Member Posts: 110 Unicorn
    edited July 2020
    @Muhammed_Fatih_,

    Hello

    It depends on your data. If the examples are very similar to each other, it is very difficult to separate them into different clusters. In general, I think you should look for a central point for each cluster in your data; that way you will understand both your data and your clusters better. Try to visualize your data first, and you will see a lot. You can also draw a curve from your data and use the points where the curve changes to estimate the number of clusters. I recommend that you first cluster your data with Auto Model using K means or C means and then choose the best number of clusters. (I want you to see your data very clearly first and then decide, so the first step is visualization. :):):) )

    For more information:

    This operator is an implementation of Support Vector Clustering based on Ben-Hur et al (2001). In this Support Vector Clustering (SVC) algorithm data points are mapped from data space to a high dimensional feature space using a Gaussian kernel. In feature space the smallest sphere that encloses the image of the data is searched. This sphere is mapped back to data space, where it forms a set of contours which enclose the data points. These contours are interpreted as cluster boundaries. Points enclosed by each separate contour are associated with the same cluster. As the width parameter of the Gaussian kernel is decreased, the number of disconnected contours in data space increases, leading to an increasing number of clusters. Since the contours can be interpreted as delineating the support of the underlying probability distribution, this algorithm can be viewed as one identifying valleys in this probability distribution.

    https://docs.rapidminer.com/latest/studio/operators/modeling/segmentation/support_vector_clustering.html
    http://www.scholarpedia.org/article/Support_vector_clustering
    Kind regards
    Sara
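The kernel-width effect described in the operator documentation can be reproduced on toy data. The following is a minimal sketch, not the RapidMiner operator: it assumes that no outliers are allowed (so a parameter like p plays no role), approximates the enclosing sphere with a simple Frank-Wolfe loop, and uses made-up helper names (rbf, fit_sphere, svc_labels). Here gamma multiplies the squared distance, so a larger gamma means a narrower kernel; whether the operator defines gamma the same way should be checked in its help text.

```python
import math

def rbf(a, b, gamma):
    """Gaussian kernel; gamma multiplies the squared distance (assumed convention)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def fit_sphere(points, gamma, iters=200):
    """Approximate weights of the smallest enclosing sphere in feature space.

    Minimises beta^T K beta over the simplex (the no-outlier case) with
    Frank-Wolfe steps and exact line search.
    """
    n = len(points)
    K = [[rbf(p, q, gamma) for q in points] for p in points]
    beta = [1.0 / n] * n
    for _ in range(iters):
        Kb = [sum(K[i][j] * beta[j] for j in range(n)) for i in range(n)]
        bKb = sum(beta[i] * Kb[i] for i in range(n))
        i = min(range(n), key=Kb.__getitem__)   # vertex with the smallest gradient
        denom = 1.0 - 2.0 * Kb[i] + bKb         # curvature along the step direction
        if denom <= 1e-15:
            break
        t = min(1.0, max(0.0, (bKb - Kb[i]) / denom))
        beta = [(1.0 - t) * b for b in beta]
        beta[i] += t
    return beta

def svc_labels(points, gamma, samples=9, tol=1e-9):
    beta = fit_sphere(points, gamma)

    def score(y):
        # Higher score = closer to the sphere centre in feature space.
        return sum(b * rbf(x, y, gamma) for b, x in zip(beta, points))

    # The sphere radius is set by the data point farthest from the centre.
    threshold = min(score(p) for p in points) - tol

    def connected(a, b):
        # Two points share a cluster if the sampled segment between them stays inside.
        for k in range(1, samples + 1):
            t = k / (samples + 1)
            y = [u + t * (v - u) for u, v in zip(a, b)]
            if score(y) < threshold:
                return False
        return True

    labels = [-1] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:                            # flood-fill one connected component
            a = stack.pop()
            for b in range(len(points)):
                if labels[b] == -1 and connected(points[a], points[b]):
                    labels[b] = current
                    stack.append(b)
        current += 1
    return labels

# Toy data: two tight groups of four points, far apart.
square = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3), (0.3, 0.3)]
points = square + [(x + 20.0, y) for x, y in square]

wide = svc_labels(points, gamma=0.001)   # wide kernel
narrow = svc_labels(points, gamma=1.0)   # narrow kernel
print(len(set(wide)), len(set(narrow)))  # prints: 1 2
```

With the wide kernel the whole data set falls inside one contour, which is the single-cluster result from the first post; narrowing the kernel splits the contour in two. The cluster count is therefore a consequence of the kernel width (and, in the full algorithm, of the outlier fraction), not something that is set directly.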
  • Muhammed_Fatih_ Member Posts: 93 Maven
    Hello @sara20

    thank you for your feedback!

    I've already evaluated the number of clusters with the k-means clustering approach. I agree that this should be the first step before investigating other clustering techniques.

    With this in mind, I then wanted to apply SVC to analyze how many clusters SVC would detect. As mentioned above, SVC detected two clusters (one of them very small) with the default parameter settings, whereas k-means detected 7 clusters. This discrepancy confused me a bit.

    Hence my question whether this could be an issue of parameter optimization, given that I am working with a high-dimensional data set. Unfortunately, the paper by Ben-Hur et al. (2001) does not evaluate varying parameter settings. It is therefore not clear which parameter settings would be appropriate for my data.

    Which settings would you choose for a data set with 70,000 examples (rows) and 8,000 attributes (columns)?  

    Best regards!
  • MarcoBarradas Administrator, Employee, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
    @Muhammed_Fatih_ what type of preprocessing is done on the high-dimensional data set? Are all of those attributes adding value to the clustering? It is my understanding that you could and should reduce the number of attributes used before applying any clustering techniques.
    Remove correlated attributes and use PCA to understand which attributes explain the variance of your data. Maybe you'll end up working with less than 30% of the initial attributes.
    If you want to "play" with the parameters and understand whether any change to them affects the number of clusters returned, then use the Optimize Parameters operator and define some ranges for the parameters. This way you can test a wide range of configurations and see if they have any impact on your data. 
    If you have a label (not used for clustering) on your data, you could then use the Weight of Evidence operator to transform the values of some of your numerical attributes so that the separation increases.
    Don't forget to apply Normalization to your numerical data, since outliers affect clusters because of the way the centroids of the clusters are found.
    Hope this information is useful.
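Why normalization matters for a Gaussian-kernel method can be seen in a small sketch: the kernel depends on Euclidean distance, so an attribute measured on a large scale dominates unless every attribute is rescaled first. The helper name zscore_columns and the toy values are made up for illustration; this is plain Python, not a RapidMiner process.

```python
import math
import statistics

def zscore_columns(rows):
    """Rescale every column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # A constant column has standard deviation 0; divide by 1 so it becomes all zeros.
    stdevs = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stdevs)] for row in rows]

# Toy data: first attribute in the thousands, second between 0 and 1.
rows = [[1000.0, 0.1], [1010.0, 0.9], [2000.0, 0.2], [2010.0, 0.8]]
scaled = zscore_columns(rows)

# Raw distances are driven almost entirely by the first attribute;
# after scaling, both attributes contribute.
print("raw:   ", round(math.dist(rows[0], rows[1]), 3), round(math.dist(rows[0], rows[2]), 3))
print("scaled:", round(math.dist(scaled[0], scaled[1]), 3), round(math.dist(scaled[0], scaled[2]), 3))
```

The same reasoning applies after PCA: if the components keep very different variances, the leading ones will dominate the kernel unless they are rescaled.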
     
  • Muhammed_Fatih_ Member Posts: 93 Maven
    Hello @MarcoBarradas

    very important and useful insights, some of which I have already implemented. I applied PCA to the raw data set and obtained the data described above.

    Your recommendation regarding the Optimize Parameters operator is a very good one. This again raises the question of which parameters should be optimized, leaving the running time aside. As @sara20 mentioned: 

    As the width parameter of the Gaussian kernel is decreased, the number of disconnected contours in data space increases, leading to an increasing number of clusters.
    Hence, gamma could be an option, but how many steps would be appropriate when varying it for SVC, given that its default value is 1.0? On the other hand, the tutorial process provided by RapidMiner sets gamma to 0.005. According to which criteria? 

    Which SVC parameters would you optimize, and with which step sizes? 

    Best regards!
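One common rule of thumb for a Gaussian kernel (an assumption here, not something the RapidMiner documentation in this thread prescribes) is to centre the search range for gamma on the inverse of the median squared distance between examples and to vary it on a logarithmic scale rather than in equal steps. The sketch below uses synthetic data and made-up helper names (median_sq_distance, gamma_grid); it also assumes that gamma multiplies the squared distance in the kernel.

```python
import math
import random
import statistics
from itertools import combinations

def median_sq_distance(rows):
    """Median of the squared Euclidean distances over all pairs of rows."""
    return statistics.median(
        sum((x - y) ** 2 for x, y in zip(a, b)) for a, b in combinations(rows, 2)
    )

def gamma_grid(rows, decades=2, steps_per_decade=2):
    """Log-spaced gamma values centred on 1 / median squared distance."""
    centre = 1.0 / median_sq_distance(rows)
    n = decades * steps_per_decade
    return [centre * 10 ** (k / steps_per_decade) for k in range(-n, n + 1)]

# Synthetic stand-in for a normalized sample: 50 examples, 200 attributes.
rng = random.Random(0)
sample = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(50)]

d2 = median_sq_distance(sample)
grid = gamma_grid(sample)
for gamma in grid:
    # Kernel value for a "typical" pair: near 1 means all examples look alike,
    # near 0 means every example looks isolated.
    print(f"gamma={gamma:.2e}  typical kernel={math.exp(-gamma * d2):.3f}")
```

For standardized attributes the squared distance between two examples grows roughly with twice the number of attributes, so with thousands of attributes a gamma of 1.0 drives almost every kernel value to zero. That is one plausible reason why a much smaller value is needed for high-dimensional data, and why a grid computed on a random sample of the data is a more useful starting range than fixed steps between 0.005 and 1.0.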
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    This is hard to say in the abstract because the clustering is very dependent on your data.
    But if you read through the help text of the SVC operator, you will find that two parameters that are highly significant in determining the number of clusters are p, the proportion of outliers allowed, and r, the target radius of the clusters.  Their default settings may not be giving you the optimal number of clusters.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • sara20 Member Posts: 110 Unicorn
    edited July 2020
    Hi all

    @Muhammed_Fatih_,

    I agree with everyone: the number of clusters depends on your data.

    I hope this helps
    Sara
  • Muhammed_Fatih_ Member Posts: 93 Maven
    Dear all and @sara20, @Telcontar120 and @MarcoBarradas

    thank you for your feedback. Optimizing the mentioned parameters seems to be an appropriate way of determining them. Related to this, is there an evaluation measure that is suitable for SVC? As far as I know, none is implemented in RapidMiner. Can you confirm this? 

    Best regards!  
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Well, there actually are several performance operators for clusters, such as Cluster Distance Performance and Cluster Density Performance.  You might want to check those out.  But the problem with unsupervised ML in general is that there is no clear "correct" answer, so the "best" cluster performance is somewhat in the eye of the beholder.   
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Muhammed_Fatih_ Member Posts: 93 Maven
    Hi @Telcontar120  

    are the operators "Cluster distance performance" and "Cluster density performance" applicable to SVC?

    For example, the documentation states the following: "This operator is used for performance evaluation of centroid based clustering methods." However, SVC is not a centroid-based clustering approach, and the second operator is meant for density-based clusters, which SVC does not belong to either. 

    Do the two performance operators fit SVC anyway? 

    Best regards
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Yes, you are correct.  Sorry, I thought you were asking about clustering performance operators in RapidMiner in general.  I am not aware of a performance operator for SVC other than the generic Cluster Count operator, which is not really all that useful.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Muhammed_Fatih_ Member Posts: 93 Maven
    Is there anybody else who can recommend a performance evaluation for SVC? 
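One measure that needs only the cluster labels and pairwise distances, and therefore no centroids, is the silhouette coefficient. Whether a ready-made operator for it is available in RapidMiner is not confirmed in this thread; the sketch below is plain Python with a made-up function name (mean_silhouette) and toy data, and is meant as an illustration of the idea rather than a recommendation from the posters above.

```python
import math

def mean_silhouette(points, labels):
    """Average silhouette over all points; needs at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette is undefined for a single cluster")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_silhouette(points, [0, 0, 1, 1])   # labels follow the two groups
bad = mean_silhouette(points, [0, 1, 0, 1])    # labels cut across the groups
print(round(good, 3), round(bad, 3))
```

Values close to 1 indicate compact, well-separated clusters, and negative values indicate points that sit closer to another cluster than to their own. A run that returns a single cluster, as in the first post, cannot be scored this way. For 70,000 examples the pairwise distances are expensive, so the score would normally be computed on a random sample.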