"Support Vector Clustering"

vijaypshah Member Posts: 30 Maven
edited May 2019 in Help
Hi,
I am trying to cluster 20,000 samples using support vector machines. It takes around 48 hours to get the clustering result. How can I optimize this process to get good results within an acceptable time limit (say 15-20 minutes)?

Regards,
Vijay

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    you have at least two options:
    • reduce the maximum number of iterations from 100000 to a smaller value, say 1000. This might of course affect the quality of the output.
    • increase the size of the kernel_cache (for 20000 examples you would need about 3 GB of memory to cache the full kernel matrix). Try larger values and, if necessary and possible, increase the amount of memory available to RapidMiner. This should give a large speed-up without losing quality.
    Cheers,
    Ingo
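The figure of about 3 GB in the reply above can be checked with a quick calculation. The sketch below assumes a dense n × n kernel matrix stored as 8-byte double-precision values; the entry size and the helper name `full_kernel_matrix_bytes` are assumptions for illustration, not details taken from RapidMiner.

```python
def full_kernel_matrix_bytes(n_examples, bytes_per_entry=8):
    """Memory needed to cache a dense n x n kernel matrix.

    Assumes one 8-byte double per entry (an assumption, not a RapidMiner fact).
    """
    return n_examples * n_examples * bytes_per_entry


n = 20_000
total_bytes = full_kernel_matrix_bytes(n)
print(f"{total_bytes / 1e9:.1f} GB")     # decimal gigabytes, prints "3.2 GB"
print(f"{total_bytes / 2**30:.2f} GiB")  # binary gibibytes, prints "2.98 GiB"
```

Because the requirement grows with the square of the number of examples, halving the data set cuts the full-cache size to a quarter under the same assumptions.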
  • vijaypshah Member Posts: 30 Maven
    I have had one process running for 24 hours now. I reduced the maximum number of iterations to 1000 and increased the memory to 2 GB, but I still haven't got any results. I should be able to go to 3 GB since I have a 64-bit machine with 4 GB of RAM. Probably time to get more memory!

    Is it possible to subset the data into smaller chunks and then do the clustering, and combine the final clustering results?

    Regards,
    Vijay
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    besides increasing the total memory available to RM, you probably also have to increase the memory defined by the kernel_cache parameter. However, since memory prices are rather low at the moment, increasing the total amount of memory is probably the simplest option if you have a 64-bit system anyway.

    Is it possible to subset the data into smaller chunks and then do the clustering, and combine the final clustering results?
    In principle yes. You could, for example, use the cross validation operator (with a dummy learner) for sampling by placing an ExampleSetWriter with the macro %{a} in the filename to build k disjoint parts of your data. Then apply the clustering to each part individually and merge the results with the operator ExampleSetMerge. It might, however, be necessary to remap the cluster labels appropriately beforehand.

    Cheers,
    Ingo
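Outside RapidMiner, the split, cluster, and merge idea from the reply above can be sketched in a few lines of Python. This is only an illustration under assumptions: `split_into_disjoint_parts`, `merge_chunk_labels`, and `dummy_cluster` are hypothetical names, and `dummy_cluster` is a placeholder for the real clustering step, not a clustering algorithm.

```python
import random


def split_into_disjoint_parts(n_examples, k, seed=0):
    """Shuffle the example indices and deal them into k disjoint parts."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def merge_chunk_labels(parts, labels_per_part):
    """Merge per-chunk labels into one list ordered by example index.

    Each label becomes a (chunk_number, local_label) pair, so "cluster 1"
    of one chunk is never silently confused with "cluster 1" of another.
    """
    merged = {}
    for chunk_no, (part, labels) in enumerate(zip(parts, labels_per_part)):
        for idx, label in zip(part, labels):
            merged[idx] = (chunk_no, label)
    return [merged[i] for i in range(len(merged))]


def dummy_cluster(chunk_indices):
    # Placeholder for the real clustering step: labels by index parity.
    return [i % 2 for i in chunk_indices]


parts = split_into_disjoint_parts(10, k=3)
labels = [dummy_cluster(part) for part in parts]
merged = merge_chunk_labels(parts, labels)
```

Keeping the chunk number in the merged label postpones the remapping question: the merged result is valid as it stands, and chunk-local labels can be unified later once a correspondence between chunks has been decided.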
  • vijaypshah Member Posts: 30 Maven
    Yes, I set the kernel_cache parameter to 2 GB, but I will try 3 GB once the current process has finished.
    mierswa wrote:


    It might however be necessary to remap the cluster labels appropriately beforehand.

    Is there any operator that helps to remap the cluster labels? An example XML might be useful...

    Regards,
    Vijay
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    Is there any operator that helps to remap the cluster labels? An example XML might be useful...
    this cannot currently be done automatically, but you can use the operator "AttributeValueMapper" for this purpose.

    Cheers,
    Ingo
  • vijaypshah Member Posts: 30 Maven
    Before using AttributeValueMapper it might be necessary to decide on the cluster numbers, i.e. how does one know that cluster no. 1 in example set 1 corresponds to, say, cluster no. 4 in example set 2? I will google to see if I can find more information.
    Thanks for the valuable input.

    Regards,
    Vijay
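One possible heuristic for the correspondence question above is to compare cluster centroids across the two example sets. The sketch below is an assumption-laden illustration rather than anything RapidMiner provides: the function names are hypothetical, the distance is plain Euclidean, and centroids can be a poor summary for the irregularly shaped clusters that support vector clustering may produce.

```python
import math


def centroids(points, labels):
    """Mean point of each cluster, keyed by cluster label."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for dim, value in enumerate(point):
            acc[dim] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def match_by_nearest_centroid(cents_a, cents_b):
    """Map each label in set A to the label in set B with the closest centroid."""
    return {
        label_a: min(cents_b, key=lambda label_b: math.dist(cent_a, cents_b[label_b]))
        for label_a, cent_a in cents_a.items()
    }


# Two disjoint example sets with independently numbered clusters (made-up data).
points_1 = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels_1 = [1, 1, 2, 2]
points_2 = [(10, 10.5), (9.5, 10), (0.2, 0.3), (0, 0.5)]
labels_2 = [7, 7, 4, 4]

mapping = match_by_nearest_centroid(
    centroids(points_1, labels_1), centroids(points_2, labels_2)
)
print(mapping)  # {1: 4, 2: 7}
```

The resulting dictionary is the kind of value mapping that would then be entered by hand into a value-mapping step; note that this matching is not forced to be one-to-one, so two clusters in the first set may map to the same cluster in the second.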
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Yes, that's the problem. Right now we don't have an operator for that, but it would probably be a good idea to write a general operator that maps each group to the best-matching group of another attribute, based on the data points in those groups. This could also be useful for cluster evaluation, by comparing the clusters found to predefined groups.

    Cheers,
    Ingo
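The general operator described above does not exist in the thread, but its idea can be sketched in Python. The version below assumes both labelings cover the same examples and maps each cluster to the reference group it shares the most examples with; the function names are hypothetical, and a simple majority vote is just one way to define "best matching".

```python
from collections import Counter


def map_to_best_matching_group(cluster_labels, reference_labels):
    """Map each cluster to the reference group sharing the most examples with it."""
    overlap = {}
    for cluster, reference in zip(cluster_labels, reference_labels):
        overlap.setdefault(cluster, Counter())[reference] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in overlap.items()}


def agreement(cluster_labels, reference_labels, mapping):
    """Fraction of examples whose mapped cluster equals the reference group."""
    hits = sum(mapping[c] == r for c, r in zip(cluster_labels, reference_labels))
    return hits / len(cluster_labels)


# Made-up labels: clusters found by an algorithm vs. predefined groups.
found = [0, 0, 0, 1, 1, 2, 2, 2]
predefined = ["a", "a", "b", "b", "b", "c", "c", "c"]

mapping = map_to_best_matching_group(found, predefined)
print(mapping)                                # {0: 'a', 1: 'b', 2: 'c'}
print(agreement(found, predefined, mapping))  # 0.875
```

The agreement score illustrates the evaluation use mentioned in the reply: once clusters are mapped onto predefined groups, the share of matching examples gives a rough quality measure.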
  • vijaypshah Member Posts: 30 Maven
    A few months ago I read a few papers on different techniques used for combining clusters based on a similarity metric, but I had concerns about those approaches too.
    I will post links to those papers if I come across them again.
  • TobiasMalbrecht Moderator, Employee, Member Posts: 294 RM Product Management
    Hi Vijay,

    yes, we would appreciate that. If our schedule allows, we will certainly have a look at these approaches.

    Regards,
    Tobias
  • jdouet Member Posts: 19 Maven
    Hello Vijay, Hello Tobias, Hello Ingo,

    Working with RM 4.2...

    I have a problem with support vector clustering, or more precisely with these three operators:
    • Support Vector Clustering (here it is!): the output is not recognised as either a flat cluster model or a hierarchical one
    • KernelKMeans: same problem; moreover, the "neural" choice is missing from "choose the kernel type"
    • FlattenClusterModel: when "performance?" is true, checking the experiment's syntax does not recognise the "performance vector" it produces
    I wanted to use one of these operators in an experiment containing the following code:
    <operator name="analyse" class="OperatorChain" expanded="yes">
        <operator name="EvolutionaryParameterOptimization" class="EvolutionaryParameterOptimization" expanded="yes">
            <list key="parameters">
                <parameter key="KernelKMeans.kernel_degree" value="[0.0;2.147483647E9]"/>
                <parameter key="KernelKMeans.k" value="[2.0;2.147483647E9]"/>
            </list>
            <operator name="KernelKMeans" class="KernelKMeans">
                <parameter key="add_cluster_attribute" value="false"/>
                <parameter key="kernel_type" value="KernelPolynomial"/>
            </operator>
            <operator name="ItemDistributionEvaluator" class="ItemDistributionEvaluator">
                <parameter key="keep_flat_cluster_model" value="false"/>
                <parameter key="measure" value="SumOfSquares"/>
            </operator>
        </operator>
    </operator>

    Can you reproduce this behaviour?
    Cheers,
      Jean-Charles.