Time Optimization

ratheesan (Member, Posts: 68, Maven)
Hi,
I am running KMedoids clustering on 1.7 MB of text data, but it has been running for the last three and a half days. The other operators took only 10 minutes; KMedoids accounts for all the remaining time. Is there any way to optimize the process? The process is shown below.

<operator name="Root" class="Process" expanded="yes">
    <description text="#ylt#h3#ygt#Optimizing vector creation for text classification#ylt#/h3#ygt##ylt#p#ygt#This experiments shows how to apply a cross validation to a classifier that learns to separate two sets of texts.#ylt#/p#ygt#"/>
    <operator name="ExcelExampleSource" class="ExcelExampleSource">
        <parameter key="excel_file" value="C:\Documents and Settings\ADMIN\Desktop\data1.xls"/>
        <parameter key="first_row_as_names" value="true"/>
    </operator>
    <operator name="Nominal2String" class="Nominal2String">
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="yes">
        <list key="namespaces">
        </list>
        <operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
        </operator>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
        </operator>
    </operator>
    <operator name="KMedoids" class="KMedoids">
        <parameter key="k" value="25"/>
    </operator>
    <operator name="AttributeFilter" class="AttributeFilter">
        <parameter key="condition_class" value="is_nominal"/>
    </operator>
    <operator name="ExcelExampleSetWriter" class="ExcelExampleSetWriter">
        <parameter key="excel_file" value="C:\Documents and Settings\ADMIN\Desktop\cluster1.xls"/>
    </operator>
</operator>

Thanks
Ratheesan

Answers

  • land (RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 2,531, Unicorn)
    Hi,
    Unfortunately, it takes time to calculate all the distances needed. One hint: it might be useful to switch to CosineSimilarity, which is more suitable for text mining than Euclidean distance.

    Greetings,
      Sebastian
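
To make the cost concrete: if every pair of examples has to be compared, the number of distances grows quadratically with the number of texts, and each distance touches every term attribute. The sketch below is not RapidMiner code. It is a minimal Python illustration with hypothetical helper names (`pairwise_count`, `cosine_similarity`, `euclidean_distance`) and made-up toy vectors. It shows the pair count for an assumed 1000 texts, and why cosine similarity tends to suit term vectors: it ignores document length, while Euclidean distance does not.

```python
import math

def pairwise_count(n):
    """Number of unordered pairs among n examples: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy term-count vectors (made up): doc_long is doc_short with every count tripled,
# as if the same text had been repeated three times.
doc_short = [1.0, 2.0, 0.0]
doc_long = [3.0, 6.0, 0.0]

print(pairwise_count(1000))                     # pairs to compare for 1000 texts
print(cosine_similarity(doc_short, doc_long))   # close to 1: same direction
print(euclidean_distance(doc_short, doc_long))  # clearly non-zero: lengths differ
```

For 1000 texts this gives 499500 pairs, which is why reducing the number of examples or attributes before clustering usually has a larger effect than tuning other operators.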
  • ratheesan
    Thanks Sebastian,
    Suppose I use the RM Enterprise Edition: will it take the same amount of time as with the RM Community version?

    Thanks
    Ratheesan
  • land
    Hi,
    We have parallelized many important operators for the Enterprise Edition, but KMedoids is not among them. For the price of an Enterprise Edition, though, we could write you a parallelized KMedoids. One could even think about optimizing the operator for small example sets with many attributes, as is common in text mining tasks.

    Greetings,
      Sebastian
  • ratheesan
    Hi Sebastian,

    I have tried the above process with cosine similarity, but I always get the message "There is no obvious error, check the log file". Before applying KMedoids I used the AttributeFilter operator to select only the numerical attributes, because in KMedoids only the NumericalMeasures type provides CosineSimilarity.

    Thanks
    Ratheesan
  • land
    Hi,
    Please send me your process. I will check whether there is a bug.

    Greetings,
      Sebastian
  • ratheesan
    Hi Sebastian,

    Thanks for your valuable help. This is my process:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExcelExampleSource" class="ExcelExampleSource">
            <parameter key="excel_file" value="C:\Documents and Settings\ADMIN\Desktop\data1.xls"/>
            <parameter key="first_row_as_names" value="true"/>
        </operator>
        <operator name="Nominal2String" class="Nominal2String">
        </operator>
        <operator name="StringTextInput" class="StringTextInput" expanded="yes">
            <list key="namespaces">
            </list>
            <operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
            </operator>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
            </operator>
            <operator name="TokenLengthFilter" class="TokenLengthFilter">
            </operator>
            <operator name="PorterStemmer" class="PorterStemmer">
            </operator>
        </operator>
        <operator name="AttributeFilter" class="AttributeFilter">
            <parameter key="condition_class" value="is_numerical"/>
            <parameter key="parameter_string" value="sample"/>
            <parameter key="apply_on_special" value="true"/>
        </operator>
        <operator name="KMedoids" class="KMedoids">
            <parameter key="k" value="3"/>
            <parameter key="max_runs" value="5"/>
            <parameter key="max_optimization_steps" value="10"/>
            <parameter key="measure_types" value="NumericalMeasures"/>
            <parameter key="numerical_measure" value="CosineSimilarity"/>
        </operator>
        <operator name="ExcelExampleSetWriter" class="ExcelExampleSetWriter">
            <parameter key="excel_file" value="C:\Documents and Settings\ADMIN\Desktop\modelcluster.xls"/>
        </operator>
    </operator>


    With up to 250 records it works properly, but with more than 250 records I get the above message.

    Thanks
    Ratheesan.

  • land
    Hi,
    The process runs fine here. I used 722 texts and there was no error, at least not in the first few minutes of the KMedoids run.

    Of course I don't have exactly the same setup, because I'm using different texts. I suggest you switch RapidMiner to debug mode so that you can post the detailed error message. Go to the Tools menu and select Preferences, then enable the rapidminer.general.debugmode checkbox in the General tab.
    Then please re-execute the process and send me the error message.

    Greetings,
      Sebastian
  • ratheesan
    Hi Sebastian,

    I re-executed the process after switching to debug mode. The error message is attached below.

      Root[1] (Process)
              +- ExcelExampleSource[1] (ExcelExampleSource)
              +- Nominal2String[1] (Nominal2String)
              +- StringTextInput[1] (StringTextInput)
              |  +- ToLowerCaseConverter[600] (ToLowerCaseConverter)
              |  +- StringTokenizer[600] (StringTokenizer)
              |  +- EnglishStopwordFilter[600] (EnglishStopwordFilter)
              |  +- TokenLengthFilter[600] (TokenLengthFilter)
              +- AttributeFilter (2)[1] (AttributeFilter)
    here ==> +- KMedoids[1] (KMedoids)
    java.lang.NullPointerException
        at com.rapidminer.operator.clustering.clusterer.KMedoids.generateClusterModel(KMedoids.java:176)
        at com.rapidminer.operator.clustering.clusterer.AbstractClusterer.apply(AbstractClusterer.java:60)
        at com.rapidminer.operator.Operator.apply(Operator.java:671)
        at com.rapidminer.operator.OperatorChain.apply(OperatorChain.java:424)
        at com.rapidminer.operator.Operator.apply(Operator.java:671)
        at com.rapidminer.Process.run(Process.java:735)
        at com.rapidminer.Process.run(Process.java:704)
        at com.rapidminer.Process.run(Process.java:694)
        at com.rapidminer.gui.ProcessThread.run(ProcessThread.java:59)

    Thanks
    Ratheesan.
  • land
    Hi,
    That's quite strange. The distance measure seems to return NaN; that is the only way I can see this happening.
    Unfortunately I cannot debug this in more detail, because I can't reproduce the error. Do you have any missing values in your data?

    Greetings,
      Sebastian
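
One way a cosine measure can yield NaN even without missing values is an all-zero vector, for example a text whose tokens were all removed by the stop word and token length filters. Whether that is what happened in this process is an assumption; nothing in the thread confirms it. The Python sketch below uses hypothetical helpers (`cosine_similarity`, `find_zero_vectors`) and made-up rows. It returns NaN explicitly for a zero norm to mirror what the floating-point division 0.0 / 0.0 produces in Java (plain Python would raise ZeroDivisionError instead).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity; NaN if either vector has zero length (mirrors 0.0 / 0.0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_product = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm_product == 0.0:
        return float("nan")
    return dot / norm_product

def find_zero_vectors(rows):
    """Indices of rows in which every attribute value is zero."""
    return [i for i, row in enumerate(rows) if all(v == 0.0 for v in row)]

# Made-up term vectors: the second row stands for a text with no remaining tokens.
rows = [
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

sim = cosine_similarity(rows[0], rows[1])
print(math.isnan(sim))          # True: the zero row makes the measure undefined
print(sim < 1.0, sim > 1.0)     # False False: every comparison with NaN is false
print(find_zero_vectors(rows))  # [1]: rows to inspect or drop before clustering
```

Because every comparison against NaN is false, a loop that looks for the closest medoid could finish without ever selecting one. Checking the exported example set for all-zero rows is a cheap way to test this hypothesis.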
  • ratheesan
    Hi Sebastian,

    I have no missing values in the data. However, I do get output when using Dice similarity. Is Dice similarity a meaningful choice for text mining?

    Thanks
    Ratheesan.
  • land
    Hi,
    This is a forum, not a consulting service or a course. I cannot answer EACH question regarding this or that algorithm or measure. Just try it out yourself. In fact, nobody can say in general what a good measure or algorithm is, because that always depends on the data, and I don't have your data.

    Greetings,
      Sebastian