"text mining and EMCluster"

rdmckinney Member Posts: 15  Maven
edited May 23 in Help
I’m having some issues with the EMClustering operator. I am using StringTextInput to do some text mining. That operator sends about 1,700 variables to the SVDReduction operator which then reduces the data to 15 variables. Then the EMClustering operator attempts to cluster about 500 examples based on the 15 variables from SVDReduction.

There seems to be a trade-off between the number of clusters I can request in EMClustering and the number of variables output by SVDReduction. If I ask EMClustering for just 5 clusters, then I can have SVDReduction output as many as 25 variables. But if I ask EMClustering for 10 clusters, the max number of variables it will accept from SVDReduction is 15. If SVDReduction provides more, say 20 variables, then I get the error message below. I have tried increasing max_runs and max_optimization_steps, and that helps a little, but it doesn't greatly increase the number of variables that EMClustering will accept as input.

Currently, I’m asking for 10 clusters from EMClustering, with 10 max_runs and 200 max_optimization_steps. The max number of variables that EMClustering will accept from SVDReduction without the fatal error is 10. Any thoughts on this?

I frequently get this error: "Error: Can't compute the covariance of the matrix. Maybe the matrix is singular. Changing option "correlated_attributes" to false." But when I select OK, the program finishes and I get one cluster with every example in it.
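
One possible reading of the singular-matrix message, offered as an assumption since the thread does not confirm the cause: a sample covariance matrix estimated from n points has rank at most n - 1, so any cluster that ends up with no more examples than there are SVD dimensions cannot have an invertible covariance matrix. With about 500 examples split over 10 clusters, small clusters become more likely as the number of dimensions grows. The sketch below is plain Python, not RapidMiner code, and the helper names (sample_covariance, matrix_rank, covariance_rank) are hypothetical; it only illustrates the rank limit on random data.

```python
import random


def sample_covariance(points):
    """Return the d x d sample covariance matrix of a list of d-dimensional points."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    return [[sum(row[a] * row[b] for row in centered) / (n - 1)
             for b in range(d)] for a in range(d)]


def matrix_rank(matrix, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    rows, cols = len(m), len(m[0])
    rank = 0
    for col in range(cols):
        if rank == rows:
            break
        # Pick the largest remaining entry in this column as the pivot.
        pivot = max(range(rank, rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot: this column adds nothing to the rank
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


def covariance_rank(n_points, n_dims, seed=0):
    """Rank of the covariance of n_points random Gaussian points in n_dims dimensions."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(n_dims)]
              for _ in range(n_points)]
    return matrix_rank(sample_covariance(points))


if __name__ == "__main__":
    # A cluster with fewer points than dimensions versus one with plenty of points.
    for n_points in (10, 50):
        print(n_points, "points, 15 dims, rank:", covariance_rank(n_points, 15))
```

If this is what is happening, fewer clusters or fewer SVD dimensions would leave more examples per cluster relative to the number of dimensions, which would match the trade-off described above.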


G Jul 17, 2009 9:20:56 AM: [Fatal] NullPointerException occured in 1st application of EMClustering (EMClustering)
G Jul 17, 2009 9:20:56 AM: [Fatal] Process failed: operator cannot be executed. Check the log messages...
         Root[1] (Process)
         +- ExampleSource[1] (ExampleSource)
         +- StringTextInput[1] (StringTextInput)
         |  +- StringTokenizer[940] (StringTokenizer)
         |  +- EnglishStopwordFilter[940] (EnglishStopwordFilter)
         |  +- TokenLengthFilter[940] (TokenLengthFilter)
         |  +- PorterStemmer[940] (PorterStemmer)
         +- SVDReduction[1] (SVDReduction)
here ==> +- EMClustering[1] (EMClustering)
         +- ExcelExampleSetWriter[0] (ExcelExampleSetWriter)
<operator name="Root" class="Process" expanded="yes">
   <parameter key="logverbosity" value="error"/>
   <operator name="ExampleSource" class="ExampleSource">
       <parameter key="attributes" value="C:\Documents and Settings\rkenney\My Documents\rm_workspace\Comments09_2.aml"/>
   </operator>
   <operator name="StringTextInput" class="StringTextInput" expanded="yes">
       <parameter key="remove_original_attributes" value="true"/>
       <parameter key="default_content_language" value="english"/>
       <list key="namespaces">
       </list>
       <operator name="StringTokenizer" class="StringTokenizer">
       </operator>
       <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
       </operator>
       <operator name="TokenLengthFilter" class="TokenLengthFilter">
           <parameter key="min_chars" value="3"/>
       </operator>
       <operator name="PorterStemmer" class="PorterStemmer">
       </operator>
   </operator>
   <operator name="SVDReduction" class="SVDReduction">
       <parameter key="keep_example_set" value="true"/>
       <parameter key="return_preprocessing_model" value="true"/>
       <parameter key="dimensions" value="15"/>
   </operator>
   <operator name="EMClustering" class="EMClustering">
       <parameter key="k" value="10"/>
       <parameter key="max_runs" value="30"/>
       <parameter key="max_optimization_steps" value="200"/>
   </operator>
   <operator name="ExcelExampleSetWriter" class="ExcelExampleSetWriter">
       <parameter key="excel_file" value="C:\Projects\Memb Sat Survey\2009\Data\RapidMinerOutput\RMClusters.xls"/>
   </operator>
</operator>

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,525   Unicorn
    Hi,
    would it be possible to send me the reduced data set? Then I would be able to reproduce the error and take a look at the code.

    Greetings,
      Sebastian
  • rdmckinney Member Posts: 15  Maven
    Sebastian, thanks! How would you like me to send it, and in what format? Also, do you want the output from the StringTextInput operator or the SVDReduction operator?
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,525   Unicorn
    Hi,
    please save the example set produced by SVDReduction using the ExampleSetWriter, then either upload it somewhere and share the link with me, or compress it and send it by mail to my email address.

    Greetings,
      Sebastian


  • rdmckinney Member Posts: 15  Maven
    I'll have to send it by email. What email address do you want me to use?
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,525   Unicorn
    I'll send it by pm...

    Greetings,
      Sebastian