Problems with Auto Model Cluster Analysis

Terpdog Member, University Professor Posts: 15 University Professor
edited May 2020 in Help
"I am using Auto Model to do a k-means cluster analysis. It works fine for 2 clusters. With 3 or more clusters, one or more clusters has an average distance of ? and a Davies-Bouldin index of infinity. This appeared before and I thought Version 9.6 had fixed it, but apparently not. It also appears in the beta of 9.7. Is there a way around this? Thanks."

Answers

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @Terpdog,

    Can you share your data so that we can reproduce and understand what's going on?

    Regards,

    Lionel
  • Terpdog Member, University Professor Posts: 15 University Professor
    I am not sure which files are needed, but I have attached the only RapidMiner file I could find and also an Excel file of the data. I was using only the first four variables for the cluster analysis.
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @Terpdog,

    Thank you for sharing your data.
    I can reproduce what you observe.

    But there is something strange in Auto-Model itself, because
    if I use your data (only the first four variables) with a k-Means model (with k = 3, 4, etc.) in a classic RapidMiner process,
    the results are correct (i.e., I obtain finite values for the DB index and the average distances).

    Does anyone have an idea of what's going on in Auto-Model (clustering)?

    The classic (working) RapidMiner process is in the attached file.

    Regards,

    Lionel
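For anyone who wants to cross-check the two measures outside RapidMiner, here is a minimal sketch in plain Python of the average within-cluster distance and the Davies-Bouldin index as they are usually defined. The function names and the toy data are assumptions for illustration only; this is not RapidMiner's implementation and may differ from it in details such as the distance measure or the sign convention.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def davies_bouldin(points, labels):
    """Return (avg_distances, db_index) for a hard clustering.

    avg_distances maps each label to the mean Euclidean distance
    between that cluster's points and its centroid.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(tuple(p))
    if len(clusters) < 2:
        raise ValueError("the Davies-Bouldin index needs at least two clusters")

    centroids = {lab: centroid(pts) for lab, pts in clusters.items()}
    avg = {
        lab: sum(math.dist(p, centroids[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }

    worst_ratios = []
    for i in clusters:
        ratios = []
        for j in clusters:
            if i == j:
                continue
            separation = math.dist(centroids[i], centroids[j])
            # By definition the ratio is unbounded when two centroids coincide.
            ratios.append((avg[i] + avg[j]) / separation if separation > 0 else math.inf)
        worst_ratios.append(max(ratios))
    return avg, sum(worst_ratios) / len(worst_ratios)


# Toy data (hypothetical): three well-separated clusters of two points each.
points = [(0, 0), (0, 2), (10, 0), (10, 2), (5, 10), (5, 12)]
labels = [0, 0, 1, 1, 2, 2]
avg_distances, db_index = davies_bouldin(points, labels)
print(avg_distances, db_index)
```

Feeding the first four variables and the cluster labels exported from a RapidMiner process into a function like this gives an independent set of finite values to compare against the Auto Model output.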



  • Terpdog Member, University Professor Posts: 15 University Professor
    edited May 2020
    Thanks, Lionel. I did not think to try the process route. There has to be a bug in the Auto-Model routine. Hopefully that can get fixed. There is still the question of why the distances are negative, which does not make sense.
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    @Terpdog,

    The "real" distances are, of course, positive.
    It seems to me that RapidMiner multiplies the distances by minus one (-1) so that it works with negative values, because
    RapidMiner's algorithms seek to MAXIMIZE these values (explanation to be confirmed by the RM staff, @sgenzer?).

    Regards,

    Lionel
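The sign-flip idea above can be sketched generically in Python: if a selection routine always maximizes its score, negating a distance makes the smallest distance win. This only illustrates the general trick and says nothing confirmed about RapidMiner's internals; the helper name and the numbers are hypothetical.

```python
def best_by_maximizing(scores):
    """Return the key whose score is largest (a maximize-only selector)."""
    return max(scores, key=scores.get)


# Hypothetical average within-cluster distances for k = 2, 3, 4.
avg_distance = {2: 3.1, 3: 2.4, 4: 2.9}

# Negate so that "larger is better" corresponds to "smaller distance".
negated = {k: -d for k, d in avg_distance.items()}

best_k = best_by_maximizing(negated)
print(best_k, negated[best_k], abs(negated[best_k]))
```

Under this convention the reported value is negative, and its absolute value is the actual distance.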
  • Terpdog Member, University Professor Posts: 15 University Professor
    That makes sense. I am continually frustrated at how hard it is to get routine statistics following an analysis in RapidMiner. I am trying to use it in my book, which discusses measures of fit in techniques such as cluster analysis, discriminant analysis, and logistic regression. Either I can't get RapidMiner to produce them, or it is so difficult that it would be of no use to students. I may have to drop the idea of using it. Too bad.