How to cluster missing values in one cluster?

LeMarc Member Posts: 72 Contributor II
edited April 2020 in Help
Hello,

I would like to get 2 clusters from a data set: one with the examples that have missing values and the other with the examples that don't have any. As most clustering algorithms do not allow missing values in the data set, the missing values could be replaced by, e.g., "0". However, even after that I'm clueless about what exactly to do next so that all examples with missing values end up in one cluster.
Can anyone help?

Thank you!

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,
    Why can't you just use a Filter Examples operator with "missing_attributes" as the filter condition? That should do the trick.

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • LeMarc Member Posts: 72 Contributor II

    Yes, that would be easy. It's just that my task is to cluster all the missing values without using the Filter Examples operator. :smile:
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,
    Well, you can set them to -100000 and then just cluster on that without normalizing. That should do the same trick.

    ~Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
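
A minimal sketch of the sentinel idea outside RapidMiner, assuming Python with NumPy and scikit-learn (neither appears in the thread). The toy data, the variable names, and the choice to put all missing values in a single attribute are illustrative assumptions; -100000 is the value suggested above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 rows, 3 numeric attributes on an ordinary scale.
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

# Assumption: the first 30 rows are missing their first attribute.
X[:30, 0] = np.nan
has_missing = np.isnan(X).any(axis=1)

# Replace every missing value with an extreme sentinel, without normalizing.
SENTINEL = -100000.0
X_filled = np.where(np.isnan(X), SENTINEL, X)

# With two clusters, the sentinel dominates the Euclidean distances.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_filled)

# Compare the cluster assignment with the known missing-value mask.
missing_cluster = labels[has_missing][0]
agreement = np.mean((labels == missing_cluster) == has_missing)
print(f"agreement with missing-value mask: {agreement:.2%}")
```

When missing values are spread over several attributes, the filled rows no longer sit in one spot, so two clusters may not isolate them as cleanly as in this single-attribute toy case.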
  • LeMarc Member Posts: 72 Contributor II
    Thank you @mschmitz, I will try it!
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    edited April 2020
    I am not sure whether replacing missing values with big numbers will cluster these examples together. However, I feel that you could train an SVM classifier with a radial kernel to separate them from the rest (the intuition being that they would all lie far from the centre of your data).
  • LeMarc Member Posts: 72 Contributor II
    @jacobcybulski Thank you for your input! I'm going to try your suggestion!
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    @LeMarc, I am not sure whether you have much experience with SVMs; if not, do not get discouraged if the initial results are very poor. You will need to run some optimisation of the SVM kernel hyper-parameters. The radial kernel may work here; if not, try the ANOVA kernel, which is more sophisticated and is commonly optimised on kernel.gamma, kernel.degree and C.
    Jacob
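
The hyper-parameter search described above could look roughly like this outside RapidMiner. This is a sketch assuming Python with scikit-learn, whose `SVC` has no built-in ANOVA kernel, so only the radial (RBF) kernel is shown. The toy data, the grid values, and the use of a "has a missing value" flag as the class label are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 rows, 3 numeric attributes.
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

# Simulate one missing value in each of the first 30 rows,
# cycling through the three attributes.
rows = np.arange(30)
X[rows, rows % 3] = np.nan

# Assumed target for the classifier: does the row contain a missing value?
y = np.isnan(X).any(axis=1).astype(int)

# Replace missing values with a large-magnitude number.
X_filled = np.where(np.isnan(X), -100000.0, X)

# Illustrative grid only; sensible ranges depend on the scale of the data.
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [1e-4, 1e-3, 1e-2]}

# 5-fold cross-validated grid search over C and gamma for an RBF-kernel SVM.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_filled, y)

print("best parameters:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
```

Because the features are not rescaled here, useful `gamma` values are small; on normalized data the grid would need to shift accordingly.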
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    The most common clustering algorithms are not really designed to handle missing values. So you may be able to "trick" these algorithms into creating a cluster by using artificially high or low values, but a better approach would be to use a different method altogether, one that is designed to actually separate cases and that can handle missing values more directly. Several of the earlier posts have recommended some of these approaches.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    @Telcontar120, indeed, I think this is a discovery project on how to handle missing values differently. The idea with the SVM was that if you replace missing values with some big numbers (at least for numerical attributes), then all examples with missing values will be pushed far from the data centre, in which case an SVM with a radial kernel may help isolate them.
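
The "pushed far from the data centre" intuition can be checked numerically. The sketch below assumes Python with NumPy, hypothetical toy data, and the per-attribute median as the estimate of the centre; none of these choices come from the thread.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 rows, 3 numeric attributes.
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

# Simulate one missing value in each of the first 30 rows.
rows = np.arange(30)
X[rows, rows % 3] = np.nan
has_missing = np.isnan(X).any(axis=1)

# Replace missing values with a big number.
X_filled = np.where(np.isnan(X), 100000.0, X)

# Assumption: use the per-attribute median as the centre of the data.
centre = np.median(X_filled, axis=0)

# Euclidean distance of every row from that centre.
dist = np.linalg.norm(X_filled - centre, axis=1)

print("closest row with a missing value:", dist[has_missing].min())
print("farthest complete row:", dist[~has_missing].max())
```

Every row with a replaced value lies far outside the range of the complete rows, which is the gap a radial kernel can exploit.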
  • LeMarc Member Posts: 72 Contributor II
    Thank you all, @jacobcybulski and @Telcontar120, for your input! Thanks for your explanation of how SVM works! Yes indeed, this is a discovery project of mine!