How to cluster missing values in one cluster?

LeMarc · April 2020

Hello,

I would like to get two clusters from a data set: one containing the examples that have missing values and the other containing the examples that don't have any. As most clustering algorithms do not allow missing values in the data set, the missing values could be replaced by e.g. "0". However, I'm still clueless about what to do after that so that all examples with missing values end up in one cluster.

Can anyone help?

Thank you!

MartinLiebig · April 2020

Hi,

Why can't you just use a Filter Examples operator with "missing_attributes" as the filter condition? That should do the trick.

Cheers,

Martin
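
For readers working outside RapidMiner, the filter idea above can be sketched in plain Python. This is only an illustration, not the operator itself: the example set, the attribute names and the helper `has_missing` are made up, and `None` is assumed to mark a missing value.

```python
# Hypothetical example set; None marks a missing value.
examples = [
    {"age": 34, "income": 52000.0},
    {"age": None, "income": 48000.0},
    {"age": 29, "income": None},
    {"age": 41, "income": 61000.0},
]


def has_missing(example):
    """Return True if at least one attribute of the example is missing."""
    return any(value is None for value in example.values())


# Split the example set into the two groups asked for in the question.
with_missing = [ex for ex in examples if has_missing(ex)]
complete = [ex for ex in examples if not has_missing(ex)]

print(len(with_missing), "examples with missing values")
print(len(complete), "complete examples")
```

No clustering is involved here; the split is exact because it looks only at whether a value is present.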

LeMarc · April 2020

Hi @mschmitz,

Yes, that would be easy. It's just that my task is to cluster all the missing values, not to use the filter operator.

MartinLiebig · April 2020

Hi,

Well, you can set them to -100000 and then just cluster on that without normalizing. That should do the same trick.

~Martin
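
A rough sketch of the placeholder trick above, written in plain Python rather than as a RapidMiner process. Everything in it is an assumption made for illustration: the tiny example set, the small hand-written k-means, and the deterministic choice of starting centroids. In this toy data only one attribute has missing values, so all replaced rows land in the same far-away region.

```python
import math

SENTINEL = -100000.0  # placeholder value suggested in the post above

# Hypothetical example set: [length, width]; None marks a missing width.
rows = [
    [5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.8, 2.7],
    [5.5, None], [6.0, None], [4.7, None],
]

# Replace missing values by the placeholder; no normalisation afterwards.
points = [[SENTINEL if v is None else v for v in row] for row in rows]


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd-style k-means; returns one cluster index per point."""
    centroids = [list(c) for c in centroids]
    labels = [nearest(p, centroids) for p in points]
    for _ in range(iterations):
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:  # assignments stable, stop early
            break
        labels = new_labels
    return labels


# Deterministic start: the first point and the point farthest from it.
start = [points[0], max(points, key=lambda p: math.dist(p, points[0]))]
labels = kmeans(points, start)

for row, label in zip(rows, labels):
    print(label, row)
```

If missing values occur in several different attributes, the replaced rows end up in different far-away regions, and two clusters are then not guaranteed to group them all together.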

LeMarc · April 2020

Thank you @mschmitz, will try it!

jacobcybulski · April 2020

I am not sure that replacing missing values with big numbers will cluster these examples together. However, I feel that you could train an SVM classifier with a radial kernel to separate them from the rest (the intuition is that they'd all be far from the centre of your data).

LeMarc · April 2020

@jacobcybulski Thank you for your input! I'm going to try your suggestion!

jacobcybulski · April 2020

@LeMarc, I am not sure whether you have much experience with SVMs. If not, do not get discouraged if the initial results are very poor. You will need to run some optimisation of the SVM kernel hyper-parameters. The radial kernel may work here; if not, try the anova kernel, which is more sophisticated and is commonly optimised on kernel.gamma, kernel.degree and C.

Jacob

Telcontar120 · April 2020

The most common clustering algorithms are not really designed to handle missing values. So you may be able to "trick" these algorithms into creating a cluster by using artificially high or low values, but a better approach would be to use a different method altogether, one that is designed to actually separate cases and that can handle missing values more directly. Several of the earlier posts have recommended some of these approaches.

jacobcybulski · April 2020

@Telcontar120, indeed, I think this is a discovery project on how to handle missing values differently. The idea with the SVM was that if you replace missing values with some big numbers (at least for the numerical attributes), then all examples with missing values will be pushed far from the data centre, in which case an SVM with a radial kernel may help isolate them.
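
The "pushed far from the data centre" intuition can be checked numerically with a small plain-Python sketch. This is not an SVM: it only measures distances, using assumptions chosen for illustration (a made-up example set, 100000 as the big number, and the component-wise median as the estimate of the centre).

```python
import math
import statistics

SENTINEL = 100000.0  # hypothetical "big number" standing in for a missing value

# Hypothetical numeric example set; None marks a missing value.
rows = [
    [5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.8, 2.7], [5.0, 3.4],
    [None, 3.1], [6.0, None], [None, None],
]
points = [[SENTINEL if v is None else v for v in row] for row in rows]

# Component-wise median: a centre estimate that the replaced rows barely
# move, as long as fewer than half of the values in a column are missing.
centre = [statistics.median(col) for col in zip(*points)]

# Euclidean distance of every example from that centre.
distances = [math.dist(p, centre) for p in points]

for row, d in zip(rows, distances):
    print(f"{d:12.1f}  {row}")
```

In this toy data the complete examples stay within a few units of the centre, while every example with a replaced value lies roughly 100000 units or more away. That gap is what a classifier with a radial kernel would have to pick up.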

LeMarc · April 2020

Thank you all! @jacobcybulski and @Telcontar120, thank you for your input, and thanks for your explanation of how SVM works! Yes indeed, this is a discovery project of mine!

Altair RapidMiner Community
