clustering - how to customize similarity function

LeMarc · May 2020

Hi

so clustering is basically based on similarity/distance functions in which one example is compared to all the other examples of the data set.

Now I want a similarity function in which the cell of an attribute/feature is not compared to all the other examples; instead, that particular cell should be compared only to a defined range of values. For example: a cell can have two possible values [yes, no]. So the similarity function should compare the given cell value with just the values 'yes' and 'no'.

Is this possible with RM? If so, how?

Thank you!

Telcontar120 · May 2020

If I have understood what you are trying to do, I think you can replicate what you want by simply creating a new attribute that specifies whether a given attribute value is contained in a reference set. You can do this using Generate Attributes and the "contains" function. After doing that (and you can loop through any set of attributes for doing this) you will get a set of yes/no attributes which you can then use to do your clustering as opposed to your original attributes.
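To illustrate the idea outside of RapidMiner, here is a minimal Python sketch of the "is the value in a reference set" flagging. It is not Generate Attributes syntax. The `ALLOWED` dictionary and the `flag_valid` helper are hypothetical names, the allowed values are the ones LeMarc lists later in the thread, and exact (case-sensitive) matching is an assumption.

```python
# Reference sets per attribute (values from the thread; exact matching is an assumption).
ALLOWED = {
    "Married": {"yes", "no"},
    "Colour": {"Blue", "Red"},
    "Job": {"Yes", "No"},
}


def flag_valid(example):
    """Return one yes/no flag per attribute: is the cell value in its reference set?"""
    return {
        f"{attr}_valid": "yes" if example.get(attr) in allowed else "no"
        for attr, allowed in ALLOWED.items()
    }


# Made-up examples: one clean row, one with a wrong value, one with two errors.
examples = [
    {"Married": "yes", "Colour": "Blue", "Job": "No"},
    {"Married": "-1", "Colour": "Red", "Job": "Yes"},
    {"Married": "no", "Colour": "", "Job": "Yess"},
]

flags = [flag_valid(e) for e in examples]
for row in flags:
    print(row)
```

The resulting yes/no flags play the role of the new attributes that would be fed into the clustering instead of the original ones.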

Telcontar120 · May 2020

@LeMarc You can chain several "contains" functions inside an IF statement to cover as many allowed values as you like. I think the approach I described will work for you; you will end up with a new set of attributes that tell you whether each of the underlying attributes contains a valid response or not (binary yes/no).
I'm not convinced that clustering is going to be the best way to handle this problem, though. You might want to look at some of the outlier detection algorithms as well.

MartinLiebig · May 2020

Hi,

I do not completely understand what you want to do, but I think this is simply not a distance measure? As you probably know, distance measures come with a few assumptions.

Technically you can register new distances, which are then available in all operators. But this requires Java, and it feels like this is not what you want to do.

Cheers,

Martin

LeMarc · May 2020

@mschmitz & @Telcontar120

I have an example set in which some cell values contain data errors (e.g. spelling mistakes, empty cells, wrong content, etc.). The goal is to detect examples with data issues.

My approach in using a clustering method is to create one cluster with all examples that have no data errors at all [Cluster_0]. The other clusters [Cluster_1, Cluster_2] consist of examples with at least one data error. Below is an example set.

Image: https://us.v-cdn.net/6030995/uploads/editor/6e/hhwzvr8zi1z6.png

The method is to change HOW the similarity value [1, 0, 0.333] of an example is calculated, since the clusters are built based on that similarity. (1) Therefore one needs to define a similarity function for each (categorical) attribute type in which the cell value is compared to a defined range of values, instead of comparing the cell value to all the other existing values of that particular attribute (see below) within the example set. For Married it would be [yes, no], for Colour [Blue, Red] and for Job [Yes, No]. M stands for Married.

Image: https://us.v-cdn.net/6030995/uploads/editor/km/6ilt7ith1rc8.png

So if the cell value [3/Married] contains e.g. 'yes' or 'no', then the output would be '1' (see below). However, if the cell value were e.g. '-1', then the output would be '0'.

Image: https://us.v-cdn.net/6030995/uploads/editor/v2/zgsj6bb28p58.png

The idea is: every time a cell value does not conform to the defined range of values, the similarity value of the whole example drops below 1, which excludes the example from Cluster_0, where there are no data errors (similarity = 1). Does this make sense/work?
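A rough Python sketch of that calculation, only to make the arithmetic concrete: each cell scores 1 if its value is in the defined range and 0 otherwise, and the example's similarity is the mean of its cell scores. Averaging the cell scores is an assumption (it reproduces the values 1, 0 and 0.333 for three attributes), and `example_similarity` and `allowed_by_attr` are hypothetical names, not RapidMiner functions.

```python
def example_similarity(example, allowed_by_attr):
    """Fraction of cells whose value lies in the defined range for its attribute."""
    scores = [
        1 if example.get(attr) in allowed else 0
        for attr, allowed in allowed_by_attr.items()
    ]
    return sum(scores) / len(scores)


# Defined ranges as given in the post; exact matching is an assumption.
allowed_by_attr = {
    "Married": {"yes", "no"},
    "Colour": {"Blue", "Red"},
    "Job": {"Yes", "No"},
}

clean = {"Married": "yes", "Colour": "Red", "Job": "No"}
one_valid = {"Married": "-1", "Colour": "Blue", "Job": ""}
none_valid = {"Married": "-1", "Colour": "Gren", "Job": ""}

sim_clean = example_similarity(clean, allowed_by_attr)
sim_one_valid = example_similarity(one_valid, allowed_by_attr)
sim_none_valid = example_similarity(none_valid, allowed_by_attr)

# Only examples with similarity exactly 1 would belong to Cluster_0.
print(sim_clean, sim_one_valid, sim_none_valid)
```

With this scoring, separating Cluster_0 from the rest reduces to checking whether the similarity equals 1, so no clustering algorithm is strictly needed for that split.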

So the question is how to implement (1)?

@Telcontar120 Btw what is the function if I want to include more than just one value to compare? e.g. contains ([Married], "yes" ??

MartinLiebig · May 2020

Hi @LeMarc ,

this sounds more like you want to use CrossDistance to a Reference table?
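One possible reading of this suggestion, sketched in Python rather than with the actual operator: build a reference table of all valid value combinations and compute each example's distance to its nearest reference row. The mismatch-fraction distance, the way the reference table is built, and the helper names are all assumptions made for illustration.

```python
from itertools import product

attrs = ["Married", "Colour", "Job"]
allowed = {
    "Married": {"yes", "no"},
    "Colour": {"Blue", "Red"},
    "Job": {"Yes", "No"},
}

# Reference table: every combination of allowed values (2 * 2 * 2 = 8 rows).
reference = [
    dict(zip(attrs, combo))
    for combo in product(*(sorted(allowed[a]) for a in attrs))
]


def mismatch_distance(a, b):
    """Fraction of attributes on which two rows disagree (a simple nominal distance)."""
    return sum(a.get(k) != b.get(k) for k in attrs) / len(attrs)


def min_distance_to_reference(example):
    """Distance from an example to its closest row in the reference table."""
    return min(mismatch_distance(example, ref) for ref in reference)


clean = {"Married": "no", "Colour": "Blue", "Job": "Yes"}
one_error = {"Married": "-1", "Colour": "Blue", "Job": "Yes"}

d_clean = min_distance_to_reference(clean)
d_one_error = min_distance_to_reference(one_error)
print(d_clean, d_one_error)
```

An example with distance 0 matches a valid combination exactly; any distance above 0 points to at least one cell outside its defined range.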

Cheers,

Martin

LeMarc · May 2020

Sorry for the late reply. It's been quite busy here.

Thank you @Telcontar120 for your help. It is appreciated.

@mschmitz Yes, very similar to CrossDistance to a Reference Table, because I am trying to find a clustering approach that is independent of the clustering algorithm, settings and clustering method.
