Result of distance operator can not be reproduced

eifel_wolf · February 2014

Hello,

I'm using a distance operator to match some elements, but unfortunately I get strange results that I'm not able to reproduce. In this particular case I use Jaccard similarity for the distance, but other measures give equally strange results.

As far as I understand this method, it takes all attributes from two examples and matches them: same attribute in both examples -> hit; attribute in only one example -> miss. In the end the distance is calculated as the number of hits divided by the number of attributes used: 6 matching attributes out of 13 attributes = 0.46. I've debugged my process and checked the cases manually, but I don't get any result from RapidMiner that corresponds to this calculation. There are cases with no hit out of x attributes that get a result in the range of 0.35 or 0.4, and other cases with one hit out of x where the given distance (or similarity) is lower than in the cases before.
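The manual check described above can be written down as a small Python sketch. It assumes each example is represented by the set of attribute names that are present (true) for it. The function name `set_jaccard` and the attribute names are made up for illustration and are not part of RapidMiner.

```python
def set_jaccard(a, b):
    """Shared attributes divided by all attributes used by either example."""
    union = a | b
    if not union:
        return 1.0  # assumption: two empty examples count as identical
    return len(a & b) / len(union)


# Hypothetical examples: 6 shared attributes, 13 attributes used in total.
shared = {f"s{i}" for i in range(6)}
first = shared | {"a1", "a2", "a3", "a4"}
second = shared | {"b1", "b2", "b3"}

similarity = set_jaccard(first, second)
print(round(similarity, 2))  # 6 / 13, i.e. 0.46
```

With this reading, a pair with no shared attribute always scores 0, which is why values around 0.35 or 0.4 for such pairs look wrong.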

I'm totally confused about this. Is anybody out there who knows about this strange behaviour and how to solve it?

Regards

Mario

David_A · February 2014

Hello Mario,

I've just tried to reproduce your problem, and for my test process the operator works fine.
One possible issue could be that the distance operator does not consider special attribute types (like id, label, cluster) when calculating the distance.
Could it be that you have these attributes in your data set and use them when trying to reproduce the output?

Regards,
David

eifel_wolf · February 2014

Hello,

thank you for your answer. I just rechecked the process, the attributes to be used for distance operations are all of role "regular" and type "binominal". As far as I know this is OK for this case.

Regards

Mario

eifel_wolf · February 2014

Hello again,

I've done some further checks on this function, but I still have no clear picture. I've created a controlled environment, running dedicated test data through my process, and I've checked the data within the process at several checkpoints.

First I ran two identical datasets and got a perfect match for the identical items. For example: one record has 7 attributes, and matching it with itself gives seven equal attributes. Even non-identical items got good similarities, but as long as I take the best value it fits.

Then I started making the datasets deliberately non-identical, meaning I changed one attribute in one of the datasets. This gives one attribute more and one match less for the same items. Continuing the first example: now there are 8 attributes to compare for the record, and only 6 of them match. This should give a similarity of 6/8 = 0.75, but I get a value of ~0.724.

If I do several loops, making the examples match less in each loop, this effect grows: the values I calculate manually with the Jaccard formula differ more and more from the values the operator delivers. It seems to me that this operator does not just do a simple match between two records of the datasets, but that the distance value itself is also modified somehow.

In the examples before I always changed one attribute of all records (five for each dataset); now I've tried changing only one record and leaving the others as they are. In the result I get a perfect match for the record I've changed (made slightly unequal), but the ones I didn't touch have lost "similarity", which means the quality of the matches decreased without any change to these examples.

I don't understand this behaviour ... are there any ideas here? Or workarounds for finding the best-matching rows of two example sets?

Regards

Mario

David_A · February 2014

I think the problem is that the formula for the Nominal JaccardDistance is:

equalNonFalseValues / (equalNonFalseValues + unequalValues)

So, for example, if out of ten attributes there is only one pair where both are true, three pairs where both are false, and six where the pairs are different, the result will be 1/(1+6)=0.143.

If you want the number of equal pairs (regardless of their value), you should use SimpleMatching Distance (which gives 4/10 = 0.4 in the example above).
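To make the difference concrete, here is a minimal Python sketch of the two calculations as described in this post. It is not RapidMiner code; the function names and the boolean-list representation of binominal attributes are assumptions for illustration.

```python
def jaccard_nominal(x, y):
    """equalNonFalseValues / (equalNonFalseValues + unequalValues)."""
    equal_non_false = sum(1 for a, b in zip(x, y) if a and b)
    unequal = sum(1 for a, b in zip(x, y) if a != b)
    denom = equal_non_false + unequal
    # Assumption: if every pair is false/false, return 0.0 instead of dividing by zero.
    return equal_non_false / denom if denom else 0.0


def simple_matching(x, y):
    """Equal pairs (true/true or false/false) divided by all pairs."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)


# Ten attributes: 1 pair both true, 3 pairs both false, 6 pairs different.
x = [True, False, False, False, True, True, True, False, False, False]
y = [True, False, False, False, False, False, False, True, True, True]

print(round(jaccard_nominal(x, y), 3))  # 1 / (1 + 6), i.e. 0.143
print(simple_matching(x, y))            # 4 / 10, i.e. 0.4
```

The only difference between the two is whether false/false pairs count as agreement: the Jaccard variant ignores them, simple matching counts them.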

I hope this explains the difference in the results you have observed.

Best regards,
David

eifel_wolf · February 2014

Hello,

thank you for your answer! Looking at the example you've given and the result you've calculated, the JaccardDistance is exactly what I need. Unfortunately the results I get from the operator do not match my expectation.

If I understand your example and calculation correctly, you would also expect the JaccardDistance operator to match two examples and give a result for that pair without taking any of the other examples in the dataset into account. But the situation I found looks like it is doing an additional "weighting" of results based on something related to the results of all examples ...

Please have a look at the last example from my previous post: I've got 2 identical datasets and I found a perfect match between the same elements of both sets. Then I changed one attribute for one id in one of the datasets, making it match less (6 of 8 instead of 7 of 7 attributes). After this the example I changed is still a perfect match, but all the other pairs with the same id and the same attributes are no longer reported as a perfect match; they've lost match quality.
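For comparison, this is what the formula quoted earlier in the thread would give for that scenario, assuming the measure looks only at the two rows being compared. The ids, attribute names, and the `jaccard_nominal` and `to_row` helpers are hypothetical. The sketch does not reproduce the operator's internals and does not explain the observed values.

```python
def jaccard_nominal(x, y):
    """equalNonFalseValues / (equalNonFalseValues + unequalValues) for two rows."""
    both_true = sum(1 for c in x if x[c] and y[c])
    unequal = sum(1 for c in x if x[c] != y[c])
    denom = both_true + unequal
    return both_true / denom if denom else 0.0  # assumption for all-false rows


# Two hypothetical ids, each with 7 true attributes, in two identical datasets.
left_sets = {
    "id1": {f"a{k}" for k in range(1, 8)},
    "id2": {f"b{k}" for k in range(1, 8)},
}
right_sets = {i: set(attrs) for i, attrs in left_sets.items()}

# Change one attribute for id1 in the second dataset: a7 becomes a8.
right_sets["id1"] = (right_sets["id1"] - {"a7"}) | {"a8"}

# Both datasets now share one extra binominal column (a8).
columns = sorted(set().union(*left_sets.values(), *right_sets.values()))


def to_row(attrs):
    """Binominal row over all columns: True where the attribute is present."""
    return {c: c in attrs for c in columns}


scores = {
    i: jaccard_nominal(to_row(left_sets[i]), to_row(right_sets[i]))
    for i in left_sets
}
print(scores)  # id1: 6 / 8, id2: 7 / 7
```

Under these assumptions the changed pair drops to 0.75 and the untouched pair stays at 1.0, because the extra column is false/false for the untouched pair and the formula ignores such pairs.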

Either this is a bug or I have a very big gap in my understanding of the operator ... and as the operator is presumably used successfully by other people too, I believe I must be wrong - but I want to understand how ...

Regards

Mario

Altair RapidMiner Community