Identify similar strings of only one attribute

aandreal · November 2020

Hello,
I would like to measure the degree of similarity between strings that all belong to a single attribute of type text. The reason is that I have strings that list the tests performed in the hospital, in the form: exam_a;exam_b;exam_c. I would also like to recognize strings that contain the same elements in a different order: exam_c;exam_b;exam_a.

Please help me.

Thanks

MartinLiebig · November 2020

Hi,

Have a look at the Fuzzy Matching and Generate Levenshtein Distance operators in the Operator Toolbox extension. I think what you want to do is replace the ; with a space and then do a fuzzy matching using TOKEN_SET_RATIO or similar as the measure.

Cheers,

Martin
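
For readers who want to see the token-set idea outside RapidMiner, here is a rough Python sketch. It is not the operator's implementation (how TOKEN_SET_RATIO is computed inside the extension is not stated in this thread); it follows the commonly described token-set approach and uses only the standard library's difflib. The function names `tokens`, `ratio` and `token_set_ratio` are made up for this illustration.

```python
from difflib import SequenceMatcher


def tokens(s, sep=";"):
    """Split on the separator, trim whitespace, drop empty entries."""
    return {t.strip() for t in s.split(sep) if t.strip()}


def ratio(a, b):
    """Character-level similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def token_set_ratio(a, b, sep=";"):
    """Order-insensitive similarity, modeled on the token-set idea (assumption)."""
    ta, tb = tokens(a, sep), tokens(b, sep)
    common = " ".join(sorted(ta & tb))   # tokens shared by both strings
    only_a = " ".join(sorted(ta - tb))   # tokens found only in a
    only_b = " ".join(sorted(tb - ta))   # tokens found only in b
    s_a = (common + " " + only_a).strip()
    s_b = (common + " " + only_b).strip()
    # Best of the three pairwise comparisons
    return max(ratio(common, s_a), ratio(common, s_b), ratio(s_a, s_b))


same_exams = token_set_ratio("exam_a;exam_b;exam_c", "exam_c;exam_b;exam_a")
subset = token_set_ratio("exam_a", "exam_a;exam_b")
```

In this sketch a reordered string scores 1.0, but so does a string whose exams are a subset of the other's. If that is not desired, a purely set-based measure may fit better.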

aandreal · November 2020

Hi @mschmitz,

thanks for the reply.

It helps, but it is not quite what I want to do. I have strings with a single exam but also strings with 20 (depending on the number of exams). Besides that, I have missing values. I considered the Jaccard index, working on the values separated by ;, but what I get is a position-by-position comparison of words (after splitting, the shorter strings are padded with missing values so that they match the length of the longer string). I would like to think in terms of sets instead, and compare the words of one string with the words of a second string. What do you think about it? How could I do it?
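
A minimal Python sketch of the set-based comparison described here, outside RapidMiner. Each string becomes a set of exam names, so order and string length do not matter and no padding is needed. The names `exam_set` and `jaccard` are made up for this illustration, and returning `None` when both values are missing is an assumption; another convention may suit the data better.

```python
def exam_set(value, sep=";"):
    """Turn 'exam_a;exam_b' into {'exam_a', 'exam_b'}; a missing value gives an empty set."""
    if value is None:
        return set()
    return {t.strip() for t in value.split(sep) if t.strip()}


def jaccard(a, b, sep=";"):
    """Jaccard index |A ∩ B| / |A ∪ B| on the exam sets of two strings."""
    sa, sb = exam_set(a, sep), exam_set(b, sep)
    union = sa | sb
    if not union:
        # Both values missing or empty: similarity is undefined (assumption: return None)
        return None
    return len(sa & sb) / len(union)


reordered = jaccard("exam_a;exam_b;exam_c", "exam_c;exam_b;exam_a")
partial = jaccard("exam_a", "exam_a;exam_b")
```

Because sets are unordered, strings with the same exams in a different order get a Jaccard index of 1 without any prior sorting.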

MartinLiebig · November 2020

Hi,

Yes, the Jaccard index or the cosine similarity of one-hot encoded values may also be viable candidates for a solution.

Best,

Martin
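
As an illustration of the one-hot variant, here is a small Python sketch (not a RapidMiner process). Each exam becomes one 0/1 column and two rows are compared with cosine similarity. The sample rows reuse the exam names from the question; `one_hot`, `cosine`, `rows` and `vocabulary` are hypothetical names, and scoring a missing row as 0.0 is an assumption.

```python
import math


def one_hot(value, vocabulary, sep=";"):
    """Encode one string as a 0/1 vector over the sorted exam vocabulary."""
    present = {t.strip() for t in (value or "").split(sep) if t.strip()}
    return [1 if exam in present else 0 for exam in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_sq = sum(x * x for x in u) * sum(y * y for y in v)
    if norm_sq == 0:
        # At least one row has no exams at all (assumption: score it as 0.0)
        return 0.0
    return dot / math.sqrt(norm_sq)


rows = ["exam_a;exam_b;exam_c", "exam_c;exam_b;exam_a", "exam_a", None]

# The vocabulary is every distinct exam name seen in the data
vocabulary = sorted({t.strip() for r in rows if r for t in r.split(";") if t.strip()})
vectors = [one_hot(r, vocabulary) for r in rows]

similarity = cosine(vectors[0], vectors[1])
```

The encoding discards order by construction, so the first two rows produce identical vectors.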

aandreal · November 2020

Thanks @mschmitz
I think I found the solution with Jaccard. However, before applying it, I would like to sort the data. To do this I am transposing and then sorting the columns. I have a problem with the transpose: applying the operator I am shown only the column of type ID and I cannot find all the other necessary columns. Why?
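
For comparison, the sorting step can be sketched in Python without any transposing, by sorting the exams inside each string. This is an illustration outside RapidMiner, not the process used in this thread; `canonical` is a made-up name, and dropping repeated exams is an assumption.

```python
def canonical(value, sep=";"):
    """Return the exams of one string in sorted order, joined by the separator."""
    if value is None:
        return None  # keep missing values as missing
    # The set removes repeated exams (assumption); sorted() fixes the order
    parts = sorted({t.strip() for t in value.split(sep) if t.strip()})
    return sep.join(parts)


first = canonical("exam_c;exam_b;exam_a")
second = canonical("exam_a;exam_b;exam_c")
```

Two strings with the same exams then become identical text. If the Jaccard index is computed on sets rather than position by position, this sorting is not required, because sets ignore order.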

MartinLiebig · November 2020

Hi,

You can still type in the names; they do exist.

From the header information alone, we cannot know which columns will be created after transposing the table. That's why, sadly, we cannot display them.

Best,

Martin

aandreal · November 2020

OMG, @mschmitz.

So if I have 1158 attributes, do I have to do 1158 sorts? My idea was to use a Loop.

MartinLiebig · November 2020

Of course you use loops and macros for it. Nobody wants to do things 1158 times.

Altair RapidMiner Community
