Due to recent updates, all users are required to create an Altair One account to login to the RapidMiner community. Click the Register button to create your account using the same email that you have previously used to login to the RapidMiner community. This will ensure that any previously created content will be synced to your Altair One account. Once you login, you will be asked to provide a username that identifies you to other Community users. Email us at Community with questions.
Identify similar strings of only one attribute
Hello,
I would like to identify a degree of similarity between strings all belonging to a single attribute of type text. The reason is that I have strings that present tests performed in the hospital in the form: exam_a;exam_b;exam_c. I would also like to identify when they occur in different order but always with the same elements: exam_c;exam_b;exam_a.
I would like to identify a degree of similarity between strings all belonging to a single attribute of type text. The reason is that I have strings that present tests performed in the hospital in the form: exam_a;exam_b;exam_c. I would also like to identify when they occur in different order but always with the same elements: exam_c;exam_b;exam_a.
Please help me.
Thanks
Tagged:
0
Best Answer
-
MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,527 RM Data ScientistHi,Have a look at the operator fuzzy matching and Generate Levensthein Distance in operator toolbox extension. I think what you want to do is to replace the ; with a space and then do a fuzzy matching using TOKEN_SET_RATIO or so as a measure.Cheers,Martin- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany5
Answers
It can help me but not quite what I want to do. I have situations in which I have strings of length 1 but also of length 20 (depending on the number of exams). Besides that, I have situations of missing values. I considered the Jaccard index idea by working on values separated by; but what happens is a word-by-word comparison (taking into account that by splitting the shorter strings are still commensurate with the longer string by adding missing values). I would like to think in terms of sets, then compare the words of one string with the words of a second string. What do you think about it? How could I do it?
Dortmund, Germany
I think I found the solution with Jaccard. However, before applying it, I would like to sort the data. To do this I am transposing and then sorting the columns. I have a problem with the transpose: applying the operator I am shown only the column of type ID and I cannot find all the other necessary columns. Why?
Dortmund, Germany
So if I have 1158 attributes, do I have to do 1158 sort? My idea was to use a Loop.
Dortmund, Germany