Identify similar strings of only one attribute

aandreal Member Posts: 4 Learner I
Hello,
I would like to measure the degree of similarity between strings that all belong to a single attribute of type text. The strings list the tests performed in the hospital in the form: exam_a;exam_b;exam_c. I would also like to detect when two strings contain the same elements in a different order, for example: exam_c;exam_b;exam_a.

Please help me.
Thanks
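
For the order-insensitive part of the question, one possible approach outside RapidMiner is to reduce each string to a canonical form (split on ';', sort, rejoin) and compare those. The snippet below is only a sketch in Python: the function names canonical_key and group_by_exam_set are made up for illustration, and it assumes that repeated exams within one string should count once.

```python
from collections import defaultdict


def canonical_key(exams: str, sep: str = ";") -> str:
    """Return an order-independent key for a separator-delimited string."""
    # Assumption: duplicates inside one string are collapsed, blanks are dropped.
    tokens = {t.strip() for t in exams.split(sep) if t.strip()}
    return sep.join(sorted(tokens))


def group_by_exam_set(rows):
    """Group the original strings by their canonical key."""
    groups = defaultdict(list)
    for row in rows:
        groups[canonical_key(row)].append(row)
    return dict(groups)


rows = ["exam_a;exam_b;exam_c", "exam_c;exam_b;exam_a", "exam_a;exam_b"]
groups = group_by_exam_set(rows)
for key, members in groups.items():
    print(key, members)
```

Strings that end up under the same key contain exactly the same exams, regardless of order.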
Answers

  • aandreal Member Posts: 4 Learner I
    Hi @mschmitz,
    thanks for the reply.

    It helps, but it is not quite what I want to do. My strings contain anywhere from 1 to 20 exams, depending on how many were performed, and some values are missing. I tried the Jaccard index idea by splitting the values on ';', but the result is a position-by-position comparison of words (after splitting, the shorter strings are padded with missing values to match the longest one). I would rather think in terms of sets and compare the set of words in one string with the set of words in another. What do you think? How could I do it?
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,282 RM Data Scientist
    Hi,
    The Jaccard index or the cosine similarity of one-hot encoded values may also be viable candidates for a solution, yes.

    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • aandreal Member Posts: 4 Learner I
    Thanks @mschmitz
    I think I found the solution with Jaccard. However, before applying it, I would like to sort the data. To do this, I am transposing the data and then sorting the columns. I have a problem with the transpose: after applying the operator, I am shown only the ID column and cannot find the other columns I need. Why?
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,282 RM Data Scientist
    Hi,
    You can still type in the names; they do exist.

    From the header information alone, we cannot know which columns will be created by transposing the table. That's why, sadly, we cannot display them.

    Best,
    Martin
  • aandreal Member Posts: 4 Learner I
    edited November 2020
    OMG, @mschmitz.

    So if I have 1158 attributes, do I have to do 1158 sorts? My idea was to use a Loop.
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,282 RM Data Scientist
    Of course you would use loops and macros for it. Nobody wants to do things 1158 times.
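
For readers who want to try the set-based comparison discussed in this thread outside RapidMiner, here is a rough Python sketch. It is not the RapidMiner process from the thread: the helper names to_set, jaccard and cosine_one_hot are hypothetical, and returning 0.0 when a value is missing is an assumption you may want to change. Because each string is turned into a set first, no transposing or sorting is needed, and strings of different lengths compare directly without padding.

```python
import math


def to_set(exams, sep=";"):
    """Split a separator-delimited string into a set of exam names."""
    if exams is None:  # missing value becomes an empty set
        return set()
    return {t.strip() for t in exams.split(sep) if t.strip()}


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    sa, sb = to_set(a), to_set(b)
    union = sa | sb
    if not union:
        return 0.0  # assumption: two missing values are not counted as similar
    return len(sa & sb) / len(union)


def cosine_one_hot(a, b):
    """Cosine similarity of the one-hot (binary) encodings of both sets."""
    sa, sb = to_set(a), to_set(b)
    if not sa or not sb:
        return 0.0  # assumption: a missing value is similar to nothing
    # For binary vectors the dot product is the number of shared exams
    # and each vector norm is the square root of the set size.
    return len(sa & sb) / math.sqrt(len(sa) * len(sb))


print(jaccard("exam_a;exam_b;exam_c", "exam_c;exam_b;exam_a"))
print(jaccard("exam_a", "exam_a;exam_b"))
print(cosine_one_hot("exam_a;exam_b", "exam_b;exam_c"))
```

Both measures ignore the order of the exams, so exam_a;exam_b;exam_c and exam_c;exam_b;exam_a come out as identical.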