

Grouping profile strings that contain the same words in a different order (Python)

Robertd Member Posts: 1 Newbie
edited October 19 in Help

I have a data frame containing a column of profile types, which looks like this:

0                                    Android Java
1                  Software Development Developer
2                            Full-stack Developer
3                      JavaScript Frontend Design
4                          Android iOS JavaScript
5                             Ruby JavaScript PHP

I've used NLP to fuzzy match similar profiles, which returned the following similarity dataframe:

    left_side                    right_side                   similarity
7   JavaScript Frontend Design   Design JavaScript Frontend   0.849943
8   JavaScript Frontend Design   Frontend Design JavaScript   0.814599
9   JavaScript Frontend Design   JavaScript Frontend          0.808010
10  JavaScript Frontend Design   Frontend JavaScript Design   0.802881
12  Android iOS JavaScript       Android iOS Java             0.925126
15  Machine Learning Engineer    Machine Learning Developer   0.839165
21  Android Developer Developer  Android Developer            0.872646
25  Design Marketing Testing     Design Marketing             0.817195
28  Quality Assurance            Quality Assurance Developer  0.948010

While this has helped, taking me from 478 unique profiles to 461, what I want to focus on are profiles like this:

Frontend Design JavaScript  Design Frontend JavaScript

The only tool I've seen that looks to address this problem is difflib. My question is: what other techniques are available to go through and standardize these profiles, which consist of the same words but out of order, to one standard string? So the desired output would be taking any string containing "Design", "Frontend" and "JavaScript" and replacing it with "Design Frontend JavaScript".
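To make the desired output concrete, here is a minimal sketch of one possible normalization: split each profile into words, sort them, and join them back together, so every ordering of the same words collapses to one string. The function name `canonical_profile` and the sample list are assumptions for illustration, not part of the original data frame.

```python
def canonical_profile(profile: str) -> str:
    """Return the profile's words in alphabetical order."""
    # Sort case-insensitively so "iOS" is ordered alongside capitalized words.
    # Repeated words (e.g. "Developer Developer") are kept, not deduplicated.
    return " ".join(sorted(profile.split(), key=str.lower))


profiles = [
    "JavaScript Frontend Design",
    "Frontend Design JavaScript",
    "Design Frontend JavaScript",
    "Android iOS JavaScript",
]

standardized = [canonical_profile(p) for p in profiles]
unique_profiles = sorted(set(standardized))
print(unique_profiles)
```

On a pandas column, the same function could presumably be applied with something like `df["profile"].map(canonical_profile)`, where `profile` is a placeholder column name.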

Right now, I'm merging my original dataframe with the similarity dataframe to replace every occurrence of a right_side profile string with its left_side, but that means I'm replacing the right_side below ("Java Python Data Science") with the left_side below ("JavaScript Python Data Science").

53  JavaScript Python Data Science  Java Python Data Science

Any help would be greatly appreciated!!!
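The false merge described above (a "Java" profile replaced by a "JavaScript" one) could be avoided by accepting a fuzzy-match pair only when both sides contain exactly the same words. The sketch below assumes that rule; `same_words` is a hypothetical helper, and the pairs are copied from the similarity output in this post.

```python
def same_words(a: str, b: str) -> bool:
    """True if both strings contain exactly the same words, ignoring order and case."""
    return sorted(a.lower().split()) == sorted(b.lower().split())


# (left_side, right_side) pairs taken from the similarity output above.
pairs = [
    ("JavaScript Frontend Design", "Design JavaScript Frontend"),
    ("JavaScript Frontend Design", "JavaScript Frontend"),
    ("Android iOS JavaScript", "Android iOS Java"),
    ("JavaScript Python Data Science", "Java Python Data Science"),
]

# Keep only replacements where nothing but the word order differs.
replacements = {right: left for left, right in pairs if same_words(left, right)}
print(replacements)
```

With this filter, only the reordered "Design JavaScript Frontend" is mapped to its left_side; the Java/JavaScript pairs are left alone.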


  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,764  RM Data Scientist
    If you have Toolbox installed, you have a function in Generate Attributes called fuzzy_match. Its options are explained here: https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
    They should cover exactly this.

    Toolbox also has a fuzzy match operator which can be useful here (and uses the same functions).


    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
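The article linked in the reply describes, among other options, a token-sort comparison: both strings are split into words, the words are sorted, and only then are the strings compared. The sketch below illustrates that idea with `difflib` from the Python standard library. It is not the Toolbox `fuzzy_match` function or the fuzzywuzzy implementation, so its scores should not be expected to match either; `token_sort_similarity` is a hypothetical name.

```python
from difflib import SequenceMatcher


def token_sort_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] after lowercasing and sorting each string's words."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, norm_a, norm_b).ratio()


reordered = token_sort_similarity(
    "Frontend Design JavaScript", "Design Frontend JavaScript"
)
different = token_sort_similarity(
    "JavaScript Python Data Science", "Java Python Data Science"
)
print(reordered)  # the two strings are identical once their words are sorted
print(different)  # lower, because "java" and "javascript" are different words
```

Requiring a score of exactly 1.0 would merge only profiles that differ in word order, while a lower threshold would also let near matches such as Java/JavaScript through.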