Replace Token Solution for abbreviations that are part of other words

TRSisme05 Member Posts: 2 Learner I
I am working on a text analysis. My original text contains abbreviations, such as cust or cust. for customer. I can put the replace token operator before the tokenize operator and enter multiple replacements, such as replacing cust followed by a space with customer and cust. with customer. However, I am curious whether there is a way to do this after tokenization, because tokenization has already "grouped" the cust abbreviations together. I did try placing the replace operator after tokenization, but it replaced all occurrences of cust with customer, including the cust inside the full word customer. Any thoughts or ideas? Thank you for your help.

 

Answers

  • kayman Member Posts: 662 Unicorn
    One option is to create your own stemming library, since in the end your goal is to group similar words. The stem dictionary can do that, but you have to set up the library yourself and add all of the words, and it can be tricky when different words share characters.

    For example, adding cust* would stem customer, customers, etc. to cust (or whatever you choose), but it would do the same with customs or custody, so be careful.

    Replacing with a regex might be safer. Just make sure you use word boundaries in that case: \bcust\b would replace only the standalone cust with customer and leave all other words containing cust untouched.
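
To illustrate the word-boundary suggestion above, here is a minimal sketch in Python rather than in RapidMiner itself. The sample sentence is made up for illustration, and whether the Replace operator interprets \b in exactly the same way is an assumption worth checking against your own data.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "The cust called. Customs held the package for the customer; cust was upset."

# Plain substring replacement also rewrites the "cust" inside "customer".
naive = "customer".replace("cust", "customer")

# \b is a word boundary, so only the standalone token "cust" matches.
expanded = re.sub(r"\bcust\b", "customer", text)

print(naive)
print(expanded)
```

The first print shows the problem described in the question (the full word gets mangled), while the second leaves Customs and customer untouched.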
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @TRSisme05,

    To perform your task, use the following regex in the Replace operator:



    Regards,

    Lionel 


  • TRSisme05 Member Posts: 2 Learner I
    @lionelderkrikor, (cust)+$ does not replace cust within other words, but it only handles the case of cust followed by a space. I still need to create multiple replacements for cust. and so on.
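
One way to avoid maintaining separate replacements for cust and cust. is to make the trailing period optional in a single pattern. The sketch below uses Python; the pattern \bcust\b\.? and the helper name expand_cust are assumptions for illustration, not something taken from the thread or from RapidMiner's documentation.

```python
import re

# Assumed pattern: the standalone token "cust", optionally followed by a period.
ABBREV = re.compile(r"\bcust\b\.?")


def expand_cust(text: str) -> str:
    """Replace the abbreviation 'cust' or 'cust.' with 'customer'."""
    return ABBREV.sub("customer", text)


# Hypothetical inputs covering both abbreviation forms and two look-alike words.
samples = ["cust called", "cust. called", "customer called", "custody case"]
results = [expand_cust(s) for s in samples]

for before, after in zip(samples, results):
    print(f"{before!r} -> {after!r}")
```

Note the trade-off: because the period is consumed, a sentence that ends in cust. loses its final full stop. If that matters, keep the plain \bcust\b pattern and accept the leftover period instead.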