Define established terms

Limegreenman900 Member Posts: 6 Contributor II
edited November 2018 in Help
Hi everyone,

Does anybody know whether RM has an operator, or a setting inside an operator, where I can define established terms? I am currently extracting text from HTML files with the "Cut Document" operator, and inside it I use the "Extract Content" operator from the Web Mining extension. After that I do some routine steps such as "Replace Tokens", "Tokenize" and "Extract Token Number". My text contains some multi-word expressions that are normally treated as a single established term, and I wondered whether RM can handle them as such.

Example:
Generally Accepted Accounting Practice
International Standards on Auditing
....

Until now, due to tokenization, every word is a single token, but it would be great to have each of these expressions treated as one token.
I know I could use the "Replace Tokens" operator and replace every term with an abbreviation, e.g. "International Standards on Auditing" = "ISA", but that is not what I want.

Any help appreciated!

Answers

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Why not use the Replace Tokens operator, but join the words of each term with underscores instead of replacing the term with an abbreviation?

    So:
    Generally Accepted Accounting Practice = Generally_Accepted_Accounting_Practice
    International Standards on Auditing = International_Standards_on_Auditing

    At the end of your processing you can then run Replace Tokens again and swap each '_' for a ' ', so that the established terms are restored.
  • Limegreenman900 Member Posts: 6 Contributor II
    You are right, I totally ignored the option of using underscores to connect the words :)
    Thanks for your hint on that!
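
For readers who want to see the underscore approach outside of RapidMiner, here is a minimal Python sketch of the same idea. It is not RapidMiner code: the term list and the helper names `protect_terms`, `tokenize` and `restore_terms` are made up for illustration, and it assumes a tokenizer that does not split on underscores.

```python
import re

# Hypothetical list of established terms (taken from the example in the question).
ESTABLISHED_TERMS = [
    "Generally Accepted Accounting Practice",
    "International Standards on Auditing",
]


def protect_terms(text, terms):
    """Join the words of each established term with underscores."""
    # Handle longer terms first so a shorter term cannot break up a longer one.
    for term in sorted(terms, key=len, reverse=True):
        # Allow any run of whitespace between the words of the term.
        pattern = re.compile(
            r"\s+".join(re.escape(word) for word in term.split()),
            re.IGNORECASE,
        )
        text = pattern.sub(term.replace(" ", "_"), text)
    return text


def tokenize(text):
    """Split into tokens; \\w keeps letters, digits and underscores together."""
    return re.findall(r"\w+", text)


def restore_terms(tokens):
    """Swap the underscores back to spaces after processing."""
    return [token.replace("_", " ") for token in tokens]


text = (
    "The audit follows International Standards on Auditing "
    "and Generally Accepted Accounting Practice."
)
tokens = restore_terms(tokenize(protect_terms(text, ESTABLISHED_TERMS)))
print(tokens)
```

Two caveats with this sketch: the restore step also changes any token that contained an underscore in the original text, and a tokenizer that splits on every non-letter character would break the protected terms apart again, so the separator has to be one the tokenizer keeps.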