
Define established terms

Limegreenman900 Member Posts: 6 Contributor II
edited November 2018 in Help
Hi everyone,

Does anybody know whether RM has an operator, or a setting inside an operator, where I can define established terms? I am currently extracting text from HTML files with the "Cut Document" operator, and inside it I use the "Extract Content" operator from the Web Mining extension. After that I apply routine steps such as "Replace Tokens", "Tokenize" and "Extract Token Number". My text contains some multi-word expressions that are normally treated as a single established term, so I wondered whether this is possible in RM.

Example:
Generally Accepted Accounting Practice
International Standards on Auditing
....

So far, tokenization turns every word into a separate token, but it would be great to have each of these expressions treated as one token.
I know I could use the "Replace Tokens" operator and replace every term with an abbreviation, e.g. "International Standards on Auditing" = "ISA", but that is not what I want.

Any help appreciated!

Answers

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Why not use the "Replace Tokens" operator, but join the words with underscores instead of replacing each term with an abbreviation?

    So:
    Generally Accepted Accounting Practice = Generally_Accepted_Accounting_Practice
    International Standards on Auditing = International_Standards_on_Auditing

    At the end of your processing you can run "Replace Tokens" again and swap each '_' back to ' ', which restores the established term.
  • Limegreenman900 Member Posts: 6 Contributor II
    You are right, I totally ignored the option using underlines to connect the words  :)
    Thanks for your hint on that!
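
For readers who want to see the underscore trick outside of RapidMiner, here is a minimal Python sketch of the same idea: join each established term with underscores before tokenizing, then swap the underscores back to spaces afterwards. It is only an illustration and does not use any RapidMiner API. The names `ESTABLISHED_TERMS`, `protect_terms`, `tokenize` and `restore_terms` are hypothetical, and the simple `\w+` tokenizer is an assumption standing in for the "Tokenize" operator.

```python
import re

# Hypothetical term list taken from the examples in the question.
ESTABLISHED_TERMS = [
    "Generally Accepted Accounting Practice",
    "International Standards on Auditing",
]


def protect_terms(text, terms=ESTABLISHED_TERMS):
    """Join the words of each established term with underscores."""
    # Longest terms first, so a long term wins over a shorter one it contains.
    for term in sorted(terms, key=len, reverse=True):
        words = term.split()
        # Allow any whitespace (including line breaks) between the words.
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        joined = "_".join(words)
        # A function replacement avoids backslash handling in the replacement.
        text = re.sub(pattern, lambda m: joined, text, flags=re.IGNORECASE)
    return text


def tokenize(text):
    """Assumed stand-in tokenizer: runs of letters, digits and underscores."""
    return re.findall(r"\w+", text)


def restore_terms(tokens):
    """Swap underscores back to spaces once tokenization is done."""
    return [token.replace("_", " ") for token in tokens]


text = "The audit followed International Standards on Auditing."
tokens = restore_terms(tokenize(protect_terms(text)))
print(tokens)
```

Two caveats with this sketch: matching is case-insensitive and rewrites each hit to the spelling given in the term list, and the final step also turns any underscore that was already in the source text into a space, so a rarer joining character may be safer if the documents contain underscores.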