Due to recent updates, all users are required to create an Altair One account to login to the RapidMiner community. Click the Register button to create your account using the same email that you have previously used to login to the RapidMiner community. This will ensure that any previously created content will be synced to your Altair One account. Once you login, you will be asked to provide a username that identifies you to other Community users. Email us at Community with questions.
Define established terms
Limegreenman900
Member Posts: 6 Contributor II
Hi everyone,
does anybody know whether RM has an operator or a setting inside an operator where I can define established termns? I am currently extracting text from HTML files with the "Cut Document" Operator and inside that I am using the "Extract Content" Operator from the Web Mining extensions, after that I am doing some routine things like "Replace Tokens", "Tokenize" and "Extract Token Number". As I do have some terms in my text that are normally seen as an established term I wondered whether this is possible in RM?
Example:
Generally Accepted Accounting Practice
International Standards on Auditing
....
Until now, due to tokenization, every word is a single token but it would be great to have these expressions be seen as one token.
I know I could use the "Replace Token" operator and replace every term with an abbreviation like "International Standards on Auditing" = "ISA" but that is not what I want.
Any help appreciated!
does anybody know whether RM has an operator or a setting inside an operator where I can define established termns? I am currently extracting text from HTML files with the "Cut Document" Operator and inside that I am using the "Extract Content" Operator from the Web Mining extensions, after that I am doing some routine things like "Replace Tokens", "Tokenize" and "Extract Token Number". As I do have some terms in my text that are normally seen as an established term I wondered whether this is possible in RM?
Example:
Generally Accepted Accounting Practice
International Standards on Auditing
....
Until now, due to tokenization, every word is a single token but it would be great to have these expressions be seen as one token.
I know I could use the "Replace Token" operator and replace every term with an abbreviation like "International Standards on Auditing" = "ISA" but that is not what I want.
Any help appreciated!
0
Answers
So:
Generally Accepted Accounting Practice = Generally_Accepted_Accounting_Practice
International Standards on Auditing = International_Standards_on_Auditing
At the end of your processing you can then run a replace tokens again and swap out the '_' for a ' ' so it will return to the established term again.
Thanks for your hint on that!