Basic Question - replacing words in a document

apaul · March 2018

Hi Experts,

I have a set of documents and would like to replace some multi-word phrases with a single word before tokenizing.

ex. follow up --> follow-up

Set up --> Setup

How do I do this?

Thanks,

Aji

Telcontar120 · March 2018

I don't know an easy way to do this in RapidMiner before you process the document without a lot of very complicated regular expression matching. But it is easy to do while you process the document: just add Generate n-grams of length 2 to your process, and then use Replace Tokens after that to substitute the ones that you want.
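For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of "pair up adjacent tokens, then replace the pairs you care about". It is a simplification and not what the operators do internally: the function name `merge_bigrams` and the `REPLACEMENTS` mapping are made up for illustration, and unlike an n-gram operator it does not keep the original single tokens alongside the merged one.

```python
def merge_bigrams(tokens, replacements):
    """Replace adjacent token pairs found in `replacements` with one token."""
    merged = []
    i = 0
    while i < len(tokens):
        # Look at the current token and the next one, case-insensitively.
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if len(pair) == 2 and pair in replacements:
            merged.append(replacements[pair])
            i += 2  # both tokens of the pair are consumed
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical mapping, based on the examples in the question.
REPLACEMENTS = {("follow", "up"): "follow-up", ("set", "up"): "setup"}

tokens = "Please set up a follow up call".split()
print(merge_bigrams(tokens, REPLACEMENTS))
# ['Please', 'setup', 'a', 'follow-up', 'call']
```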

kayman · March 2018

I would suggest using the Replace (Dictionary) operator. You first create a simple CSV with 2 columns, the first containing your current word(s) and the second your replacement word. If there are not too many words to replace, you don't have to worry too much about regular expressions; just include all the variations. That's still manageable. Just watch out for partial replacements, so choose your words wisely. You could use a very basic regex construct, the word boundary (\b when used in an operator, \\b when used in a text file), to ensure the words are matched as a whole.

This would be something like:

From                 To
\\bMy word\\b        my-word
\\bsample-data\\b    sample data

Next, add this CSV to the operator, set the from and to column headers, tick the 'use regular expressions' box, and off you go.

If you want to cover all possible typos etc., then more complex regular expressions can be an option, but if not, keep it simple and use the boundary character to keep your words complete.
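To show what the word boundary buys you, here is a small Python sketch of a dictionary-driven replacement. It is only an illustration of the regex behaviour, not the operator itself; `replace_phrases` and `DICTIONARY` are hypothetical names, and the case-insensitive matching is an assumption added here so that "Set up" and "set up" are both caught.

```python
import re


def replace_phrases(text, dictionary):
    """Replace whole-word occurrences of each phrase with its replacement."""
    for phrase, replacement in dictionary.items():
        # \b on both sides keeps the phrase from matching inside longer words.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text


# Hypothetical dictionary: first column "From", second column "To".
DICTIONARY = {"follow up": "follow-up", "set up": "setup"}

print(replace_phrases("Set up a follow up, not a follow upper.", DICTIONARY))
# setup a follow-up, not a follow upper.
```

Note that "follow upper" is left alone because there is no word boundary after "up" in that position, which is exactly the partial replacement the boundary is meant to prevent.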

apaul · March 2018

Thanks kayman!

Nice suggestion, but it is not working as expected: the words are not being replaced. Also, how do I tokenize an example set?

kayman · March 2018

What input does your document process operator get? Seems like there is something missing there.

Apart from that, good point, as I overlooked the fact that you are working with documents and not data. Unfortunately, the text operators don't have a real dictionary-driven replace option for this, though you could abuse Stem (Dictionary) for that to some extent.

One option could be to use the Documents to Data operator. This will convert your full document to a data value, so you can then use Replace (Dictionary). Next, you can use Process Documents from Data to process your cleaned content. It is important to ensure your data field is defined as text and not as nominal, as otherwise the process will happily ignore your content.

A typical tokenization workflow would be something like transform cases -> tokenize (on words, spaces, line breaks, or whatever makes sense) -> stopwords -> stemming (if readability is less important) inside the Process Documents from Data operator, with pruning settings found by trial and error.
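As a rough picture of that chain, here is a minimal Python sketch of transform cases -> tokenize -> stopwords. It is not a RapidMiner process: `preprocess` and the tiny `STOPWORDS` set are made up for illustration, stemming and pruning are left out, and the token pattern deliberately keeps hyphenated words such as "follow-up" in one piece, which is an assumption of this sketch.

```python
import re

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {"a", "the", "to", "and", "of", "is"}


def preprocess(text):
    """Lowercase, tokenize, and drop stopwords."""
    lowered = text.lower()  # transform cases
    # Tokenize: runs of letters/digits, optionally joined by single hyphens.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", lowered)
    return [t for t in tokens if t not in STOPWORDS]  # filter stopwords


print(preprocess("The setup is done and a follow-up call is planned."))
# ['setup', 'done', 'follow-up', 'call', 'planned']
```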
