
getting distinct tokens

kayman Member Posts: 662 Unicorn
edited November 2018 in Help

I'm using the text processing modules, and the idea is to get unique keywords per document: for every document I have, I want an attribute containing the keywords for the full text field.

 

My workflow is fairly straightforward: I loop through all the examples, convert the data to documents, filter on some relevant POS tags, remove all stopwords etc., convert back to data, and append everything together. This works fine, but the result still contains duplicates, as in the example below:

 

original : this is just a test sentence to do a test to check the process

keywords : test sentence test check process

 

wanted result : test sentence check process

 

How can I get rid of the duplicate tokens? I could probably do it with some monster regex, but I guess that would be fairly expensive. Are there better ways to achieve this?

Answers

  • bhupendra_patil Administrator, Employee, Member Posts: 168 RM Data Scientist

    hello @kayman

     

    Maybe you already know this, but just confirming:

    I think that should happen automatically. What are your settings for your tokenize step?

    Also, is there a difference in case in the output? Unless you use the "Transform Cases" operator, "text" and "Text" are considered different tokens.

  • kayman Member Posts: 662 Unicorn

    Hi, I'm aware of the case difference; my workflow already converts everything to lowercase, tokenizes on spaces, and so on. The output will however contain all tokens (which is probably logical, as the document processor needs to be able to count how many times a given word is used in a given document).

    In the meantime I found a workaround myself, using the Wordlist to Data operator. First I loop through all the examples, convert each example to a document, clean the data, generate a word list, convert the word list to data, and loop through this (word list) example set. This gives me the unique values, which I then convert back to an example. In the end I get my original example set with my keyword attribute.

    It does the trick but is fairly slow (20K documents per hour), so if anybody has a more efficient way to do this I'd be happy to learn about it.
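
    For illustration, the per-document logic boils down to roughly this (a plain Python sketch, not the actual RapidMiner operators, and assuming the tokens have already been lowercased and filtered):

        filtered = "test sentence test check process"   # tokens after POS/stopword filtering
        tokens = filtered.split()                        # tokenize on spaces
        keywords = set(tokens)                           # a word list keeps each token only once
        print(" ".join(keywords))                        # e.g. "check process sentence test" - order is lost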

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn

    Hi,

     

    perhaps this will help you to speed up the concatenation:

    Instead of looping over the word list, rather use the Aggregation operator with the concat aggregation function after transforming the word list into data. This should be way faster.

     

    Greetings,

      Sebastian
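
    P.S. In case a concrete picture helps: a rough Python/pandas sketch of the same idea (the column names are made up for illustration) would group the word-list rows by document and concatenate them in one pass instead of looping:

        import pandas as pd

        # Hypothetical "word list as data": one row per (document id, word) pair.
        wordlist = pd.DataFrame({
            "doc_id": [1, 1, 1, 1, 2, 2],
            "word":   ["test", "sentence", "check", "process", "foo", "bar"],
        })

        # Concatenate each document's words into a single keyword string,
        # which is roughly what an aggregate-with-concat step does without a loop.
        keywords = wordlist.groupby("doc_id")["word"].agg(" ".join)
        print(keywords)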

     

  • kayman Member Posts: 662 Unicorn

    Actually I found a much better and faster solution, so here it is in case anybody ever has the same issue:

    Use a regular expression like this one: \b(\w+)\s(?=.*\b\1:?) and replace it with a space. This keeps only the unique (distinct) words in any given string. Note that only the last match of a given word is kept, so if the order is important you need to handle it with care.
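
    In case it helps anyone, here is the same replacement sketched in plain Python for illustration:

        import re

        filtered = "test sentence test check process"     # tokens after the filtering steps

        # A word (plus its trailing space) is dropped whenever the same word
        # occurs again later in the string, so only the last occurrence survives.
        deduped = re.sub(r"\b(\w+)\s(?=.*\b\1:?)", " ", filtered)
        deduped = " ".join(deduped.split())                # tidy up the spaces left behind
        print(deduped)                                     # sentence test check process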

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn

    Wow, that's an impressive expression...

    Doesn't that suffer heavily on long texts in terms of runtime? I would envision a special operator being several thousand percent faster :)

     

    Greetings,

      Sebastian

  • kayman Member Posts: 662 Unicorn

    Yeah, I'd love to have a dedicated component too. Shouldn't be too hard to add a 'remove duplicates' block in the next text analysis update (hint hint).

    Or maybe I just didn't grasp your previous solution completely; I was struggling with it and in the end I gave up. Until now, as I needed it again :-)

    For now it works fine for me, as we have relatively small sets and strings (a couple of thousand records with a few sentences each), so it works out OK. It's better and faster than my original approach at least. And if anybody has a better approach I'm always interested.
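
    For what it's worth, one simple non-regex alternative (a plain Python sketch, so it would need a scripting step rather than a replace): dict.fromkeys drops duplicates while keeping the first occurrence of each token, so the original order is preserved:

        filtered = "test sentence test check process"
        # dict keys keep insertion order (Python 3.7+), so the first occurrence wins
        deduped = " ".join(dict.fromkeys(filtered.split()))
        print(deduped)   # test sentence check process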
