
getting distinct tokens

kayman Member Posts: 662 Unicorn
edited November 2018 in Help

I'm using the text processing modules, and the idea is to get unique keywords per document: for every document I have, I want an attribute containing the keywords for the full text field.

 

My workflow is fairly straightforward: I loop through all the examples, convert the data to documents, filter on some relevant POS tags, remove all stopwords etc., convert back to data, and append everything together. This works fine, but the result still contains duplicates, as in the example below:

 

original : this is just a test sentence to do a test to check the process

keywords : test sentence test check process

 

wanted result : test sentence check process

 

How can I get rid of the duplicate tokens? I could probably do it with some monster regex, but I guess that would be fairly expensive. Are there better ways to achieve this?

Answers

  • bhupendra_patil Administrator, Employee, Member Posts: 168 RM Data Scientist

    hello @kayman

     

    Maybe you already know this, but just confirming:

    I think that should happen automatically. What are your settings for your tokenize step?

    Also, is there a difference in case in the output? Unless you use the "Transform Cases" operator, "text" and "Text" are considered different tokens.

  • kayman Member Posts: 662 Unicorn

    Hi, I'm aware of the case difference; my workflow already converts everything to lowercase, tokenizes on spaces, and so on. The output will however contain all tokens (which is probably logical, as the document processor needs to be able to count how many times a given word is used in a given document).

    In the meantime I found a workaround myself, using the Wordlist to Data operator. First I loop through all the examples, convert each example to a document, clean the data, generate a word list, convert the word list to data, and loop through this (word list) example set. This gives me the unique values, which I then convert back to an example. In the end I get my original example set with my keyword attribute.

    It does the trick but is fairly slow (20K documents per hour), so if anybody has a more efficient way to do this I'd be happy to learn about it.
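
    For illustration, the per-document logic boils down to roughly this (a plain Python sketch, not the actual RapidMiner operators, and assuming the tokens have already been lowercased and filtered):

        filtered = "test sentence test check process"   # tokens after POS/stopword filtering
        tokens = filtered.split()                        # tokenize on spaces
        keywords = set(tokens)                           # a word list keeps each token only once
        print(" ".join(keywords))                        # e.g. "check process sentence test" - order is lost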

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn

    Hi,

     

    perhaps this will help you to speed up the concatenation:

    Instead of looping over the word list, rather use the Aggregation operator with the concat aggregation function after transforming the word list into data. This should be way faster.

     

    Greetings,

      Sebastian
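
    P.S. In case a concrete picture helps: a rough Python/pandas sketch of the same idea (the column names are made up for illustration) would group the word-list rows by document and concatenate them in one pass instead of looping:

        import pandas as pd

        # Hypothetical "word list as data": one row per (document id, word) pair.
        wordlist = pd.DataFrame({
            "doc_id": [1, 1, 1, 1, 2, 2],
            "word":   ["test", "sentence", "check", "process", "foo", "bar"],
        })

        # Concatenate each document's words into a single keyword string,
        # which is roughly what an aggregate-with-concat step does without a loop.
        keywords = wordlist.groupby("doc_id")["word"].agg(" ".join)
        print(keywords)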

     

  • kayman Member Posts: 662 Unicorn

    Actually I found a much better and faster solution, so here it is in case anybody ever has the same issue:

    Use a regular expression like this one: \b(\w+)\s(?=.*\b\1:?) and replace it with a space. This keeps only the unique (distinct) words in any given string. Note that only the last match of a given word is kept, so if the order is important you need to handle it with care.
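
    In case it helps anyone, here is the same replacement sketched in plain Python for illustration:

        import re

        filtered = "test sentence test check process"     # tokens after the filtering steps

        # A word (plus its trailing space) is dropped whenever the same word
        # occurs again later in the string, so only the last occurrence survives.
        deduped = re.sub(r"\b(\w+)\s(?=.*\b\1:?)", " ", filtered)
        deduped = " ".join(deduped.split())                # tidy up the spaces left behind
        print(deduped)                                     # sentence test check process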

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn

    Wow, that's an impressive expression...

    Doesn't that suffer heavily on long texts in terms of runtime? I would envision a special operator being several thousand percent faster :)

     

    Greetings,

      Sebastian

  • kayman Member Posts: 662 Unicorn

    Yeah, I'd love to have a dedicated component too. Shouldn't be too hard to add a 'remove duplicates' block in the next text analysis update (hint hint).

    Or maybe I just didn't grasp your previous solution completely; I was struggling with it and in the end I gave up. Until now, as I needed it again :-)

    For now it works fine for me, as we have relatively small sets and strings (a couple of thousand records with a few sentences each), so it works out OK. It's better and faster than my original approach at least. And if anybody has a better approach I'm always interested.
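
    For what it's worth, one simple non-regex alternative (a plain Python sketch, so it would need a scripting step rather than a replace): dict.fromkeys drops duplicates while keeping the first occurrence of each token, so the original order is preserved:

        filtered = "test sentence test check process"
        # dict keys keep insertion order (Python 3.7+), so the first occurrence wins
        deduped = " ".join(dict.fromkeys(filtered.split()))
        print(deduped)   # test sentence check process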
