Getting distinct tokens
I'm using the text processing modules, and the idea is to get unique keywords per document: for every document I have, I want an attribute containing the keywords from the full text field.
My workflow is fairly straightforward. I loop through all the examples, convert data to document, filter on some relevant POS tags, remove all stopwords etc., convert back to data, and append them all together. This works fine, but the result still contains duplicates, as in the example below:
original : this is just a test sentence to do a test to check the process
keywords : test sentence test check process
wanted result : test sentence check process
How can I get rid of the duplicate tokens? I could probably do it with some monster regex, but I guess that would be fairly expensive. Is there a better way to achieve this?
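
To make the wanted behavior concrete, here is a minimal sketch in plain Python, outside the text processing operators. The function name dedupe_tokens is made up for illustration, and it assumes the keywords arrive as a single whitespace-separated string. It keeps the first occurrence of each token and preserves the original order, without any regex:

```python
def dedupe_tokens(keywords: str) -> str:
    """Drop repeated tokens, keeping the first occurrence of each in order."""
    # Hypothetical helper: assumes tokens are separated by whitespace.
    # dict keys are unique and keep insertion order (Python 3.7+),
    # so this removes duplicates in a single pass over the tokens.
    unique = dict.fromkeys(keywords.split())
    return " ".join(unique)


print(dedupe_tokens("test sentence test check process"))
# prints: test sentence check process
```

Each token is looked up once in a hash table, so the cost grows roughly linearly with the number of tokens, which should be cheaper than a backtracking regex on long texts.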