Calculate number of unique words in text and number of repeating paragraphs
How can I calculate the number of unique words (tokens without stopwords) in each text document? Also, how can I find the number of repeating sentences or paragraphs? Are there any operators for this in the Text Mining extension?
MartinLiebig (Administrator, Moderator, Employee; RM Data Scientist)
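You can simply use a Process Documents operator with binary term occurrences and then use Generate Aggregation to get the sum of each row. Because every word that occurs in a document is stored as 1, the row sum is the number of distinct words in that document.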
~Martin- Head of Data Science Services at RapidMiner -
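For readers who want to check the idea outside RapidMiner, here is a minimal Python sketch of the same logic: binary occurrences per document, then a row sum. The stopword list, the tokenizer regex, and the function names are assumptions made for illustration; they are not what the RapidMiner operators use internally.

```python
import re

# Illustrative stopword list (an assumption, not RapidMiner's built-in list).
STOPWORDS = {"the", "a", "an", "and", "of", "is", "in", "to"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def unique_word_count(text, stopwords=STOPWORDS):
    """Count distinct non-stopword tokens.

    Building a set is the same as binary occurrences followed by a row sum:
    each word contributes 1 no matter how often it appears.
    """
    return len({token for token in tokenize(text) if token not in stopwords})


# Hypothetical example documents.
docs = {
    "doc1": "The cat and the dog. The cat sat.",
    "doc2": "A dog is a dog.",
}
counts = {name: unique_word_count(text) for name, text in docs.items()}
```

Here `counts` maps each document name to its number of distinct words after stopword removal.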
Thank you, I think that will work. And what about repeating sentences? I tried the similarity measure first, but my documents are too long, so it will not work.
Simply tokenize on linguistic sentences and do the same trick as for words.
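The same trick at sentence level can be sketched in Python as follows. The sentence splitter here is a naive regex and only an assumption; RapidMiner's linguistic sentence tokenization is more careful about abbreviations and similar cases. The function names are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def normalize(sentence):
    """Lowercase and drop punctuation so trivial differences do not matter."""
    return " ".join(re.findall(r"[a-z0-9']+", sentence.lower()))


def repeated_sentences(text):
    """Return each normalized sentence that occurs more than once, with its count."""
    counts = Counter(normalize(s) for s in split_sentences(text))
    return {sentence: n for sentence, n in counts.items() if n > 1 and sentence}


# Hypothetical example text.
text = "I like tea. You like coffee. I like tea! We agree."
repeats = repeated_sentences(text)
```

The number of repeating sentences is then `len(repeats)`. To work on paragraphs instead, the same counting can be applied after splitting on blank lines.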
Thank you for the answer. I have a follow-up question: if the sentences are not completely the same but very similar (e.g. 2-3 words are changed), how could I find the repeating text parts then?
You are always allowed to ask questions, that's what we are here for. The only question is whether we can answer them.
I would create a similarity/synonym dictionary. I would use WordList to Data, take the sentences as input for a second Process Documents operator, tokenize on words, and calculate the cross distances on the result. There I would require a high cosine similarity to define a "synonym". This dictionary can then be used to replace text in the original document.
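As a rough illustration of the cosine-similarity step, the following Python sketch compares every pair of sentences on their word counts and keeps the pairs above a threshold. It is a simplified stand-in for the RapidMiner process described above, not a reproduction of it. The threshold of 0.8 and the function names are assumptions, and the pairwise loop is quadratic in the number of sentences, so very long documents would need batching or a smarter index.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bag_of_words(sentence):
    """Turn a sentence into a word-count vector."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def near_duplicates(sentences, threshold=0.8):
    """Return (i, j, similarity) for sentence pairs at or above the threshold.

    The default threshold is an assumed value; tune it on your own data.
    """
    vectors = [bag_of_words(s) for s in sentences]
    pairs = []
    for i, j in combinations(range(len(sentences)), 2):
        similarity = cosine_similarity(vectors[i], vectors[j])
        if similarity >= threshold:
            pairs.append((i, j, similarity))
    return pairs


# Hypothetical example sentences: the first two differ by one word.
sentences = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox leaps over the lazy dog",
    "Completely unrelated sentence here",
]
pairs = near_duplicates(sentences)
```

Each returned pair can serve as a dictionary entry: replace the second sentence with the first in the original text, then count exact repeats as before.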