Calculate number of unique words in text and number of repeating paragraphs

In777 · November 2016

How can I calculate the number of unique words (tokens without stopwords) in each text document? Also, how can I find the number of repeating sentences or paragraphs? Are there any operators for this in the Text Mining extension?

MartinLiebig · November 2016

Hi,

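You can simply use a Process Documents operator with binary term occurrences and use Generate Aggregation afterwards to get the sum of each row. With binary occurrences every word counts once per document, so the row sum is the number of unique words in that document.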
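
For readers who want to check the numbers outside RapidMiner, the same idea can be sketched in plain Python. This is only an illustration, not the operator's implementation: the stopword list, the regex tokenizer, and the names `STOPWORDS`, `tokenize`, and `unique_word_count` are assumptions made for this example.

```python
import re

# Tiny illustrative stopword list (an assumption; use a fuller list in practice).
STOPWORDS = {"the", "a", "an", "and", "of", "is", "in", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def unique_word_count(text, stopwords=STOPWORDS):
    """Count distinct non-stopword tokens in one document.

    This matches summing one row of a binary term-occurrence matrix,
    where each word contributes 1 no matter how often it appears.
    """
    return len({tok for tok in tokenize(text) if tok not in stopwords})

docs = {
    "doc1": "The cat sat on the mat and the cat slept.",
    "doc2": "A dog is a dog.",
}
counts = {name: unique_word_count(text) for name, text in docs.items()}
print(counts)
```

The set comprehension plays the role of the binary occurrence vector, and `len` plays the role of the row sum.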

~Martin

In777 · November 2016

Thank you, I think that will work. And what about repeating sentences? I tried the similarity measure first, but my documents are too long, so it does not work.

MartinLiebig · November 2016

Hi,

Simply tokenize on linguistic sentences and do the same trick as for words.
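
A rough Python equivalent of the sentence trick, for illustration only. The splitter below is a naive regex, not RapidMiner's linguistic sentence tokenizer, and the helper names `split_sentences`, `normalize`, and `repeated_sentences` are hypothetical.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def normalize(sentence):
    """Lowercase and drop punctuation so trivial differences do not matter."""
    return " ".join(re.findall(r"[a-z0-9]+", sentence.lower()))

def repeated_sentences(text):
    """Return each normalized sentence that occurs more than once, with its count."""
    counts = Counter(normalize(s) for s in split_sentences(text))
    return {s: n for s, n in counts.items() if s and n > 1}

text = (
    "We thank the reviewers. Results are shown below. "
    "We thank the reviewers. Done!"
)
repeats = repeated_sentences(text)
print(repeats)
```

Repeated paragraphs can be counted the same way by splitting on blank lines (for example with `re.split(r"\n\s*\n", text)`) instead of sentence boundaries.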

~Martin

In777 · December 2016

Hi Martin,

Thank you for the answer. I have a follow-up question: if the sentences are not completely the same but very similar (e.g. 2-3 words are changed), how could I find the repeating text parts then?

MartinLiebig · December 2016

Hi In777,

You are always allowed to ask questions; that's what we are here for. The only question is whether we can answer them.

I would create a similarity/synonym dictionary. I would use WordList to Data, take the sentences as input for a second Process Documents, tokenize on words, and calculate Cross Distances on the result. There I would use a high cosine similarity to define a "synonym". This dictionary can then be used to replace text in the original document.
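
The cosine-similarity step can be sketched in Python as follows. Note that this is a simplified variant and not the process described above: it compares sentence word-count vectors directly and reports pairs above a threshold, instead of building a replacement dictionary. The threshold of 0.8 and the names `term_vector`, `cosine`, and `near_duplicates` are assumptions for the example.

```python
import math
import re
from collections import Counter
from itertools import combinations

def term_vector(sentence):
    """Bag-of-words vector: token -> number of occurrences."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def near_duplicates(sentences, threshold=0.8):
    """Return (i, j, similarity) for every sentence pair at or above the threshold."""
    vectors = [term_vector(s) for s in sentences]
    pairs = []
    for i, j in combinations(range(len(sentences)), 2):
        sim = cosine(vectors[i], vectors[j])
        if sim >= threshold:
            pairs.append((i, j, round(sim, 3)))
    return pairs

sentences = [
    "The quarterly report shows strong growth in the retail sector.",
    "The quarterly report shows solid growth in the retail sector.",
    "Weather was mild during the conference.",
]
pairs = near_duplicates(sentences)
print(pairs)
```

Comparing every pair is quadratic in the number of sentences, so for very long documents it may be worth comparing only sentences of similar length or sentences that share a rare word.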

~Martin
