Calculate number of unique words in text and number of repeating paragraphs

In777In777 Member Posts: 29 Contributor I
edited November 2018 in Help

How can I calculate the number of unique mentions of each word (tokens without stopwords) in each text document? Also, how can I find the number of repeating sentences or paragraphs? Are there any operators in the text mining extension?

Best Answer

  • mschmitzmschmitz Posts: 2,203  RM Data Scientist
    Solution Accepted

    Hi,

     

you can simply use a Process Documents operator with binary term occurrences and then use Generate Aggregation to get the sum of each row.

     

    ~Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany

Answers

  • In777In777 Member Posts: 29 Contributor I

Thank you, I think that will work. And what about repeating sentences? I tried a similarity measure first, but my documents are too long, so it will not work.

  • mschmitzmschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,203  RM Data Scientist

    Hi,

     

Simply tokenize on linguistic sentences and apply the same trick as for words.

     

    ~Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
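The same trick at the sentence level — split the document into sentences, treat each sentence as a token, and count occurrences — can be sketched like this (the naive punctuation-based splitter is an assumption; RapidMiner's linguistic sentence tokenizer is more robust):

```python
import re
from collections import Counter

def repeated_sentences(document: str) -> dict:
    """Split a document into sentences and return those that occur
    more than once, with their occurrence counts."""
    # Naive sentence split on ., ! and ? -- a stand-in for
    # linguistic sentence tokenization.
    sentences = [s.strip().lower()
                 for s in re.split(r"[.!?]+", document)
                 if s.strip()]
    counts = Counter(sentences)
    return {s: n for s, n in counts.items() if n > 1}

text = "It rains. The sun shines. It rains. A new day."
reps = repeated_sentences(text)
```

Exact repeats fall out of the `Counter` directly; anything appearing only once is filtered away.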
  • In777In777 Member Posts: 29 Contributor I

    Hi Martin,

     

Thank you for the answer. I have a follow-up question: if the sentences are not completely the same, but very similar (e.g. 2-3 words are changed), how could I find the repeating text parts then?

  • mschmitzmschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,203  RM Data Scientist

    Hi ln777,

you are always allowed to ask questions - that's what we are here for :). The only question is whether we can answer them.

     

I would create a similarity/synonym dictionary. I would go for WordList to Data, take the sentences as input for a second Process Documents, tokenize on words, and calculate a Cross Distance on the result. There I would use a high cosine similarity threshold to define a "synonym". This dictionary can then be used to replace texts in the original document.

     

    ~Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
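The core of this approach — turning each sentence into a word-count vector and flagging pairs whose cosine similarity is high — can be sketched as follows. The threshold of 0.7 is an arbitrary illustrative choice, not a RapidMiner default:

```python
import re
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def near_duplicates(sentences, threshold=0.7):
    """Return index pairs of sentences whose word-vector cosine
    similarity meets the threshold -- near-repeats where only a
    few words differ still score high."""
    vecs = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    pairs = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                pairs.append((i, j))
    return pairs

sents = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox leaps over the lazy dog",
    "completely unrelated sentence here",
]
pairs = near_duplicates(sents)
```

Two sentences differing in a single word keep a cosine similarity near 1, so they are grouped together; unrelated sentences share no tokens and score 0. The flagged pairs play the role of the "synonym" dictionary entries used to replace text in the original documents.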