Calculate number of unique words in text and number of repeating paragraphs

In777 Member Posts: 29 Contributor I
edited November 2018 in Help

How can I count the unique words (tokens excluding stopwords) in each text document? Also, how can I find the number of repeating sentences or paragraphs? Are there any operators for this in the Text Mining extension?

Best Answer

  • mschmitz Posts: 2,156  RM Data Scientist
    Solution Accepted

    Hi,

     

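    You can simply use a Process Documents operator with binary occurrences and then use Generate Aggregation to get the sum of each row. With binary occurrences every word column is 1 if the word appears in the document and 0 otherwise, so the row sum is the number of unique words in that document.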

     

    ~Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
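
For readers who want to check the idea outside RapidMiner, here is a minimal Python sketch of the same logic: build a binary occurrence row per document and sum it. It is not the RapidMiner process itself. The `STOPWORDS` set, the regex tokenizer, and the function names are assumptions made for illustration only.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "is", "of", "and", "in", "to"}


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in STOPWORDS]


def unique_word_counts(documents):
    """Return the number of unique non-stopword tokens for each document."""
    # Vocabulary over all documents: one column per word, as in a word vector.
    vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})
    counts = []
    for doc in documents:
        present = set(tokenize(doc))
        # Binary occurrence row: 1 if the word appears in this document, else 0.
        row = [1 if word in present else 0 for word in vocabulary]
        # Summing the binary row gives the number of unique words.
        counts.append(sum(row))
    return counts


docs = ["The cat sat on the mat and the cat slept.", "A dog is a dog."]
print(unique_word_counts(docs))  # prints [5, 1]
```

The first document contributes cat, sat, on, mat, and slept; the second contributes only dog, because repeated words count once.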

Answers

  • In777 Member Posts: 29 Contributor I

    Thank you, I think that will work. What about repeating sentences? I tried a similarity measure first, but my documents are too long, so it does not work.

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,156  RM Data Scientist

    Hi,

     

    Simply tokenize on linguistic sentences and do the same trick as for words.

     

    ~Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
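
The sentence-level variant of the trick can be sketched in Python as follows: split the text into sentences, normalize them, and count exact repeats. The regex-based sentence splitter is a rough assumption that stands in for a proper linguistic sentence tokenizer, and the function names are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split on sentence-ending punctuation followed by whitespace (a rough heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Normalize case and collapse whitespace so trivially different copies match.
    return [" ".join(p.lower().split()) for p in parts if p.strip()]


def repeated_sentences(text):
    """Return each sentence that occurs more than once, with its count."""
    counts = Counter(split_sentences(text))
    return {sentence: n for sentence, n in counts.items() if n > 1}


text = "It rains. The sun is out. It rains. It  RAINS."
print(repeated_sentences(text))  # prints {'it rains.': 3}
```

Because this only counts each sentence once per pass, it scales linearly with document length, which matters for long documents. Passing paragraphs instead of sentences to the counter works the same way.
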
  • In777 Member Posts: 29 Contributor I

    Hi Martin,

     

    Thank you for the answer. I have a follow-up question: if the sentences are not exactly the same but very similar (e.g. 2-3 words are changed), how can I find the repeating text parts?

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,156  RM Data Scientist

    Hi In777,

    You are always allowed to ask questions - that's what we are here for :). The only question is whether we can answer them.

     

    I would create a similarity/synonym dictionary. I would use WordList to Data, take the sentences as input for a second Process Documents, tokenize on words, and calculate a cross distance on the result. There I would use a high cosine similarity to define a "synonym". This dictionary can then be used to replace text in the original document.

     

    ~Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
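
The cosine-similarity step can be illustrated with a small Python sketch. Note that this is a simplification and not the dictionary-building process described above: it compares sentences directly as bags of words and flags pairs above a threshold. The threshold of 0.8, the tokenizer, and all function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bag_of_words(sentence):
    """Term-frequency vector of a sentence, stored as a Counter."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def near_duplicates(sentences, threshold=0.8):
    """Return (i, j, similarity) for every sentence pair at or above the threshold."""
    vectors = [bag_of_words(s) for s in sentences]
    pairs = []
    for i, j in combinations(range(len(sentences)), 2):
        sim = cosine_similarity(vectors[i], vectors[j])
        if sim >= threshold:
            pairs.append((i, j, sim))
    return pairs


sentences = [
    "The quarterly revenue increased by ten percent in Europe.",
    "The quarterly revenue increased by five percent in Europe.",
    "We hired three new engineers.",
]
for i, j, sim in near_duplicates(sentences):
    print(i, j, round(sim, 3))  # prints 0 1 0.889
```

The pairwise comparison is quadratic in the number of sentences, so for very long documents it may help to first remove exact duplicates and only compare the remaining distinct sentences.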