
text mining and word counting problem

PatrickHou Member Posts: 6 Contributor I
edited December 2018 in Help

Hi 

 

I'm new to RapidMiner and I'm currently analyzing several txt documents. Let's say I have already found the 20 most frequently appearing words and I want to know (and only know) how many times they show up in each document. Can someone give me some ideas?

 

I also have a problem: "united", "states" and "united_states" all appear in my results, but I can't simply replace them, because not every "united" is related to "united states". How can I pull out "united_states" as its own term without it also being counted under "united" and "states"?

 

Thanks 

Patrick

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    For your first question: use Process Documents, supply a specific wordlist containing just your 20 words, and then compute the word vector using Term Occurrences.

    For your second question, you can use Generate N-Grams after Tokenize (and your other text preprocessing), which will give you a separate "united_states" token distinct from both "united" and "states".

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
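
    A rough Python sketch of what those two steps amount to (an illustration only, not RapidMiner itself; the folder name and the short wordlist are placeholder assumptions, and your 20 most frequent words would go in WORDLIST):

        import re
        from collections import Counter
        from pathlib import Path

        WORDLIST = ["united", "states", "government"]  # placeholder: your 20 most frequent words
        DOC_DIR = Path("documents")                    # placeholder: folder holding the ~50 .txt files

        def tokenize(text):
            # lowercase and split on non-letters, roughly what a simple Tokenize step does
            return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

        def bigrams(tokens):
            # join neighbouring tokens with "_", roughly what Generate N-Grams (length 2) does
            return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

        for path in sorted(DOC_DIR.glob("*.txt")):
            tokens = tokenize(path.read_text(encoding="utf-8"))
            counts = Counter(tokens + bigrams(tokens))
            # one row of term occurrences per document, restricted to the terms of interest
            row = {term: counts[term] for term in WORDLIST + ["united_states"]}
            print(path.name, row)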
  • PatrickHou Member Posts: 6 Contributor I

    Thanks for the reply!

     

    I have already used Term Occurrences, but that gave me the overall occurrence count for each word, and I want to know each word's occurrences per document (I have about 50 files).

     

    For the second question, does that mean those "united" and "states" tokens are not related to "united_states"?

     

    Patrick

  • PatrickHou Member Posts: 6 Contributor I

    I looked into the documentation, and it seems that when I use the n-Gram operator it builds n-grams for all words, whether or not they are actually related. That means I need to filter or prune those unwanted n-grams, I think? But how?

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    You might want to post your process XML (see the instructions in the right sidebar), since the count should be generated for each document assuming each document is a separate entity in your input data.  Do you have the "create word vector" parameter checked?

    The single counts are not exclusive of the n-gram, but the exclusive uses can be easily calculated via subtraction.  So if there are 10 total occurrences of "united" and 6 occurrences of "united_states" then you know that 4 of the "united" occurrences were not associated with "united_states".

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
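
    To make the subtraction concrete, here is a small Python sketch over made-up per-document counts shaped like the word vector described above (one row per document, one column per term; the file names and numbers are invented for illustration):

        # hypothetical term-occurrence rows, one per document
        doc_counts = {
            "doc01.txt": {"united": 10, "states": 7, "united_states": 6},
            "doc02.txt": {"united": 3,  "states": 3, "united_states": 3},
        }

        # occurrences of "united" that are not part of "united_states"
        for doc, counts in doc_counts.items():
            print(doc, counts["united"] - counts["united_states"])
        # doc01.txt 4
        # doc02.txt 0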
  • PatrickHou Member Posts: 6 Contributor I

    I found that the Stopwords (Dictionary) operator can do the trick if I manually add the words I don't need inside Process Documents. For the small case I'm working on that's enough, but I'll still look for operators that may deal with this problem more directly.

     

    Thank you.
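
    The same trick sketched in Python: after the n-grams are generated, drop the single tokens you no longer want by checking them against a hand-made stoplist (the stoplist and the example tokens below are made-up placeholders):

        STOPLIST = {"united", "states"}  # placeholder: single words to discard after n-gram generation

        def drop_stopwords(tokens):
            # keep only tokens that are not on the manual stoplist
            return [t for t in tokens if t not in STOPLIST]

        print(drop_stopwords(["united", "states", "united_states", "congress"]))
        # -> ['united_states', 'congress']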
