
text mining and word counting problem

PatrickHou Member Posts: 6 Contributor I
edited December 2018 in Help

Hi 

 

I'm new to RapidMiner and I'm currently analyzing several txt documents. Let's say I have already found the 20 most frequently appearing words and I want to know (and only know) how many times they show up in each document. Can someone give me some ideas?

 

I also have a problem: "united", "states" and "united_states" all appear in my results, but I can't simply replace them, because not every "united" is related to "united states". How can I pull out "united_states" as its own term without it also being counted under "united" and "states"?

 

Thanks 

Patrick

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    For your first question: use Process Documents, supply a specific wordlist containing just your 20 words, and then compute the word vector using Term Occurrences.

    For your second question, you can use Generate N-Grams after Tokenize (and your other text preprocessing), which will give you a separate "united_states" token distinct from both "united" and "states".

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
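
    A rough Python sketch of what those two steps amount to (an illustration only, not RapidMiner itself; the folder name and the short wordlist are placeholder assumptions, and your 20 most frequent words would go in WORDLIST):

        import re
        from collections import Counter
        from pathlib import Path

        WORDLIST = ["united", "states", "government"]  # placeholder: your 20 most frequent words
        DOC_DIR = Path("documents")                    # placeholder: folder holding the ~50 .txt files

        def tokenize(text):
            # lowercase and split on non-letters, roughly what a simple Tokenize step does
            return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

        def bigrams(tokens):
            # join neighbouring tokens with "_", roughly what Generate N-Grams (length 2) does
            return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

        for path in sorted(DOC_DIR.glob("*.txt")):
            tokens = tokenize(path.read_text(encoding="utf-8"))
            counts = Counter(tokens + bigrams(tokens))
            # one row of term occurrences per document, restricted to the terms of interest
            row = {term: counts[term] for term in WORDLIST + ["united_states"]}
            print(path.name, row)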
  • PatrickHou Member Posts: 6 Contributor I

    Thanks for the reply!

     

    I have already used Term Occurrences, but that gave me the overall occurrence count for each word, and I want to know each word's occurrences per document (I have about 50 files).

     

    For the second question, does that mean those "united" and "states" tokens are not related to "united_states"?

     

    Patrick

  • PatrickHou Member Posts: 6 Contributor I

    I looked into the documentation, and it seems that when I use the n-Gram operator it builds n-grams for all words, whether or not they are actually related. That means I need to filter or prune those unwanted n-grams, I think? But how?

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    You might want to post your process XML (see the instructions in the right sidebar), since the count should be generated for each document assuming each document is a separate entity in your input data.  Do you have the "create word vector" parameter checked?

    The single counts are not exclusive of the n-gram, but the exclusive uses can be easily calculated via subtraction.  So if there are 10 total occurrences of "united" and 6 occurrences of "united_states" then you know that 4 of the "united" occurrences were not associated with "united_states".

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
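
    To make the subtraction concrete, here is a small Python sketch over made-up per-document counts shaped like the word vector described above (one row per document, one column per term; the file names and numbers are invented for illustration):

        # hypothetical term-occurrence rows, one per document
        doc_counts = {
            "doc01.txt": {"united": 10, "states": 7, "united_states": 6},
            "doc02.txt": {"united": 3,  "states": 3, "united_states": 3},
        }

        # occurrences of "united" that are not part of "united_states"
        for doc, counts in doc_counts.items():
            print(doc, counts["united"] - counts["united_states"])
        # doc01.txt 4
        # doc02.txt 0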
  • PatrickHou Member Posts: 6 Contributor I

    I found that the Stopwords (Dictionary) operator can do the trick if I manually add the words I don't need inside Process Documents. For the small case I'm working on that's enough, but I'll still look for operators that may deal with this problem more directly.

     

    Thank you.
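
    The same trick sketched in Python: after the n-grams are generated, drop the single tokens you no longer want by checking them against a hand-made stoplist (the stoplist and the example tokens below are made-up placeholders):

        STOPLIST = {"united", "states"}  # placeholder: single words to discard after n-gram generation

        def drop_stopwords(tokens):
            # keep only tokens that are not on the manual stoplist
            return [t for t in tokens if t not in STOPLIST]

        print(drop_stopwords(["united", "states", "united_states", "congress"]))
        # -> ['united_states', 'congress']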
