RapidMiner

WordList -> Document Operator?

SOLVED


I'm trying to batch process a large group of individual text files so that I can tokenize them. I'm using the Text Processing operator group. I'm processing the files into a single WordList, which I'm then trying to tokenize. Before I can tokenize, I need to convert the WordList into a document, but there doesn't appear to be a Generate Document operator, which is what Quick Fix recommends.

Any ideas?

Sorry for the beginner's question - I'm brand new to this.

Very respectfully,

Ben

1 REPLY
RM Staff
Solution

Re: WordList -> Document Operator?

Hi Ben,

No worries - we all started at some point :)

The wordlist is actually the final result of the text processing operators, i.e. it is what you get after you have done all the necessary text processing such as tokenization. All those steps happen "inside" the text processing operator (do you see the little icon in the bottom right corner of the operator? It indicates that this is an operator you can go "inside" with a double click).
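To make the order of things concrete, here is a rough sketch in plain Python. It is not RapidMiner code and not how the operators are implemented; the names `tokenize` and `build_wordlist` are made up for illustration. It only shows the idea that the documents are tokenized first and the word list falls out at the end.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+", text.lower())

def build_wordlist(documents):
    """Tokenize every document, then collect the tokens into one word list with counts."""
    counts = Counter()
    for text in documents:
        counts.update(tokenize(text))
    return counts

docs = ["The cat sat.", "The dog sat down."]
wordlist = build_wordlist(docs)
print(sorted(wordlist))
```

So the word list is an output of tokenizing the documents, not an input to it.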

I think it is probably easier if you follow along with one of the following tutorials (there are tons more if you search on Google):

https://rapidminer.com/resource/text-mining-rapidminer/

https://www.youtube.com/watch?v=6EyQ2TWYsVw

http://vancouverdata.blogspot.com/2010/11/text-analytics-with-rapidminer-loading.html

So what is the point of the wordlist then? It makes sure that you use exactly the same words (and only those) for scoring as for training. This is something which is actually kind of annoying in R, for example, which is why I really prefer to do text analytics in RapidMiner...
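As a small illustration of the "same words for scoring as for training" idea, here is a sketch in plain Python, again not RapidMiner code; `tokenize` and `vectorize` are hypothetical helper names. The word list is built once from the training texts, and a new text is then counted against that fixed list only.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def vectorize(text, wordlist):
    """Count occurrences of each word in the fixed word list; words not in the list are ignored."""
    tokens = tokenize(text)
    return [tokens.count(word) for word in wordlist]

# Build the word list once, from the training documents only.
training_docs = ["good movie", "bad movie"]
wordlist = sorted({tok for doc in training_docs for tok in tokenize(doc)})

# Score a new document against the same word list.
# "film" was never seen in training, so it gets no column of its own.
new_vector = vectorize("good good film", wordlist)
print(wordlist, new_vector)
```

The vector for the new text has one entry per training word, in the same order, so it lines up with whatever the model saw during training.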

Cheers,

Ingo

