"Text Mining and the Word List"

MartinLiebig · June 2016

Symptoms

Using Process Documents (from Data), you can generate a tokenized example set from a given set of documents. If you use one Process Documents operator for training and another for testing, you might get an "incompatible number of attributes" error when you apply the model.

Diagnosis

The problem is probably that you did not transfer the word list from one Process Documents operator to the other. The word list mainly carries two pieces of information:

Which attributes to generate
The normalization

If you do not transfer the word list, words that do not occur in a given document set won't create an attribute for it. If you use pruning, different words will be deleted from each bag of words. Another effect is that even if you create the same attributes, your normalization (of TF-IDF) might be different.
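The effect can be reproduced outside RapidMiner with a small Python sketch. This is not how Process Documents is implemented; the helper names `tokenize` and `build_word_list`, the toy documents, and the simple smoothed IDF formula are all assumptions chosen for illustration. It builds a word list independently for a training set and a test set and shows that both the attributes and the IDF weights come out different.

```python
import math

def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def build_word_list(docs):
    """Learn a word list {word: idf} from one document set."""
    n = len(docs)
    df = {}
    for doc in docs:
        for word in set(tokenize(doc)):
            df[word] = df.get(word, 0) + 1
    # Illustrative IDF: log(N / document frequency) + 1.
    return {w: math.log(n / df[w]) + 1.0 for w in sorted(df)}

train_docs = ["good movie", "bad movie", "good plot"]
test_docs = ["bad acting", "good acting"]

# Two independent word lists, as with two unconnected operators.
train_words = build_word_list(train_docs)
test_words = build_word_list(test_docs)

print(sorted(train_words))  # attributes seen during training
print(sorted(test_words))   # attributes generated for testing
print(train_words["good"], test_words["good"])  # same word, different weight
```

A model trained on the first attribute set cannot be applied to the second one directly, which is the kind of mismatch the error message reports.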

Solution

Transfer the word list created in your training stream over to the application stream. Thus you create
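As a rough analogy in Python (again a sketch with hypothetical helper names and toy data, not RapidMiner's own code), reusing the word list means learning it once on the training documents and applying that same object to the test documents. Words unknown to the word list are dropped, and every vector has the same length and uses the same IDF weights.

```python
import math

def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def fit_word_list(docs):
    """Learn {word: idf} from the training documents only."""
    n = len(docs)
    df = {}
    for doc in docs:
        for word in set(tokenize(doc)):
            df[word] = df.get(word, 0) + 1
    return {w: math.log(n / df[w]) + 1.0 for w in sorted(df)}

def vectorize(doc, word_list):
    """Build one TF-IDF row using a given word list; unknown words are ignored."""
    tokens = tokenize(doc)
    return [tokens.count(w) * idf for w, idf in word_list.items()]

train_docs = ["good movie", "bad movie", "good plot"]
test_docs = ["bad acting", "good acting"]

word_list = fit_word_list(train_docs)  # created in the training stream
train_vectors = [vectorize(d, word_list) for d in train_docs]
test_vectors = [vectorize(d, word_list) for d in test_docs]  # reused here

print(list(word_list))
print(test_vectors)
```

Both example sets now have one column per entry in `word_list`, so a model trained on the first can be applied to the second.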

cici · January 2022

Sorry, I don't quite understand. How can I generate and transfer the word list? Thank you for your reply.

MartinLiebig · January 2022

By connecting the wor port of the upper Process Documents operator with the wor port of the lower one, as shown in the screenshot.

cici · January 2022

Hello, I have followed the operator connections you showed, but it still shows the actual attributes. I don't know if I'm doing this right?

cici · January 2022

Thank you for your reply! I connected the ports exactly as you showed, but it still reports that the attributes do not match. I also have another question: since I already have a training set and a test set, why should I use validation?
