[SOLVED] RapidMiner Sentiment Analysis Problem
Hi,
I have a RapidMiner Studio process that trains a linear SVM using positive and negative product reviews. The training part works OK up to the performance calculation. However, when I apply my model to unseen, unlabeled data, I get this error:
"Problem occured. The input ExampleSet does not match the training ExampleSet. Missing attribute: 'aaaahh'. The operator expects the input ExampleSet to have a set of attributes which is equal or a superset of the ExampleSet used for training of the input model. Please make sure that the attributes of the two ExampleSets satisfy this condition."
This beats me. During training, I use the Process Documents from Data operator to tokenize my text, and I do the same to the unlabeled data just before passing it to the model. The training and testing ExampleSets will contain different words and phrases, and the Process Documents operator turns these words into attributes. So I cannot understand why the Apply Model operator thinks that the attributes in the training ExampleSet should match the attributes in the testing set, hence its expectation to find the word 'aaaahh' in the testing set as well. Could anyone point me in the right direction, please? (Technically I can see why this is happening, but it seems illogical, so I must have done something wrong in my process design.)
Unfortunately I cannot embed the code as my message would exceed the 20k character limit.
Thanks
Answers
Have you connected the word list from training to the Process Documents operator that you use when applying the model?
~Martin
Dortmund, Germany
Why is this marked as [SOLVED]? Is the one reply the correct answer? I'm having the same problem.
Please have a look at this KB article:
http://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/Text-Mining-and-the-Word-List/ta-p/31723
~Martin
Dortmund, Germany
To build on @mschmitz's knowledge base post: once you do your text transformations (e.g. tokenize, filter stop words, etc.) in the Process Documents operator, many words will be stripped out of the corpus (e.g. the, a, lol). The TF-IDF values of the remaining words get passed downstream via the EXA port to your machine learning algorithm. That ExampleSet will have "X" columns.
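The problem comes in when your testing set gets processed on its own: it ends up with a different set of columns (some of the "X" training words are missing and "n" new words appear), so the model cannot be applied and the process breaks. Hence passing the word list from the WOR port of the training Process Documents operator to the one that handles the testing set. This way only the columns you trained your model on will be created for the testing set.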
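If it helps to see the idea outside of RapidMiner, here is a rough Python sketch of what passing the word list achieves. It is only an analogy, not how RapidMiner implements it: the function names, the tiny stop word list, and the sample reviews are all made up for illustration, and it uses raw word counts instead of TF-IDF to keep it short.

```python
from collections import Counter

# Tiny illustrative stop word list (an assumption, not RapidMiner's list).
STOP_WORDS = {"the", "a", "is", "it", "this"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (word.strip(".,!?'\"").lower() for word in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]


def build_word_list(documents):
    """Collect the sorted vocabulary of a document set (the 'word list')."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def vectorize(text, word_list):
    """Count tokens, keeping only the columns named in word_list."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in word_list]


# Made-up sample reviews.
train_docs = ["Aaaahh this is great", "The battery is terrible"]
test_docs = ["Great battery, awesome screen"]

# The word list is built once, from the training data only.
word_list = build_word_list(train_docs)

# Processing the test set on its own yields different columns,
# which is the situation that triggers the "Missing attribute" error.
independent_columns = build_word_list(test_docs)
missing = [w for w in word_list if w not in independent_columns]

# Reusing the training word list forces the test vectors into the
# same columns: unseen words are dropped, absent words become 0.
test_vectors = [vectorize(doc, word_list) for doc in test_docs]

print("training columns:", word_list)
print("missing when processed alone:", missing)
print("test vectors on training columns:", test_vectors)
```

Words that only occur in the test data ("awesome", "screen" here) are simply ignored, and training words that do not occur in a test document get a zero, so the model always sees the columns it was trained on.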