"Text Mining term occurrences per label value"
Hi everyone!
I recently started using RapidMiner for text mining, as it seems to be a pretty powerful tool for the job.
When using the "Process Documents from Data" operator, I get an output called WordList, which gives an overview of the different words in the documents and their frequency of occurrence. I also set a label on the dataset, and the table shows the values of this label as separate categories, for which it should give term occurrence frequencies. However, while "document occurences" and "Total occurence" seem to be calculated correctly for every word, all the category columns just show 0 for every word.
I would expect a word like, say, "sponsor", which occurs in 10 documents, to be distributed over the different categories, since every document was classified into a category.
Did I do something wrong in the data import process? Are there prerequisites I do not know about for the word occurrences to be split correctly across all values of the label variable?
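To make the expectation concrete, here is a minimal Python sketch of the per-label breakdown described above. It is not how RapidMiner computes the WordList; the toy rows and the function name `document_occurrences_per_label` are made up for illustration, and tokenization is a plain whitespace split.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (text, label) pairs standing in for the real dataset.
docs = [
    ("our sponsor was great", "positive"),
    ("the sponsor left early", "negative"),
    ("no news today", "neutral"),
    ("sponsor announced", "neutral"),
]

def document_occurrences_per_label(rows):
    """For each word, count how many documents of each label contain it."""
    counts = defaultdict(Counter)
    for text, label in rows:
        # set() so a word is counted once per document (document occurrences).
        for word in set(text.lower().split()):
            counts[word][label] += 1
    return counts

counts = document_occurrences_per_label(docs)
print(dict(counts["sponsor"]))
```

In this sketch the per-label counts for a word always add up to its overall document count, which is the behaviour expected from the category columns of the WordList.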
Thanks in advance,
Arno
Answers
Either I do not get it or I simply cannot reproduce this. Could you please provide a minimal reproducible example, i.e. the process and a small data set for it to load? You can use the code tags to paste the XML of the process and the data, for instance in CSV format.
Cheers
Marcin
How do I add data and a screenshot to a forum post? I get the image tags but can't upload an image.
The data is pretty simple, though: just one variable filled with text, and a second variable holding a sentiment label (neutral, positive, negative).
data is like this: code:
The good news is that I could reproduce your issue. Even better, I know which operator causes it: if you remove the "Extract Content" operator, you will see correct values for the label. The bad news is that I am not sure why this happens, and I am still investigating. It seems to have something to do with metadata that this operator adds to the document.
As a workaround, you can remove the HTML tags without using the "Extract Content" operator. This can be done with the "Replace" operator, which you have to insert before the "Process Documents from Data" operator. Enter a regular expression that matches HTML tags in the "replace what" parameter and leave "replace by" empty. This should remove all tags, as the "Extract Content" operator did.
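The regular expression itself did not survive in this post, so the pattern below is an assumption, not the original one. It is a minimal Python sketch of the same idea outside RapidMiner: match an opening "<", any run of characters other than ">", and the closing ">", then replace each match with nothing. The names `TAG_PATTERN` and `strip_tags` are made up for illustration.

```python
import re

# Assumed pattern, not taken from the post: "<", then any characters
# except ">", then ">". It matches simple tags such as <p> or </b>.
TAG_PATTERN = re.compile(r"<[^>]*>")

def strip_tags(text: str) -> str:
    """Replace every match of TAG_PATTERN with the empty string."""
    return TAG_PATTERN.sub("", text)

print(strip_tags("<p>Our <b>sponsor</b> was great.</p>"))
```

A pattern this simple does not decode entities such as `&amp;` and can misfire on a ">" inside an attribute value, so it is only a rough stand-in for a real HTML parser.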
I hope this helps
Marcin
I will test your solution soon and give you feedback as soon as I have some.
Thanks again,
Arno