Set categories by finding words in a document
Hello everyone,
I am new to RapidMiner but enjoying the ride so far. I am stuck on a couple of issues.
First, I have a set of 3 categories, each defined by 5 words. If a document contains all 5 words of a category in its text, I would like to assign the document to that category.
In other words, I would like to go through my dataset, search each document's text for the 5 words of each category, and assign the document to the category for which it finds all 5 words.
Is there a way to do that in RapidMiner?
Cheers,
D
Answers
You should use the Text Processing extension to tokenize your documents. Set the vector_creation parameter of your Process Documents operator to Binary Term Occurrences. You end up with an example set that contains the documents as rows and the tokens as columns. If the value of a column is greater than 0 in a row, the word appeared in the corresponding document. You can then use Generate Attributes to create a new attribute that checks whether all 5 words are present and stores the result. Have a look at the attached process.
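For readers who want to see the logic outside RapidMiner, here is a minimal Python sketch of the same idea: build a binary term-occurrence row per document, then check whether all 5 words of a category have a value greater than 0. This is not the attached process. The category names, the word lists, and the function names are made up for illustration, and the tokenizer is a simple assumption (lowercase, split on non-alphanumeric characters).

```python
import re

# Hypothetical categories: each one is defined by 5 words (illustrative only).
CATEGORIES = {
    "sports": {"team", "match", "score", "league", "coach"},
    "finance": {"stock", "market", "price", "share", "trade"},
    "tech": {"software", "server", "code", "network", "data"},
}


def binary_term_occurrences(text, vocabulary):
    """Return {word: 1 or 0} for each vocabulary word, like one row of a
    binary term-occurrence table (1 = the word appears in the document)."""
    # Assumed tokenization: lowercase and split on non-alphanumeric characters.
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return {word: int(word in tokens) for word in sorted(vocabulary)}


def assign_categories(text, categories):
    """Return the names of all categories whose words are ALL present."""
    vocabulary = set().union(*categories.values())
    row = binary_term_occurrences(text, vocabulary)
    return [
        name
        for name, words in categories.items()
        if all(row[word] > 0 for word in words)
    ]


if __name__ == "__main__":
    doc = "The coach said the team will score in the next league match."
    print(assign_categories(doc, CATEGORIES))
```

A document that contains only some of a category's words gets no label from this check, which mirrors the "all 5 words" rule in the question.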
Best,
Marius