Due to recent updates, all users are required to create an Altair One account to login to the RapidMiner community. Click the Register button to create your account using the same email that you have previously used to login to the RapidMiner community. This will ensure that any previously created content will be synced to your Altair One account. Once you login, you will be asked to provide a username that identifies you to other Community users. Email us at Community with questions.
set categories by finding words in a document
Hello everyone,
I am new to Rapidminer but enjoying the ride so far. I am stuck with a couple of issues..
First, I have a set of 3 categories, each one is defined by 5 words.. meaning that if a document has those 5 words in its corpus then I would like to assign that document to that particular category.
In other words, I would like to go through my dataset, search the corpus for the 5 words of each category and associate the document to the category in which it finds all 5 words.
Is there a way to do that in Rapidminer?
Cheers,
D
I am new to Rapidminer but enjoying the ride so far. I am stuck with a couple of issues..
First, I have a set of 3 categories, each one is defined by 5 words.. meaning that if a document has those 5 words in its corpus then I would like to assign that document to that particular category.
In other words, I would like to go through my dataset, search the corpus for the 5 words of each category and associate the document to the category in which it finds all 5 words.
Is there a way to do that in Rapidminer?
Cheers,
D
Tagged:
0
Answers
you should use the Text Processing extension to tokenize your documents. You end up with an example set which contains the documents as rows and the tokens as columns. If the value of a column is greater than 0 in a row it means that the word appeared in the corresponding document. You can then use Generate Attributes to create a new attribute by checking if the 5 words are present and writing the result to the new attribute. Change the vector_creation parameter of your process documents to Binary Term Occurrences. Have a look at the attached process.
Best,
Marius