Search for Keywords
Hello community,
I am currently doing my master's degree, and in one of our courses my group and I have to work on a project with RapidMiner. We have no background in programming, and this is the first time we are working with RapidMiner. Our task is to create a text mining tool that crawls a list of Excel files and, as a first step, lets us search for a list of keywords. We then need to know whether the texts contain those keywords or not. We would also like to know how often each keyword appears in those texts.
We tried using the following operators:
1. Select Attributes
2. Filter Documents (by Content) (we created a loop that goes through the Excel file and writes every text into a separate document)
3. Filter Examples
However, we don’t really know how to use those operators, because nothing we tried (playing with the operators' different options) worked.
Another idea we had is to compute the intersection of the texts and the keyword list and see which elements the two files have in common (but again, we don’t know how to implement this).
Are we heading in the right direction, or do you have any tips on how we should start?
I hope you can help us.
Cheers
Tim
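For readers who want to see the intersection idea from the question outside RapidMiner, here is a minimal Python sketch. It is not a RapidMiner process. The function names, the sample texts, and the keyword list are made up for illustration, and it assumes every keyword is a single word and that matching should ignore case.

```python
import re

def tokenize(text):
    """Split text into lowercase tokens made of word characters."""
    return re.findall(r"\w+", text.lower())

def keywords_present(texts, keywords):
    """For each text, return the sorted keywords that occur in it at least once."""
    keyword_set = {k.lower() for k in keywords}
    # Set intersection: tokens of the text that are also in the keyword list.
    return [sorted(keyword_set & set(tokenize(t))) for t in texts]

# Hypothetical sample data standing in for the Excel contents.
texts = [
    "The pump failed after the valve was replaced.",
    "No issues found during inspection.",
]
keywords = ["pump", "valve", "leak"]

print(keywords_present(texts, keywords))
# prints [['pump', 'valve'], []]
```

The intersection only answers whether a keyword occurs in a text. It does not say how often, so counting needs a separate step.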
Best Answer
MartinLiebig
Hi,
I think you can just do Process Documents with a sole Tokenize and afterwards filter on the columns for the words you are searching for?
Best,
Martin
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
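The logic behind this suggestion (tokenize every text, count the words, then keep only the columns for the keywords) can be sketched in plain Python. This is an illustration under assumptions, not the RapidMiner operators themselves: the function names and sample data are hypothetical, keywords are assumed to be single words, and matching is case-insensitive.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase tokens made of word characters."""
    return re.findall(r"\w+", text.lower())

def keyword_counts(texts, keywords):
    """Return one dict per text, mapping each keyword to how often it occurs."""
    wanted = [k.lower() for k in keywords]
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))  # counts for every word in the text
        # Keep only the keyword "columns"; a Counter returns 0 for missing words.
        rows.append({k: counts[k] for k in wanted})
    return rows

# Hypothetical sample data standing in for the Excel contents.
texts = [
    "Pump noise. The pump and the valve were checked.",
    "Inspection passed.",
]
keywords = ["pump", "valve", "leak"]

for row in keyword_counts(texts, keywords):
    print(row)
# prints {'pump': 2, 'valve': 1, 'leak': 0}
# prints {'pump': 0, 'valve': 0, 'leak': 0}
```

A count greater than zero answers the "does the text contain the keyword" question, and the count itself answers "how often".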
Answers
Thank you for your answer, we'll try that.
Right now we have two streams: the first reads the Excel list with the texts and the second reads the one with the keywords. We then used Process Documents with Tokenize for both paths (1. Read Excel 2. Nominal to Text 3. Process Documents from Data 4. Data to Documents - Process Documents). We see all the words and the rows in which they occur, but is it possible to filter the results or change the settings so that we only see the keywords? That would be much more convenient, because we wouldn't have to look through all the words.