"Dictionaries and stemming"
I'm currently using the text plug-in and would like to clarify some of its peculiarities. I'm not using the DictionaryStemmer block; I'm simply working with the English stopword filter, the tokenizer and the Porter stemmer.
My understanding so far is:
1) My text is filtered against a set of English stop words and some words are pruned (e.g. and, or).
- I have to work with biology texts, so I'm wondering what happens to unusual terms such as IL-6. Are these terms filtered out or kept?
2) The stemmer keeps only the "basic chunks" of my words. I think this is based on a dictionary.
- Could you tell me which dictionary that is? I need to know precisely, in order to answer the question "does it contain medical terms such as glicolase?", which is crucial for me right now.
- What happens to my unusual terms (e.g. IL-6)? Are they pruned, chunked in some way or kept as they are?
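To make the IL-6 question concrete, here is a minimal sketch in plain Python of the two behaviours I can imagine. This is not the plug-in's code: the stop word list and both tokenization rules are my own assumptions, made up only to illustrate the difference I am asking about.

```python
import re

# Hypothetical, tiny stop word list; the list the plug-in really uses is unknown to me.
STOP_WORDS = {"and", "or", "the", "of", "is", "in"}

def tokenize_non_letters(text):
    """Assumed rule A: split on every non-letter, so digits and hyphens vanish."""
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]

def tokenize_whitespace(text):
    """Assumed rule B: split on whitespace only, so hyphenated terms stay whole."""
    return text.split()

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form appears in the stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

text = "IL-6 and TNF regulate the response"
by_non_letters = remove_stop_words(tokenize_non_letters(text))
by_whitespace = remove_stop_words(tokenize_whitespace(text))

print(by_non_letters)  # IL-6 is reduced to "IL"
print(by_whitespace)   # IL-6 survives unchanged
```

Under rule A the term IL-6 loses its number before the stemmer ever sees it; under rule B it reaches the stemmer intact. I would like to know which of these (if either) matches what the tokenizer actually does.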
Thanks for your kind attention. Hope that someone can help!