"How to filter out a text so as to keep only words given in a list of words"

barthos Member Posts: 20 Contributor II
edited May 2019 in Help
Hello,

I would like to filter a text so that the operator keeps only the words that are present in a provided list (or, equivalently, removes all the words that are not in the list). Ideally, a "Stopword by dictionary" operator with an "invert selection" option would be perfect.
As a side question, I would like to know the purpose of the "wor" entry (I guess it means word) in the Process_Document_from_Data operator.

Thanks,
Barthélémy
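
For readers who only need the filtered text and can work outside RapidMiner, the "keep only listed words" behaviour asked for above can be sketched in a few lines of plain Python. This is a minimal illustration, not a RapidMiner operator: the function name `keep_listed_words`, the `\w+` tokenization, and the case-insensitive matching are all assumptions made for the example.

```python
import re

def keep_listed_words(text, allowed_words):
    """Return the text reduced to the tokens that appear in allowed_words.

    Assumptions for this sketch: tokens are runs of word characters,
    and matching against the list ignores case.
    """
    allowed = {w.lower() for w in allowed_words}
    tokens = re.findall(r"\w+", text)
    # Keep each token in its original form and order if it is on the list.
    return " ".join(t for t in tokens if t.lower() in allowed)

filtered = keep_listed_words(
    "The quick brown fox jumps over the lazy dog",
    ["fox", "dog", "the"],
)
print(filtered)  # The fox the dog
```

Punctuation is dropped by this tokenization; a different regular expression would be needed to preserve it.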

Answers

  • colo Member Posts: 236 Maven
    Hi Barthélémy,

    when I read your post I remembered a similar question posted some time ago. You can find it here: http://rapid-i.com/rapidforum/index.php/topic,3493.0.html (did you even search for it?  ;)) But don't expect a fully satisfying solution there. I don't know if the developers have something new at hand today...

    Which "wor" entry do you mean? The input port of the operator?

    Regards
    Matthias
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    if you have a word list and want to count only the words that are in this list, you can simply forward the word list to the "wor" input port of the Process Documents operator. Only then is it assured that new texts get the same representation as during training! If you don't do this, the set of words can differ and the TF-IDF calculation will be different.

    If you need the filtered text itself rather than a filtered TF-IDF representation, then unfortunately there is no way to do that so far. You could raise a feature request in our bugtracker for that.

    With kind regards,
    Sebastian
  • barthos Member Posts: 20 Contributor II
    Thanks a lot!
    However, I've tried to make a list of words to pass to the "wor" entry, but I haven't found a way to do it. Is there a special operator to transform documents or an example set into a list of words?
    Thanks again,
    Barthélémy
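
The idea discussed in this thread, deriving a word list from a set of documents and then representing new texts against that fixed list, can be illustrated outside RapidMiner. The following is a minimal sketch in plain Python, assuming simple lowercase `\w+` tokenization and raw term counts instead of TF-IDF; the helper names `build_word_list` and `count_vector` are hypothetical and do not correspond to RapidMiner operators.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase the text and split it into runs of word characters.
    return re.findall(r"\w+", text.lower())

def build_word_list(documents):
    """Collect the sorted list of distinct tokens seen in the documents."""
    return sorted({tok for doc in documents for tok in tokenize(doc)})

def count_vector(text, word_list):
    """Count each word of word_list in text; words not on the list are ignored."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in word_list]

training_docs = ["red apple", "green apple pie"]
word_list = build_word_list(training_docs)

# A new text is mapped onto the same columns as the training documents,
# even though it contains words ("and", "blue") that are not on the list.
new_vector = count_vector("apple pie and blue apple", word_list)
print(word_list)   # ['apple', 'green', 'pie', 'red']
print(new_vector)  # [2, 0, 1, 0]
```

Because the columns come from the stored word list rather than from the new text, every vector has the same length and column order, which is the consistency property described in the answer above.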