"Filter Stopwords (Dictionary) -- Unicode support?"

pleonard Member Posts: 4 Contributor I
edited May 2019 in Help
Hi there, I'm having good luck with the Filter Stopwords (Dictionary) operator in creating a stoplist for Danish, but I'm finding that stopwords containing a-ring (å) are not filtered. I've confirmed that the file is in UTF-8, as are my source texts, and that the line endings are correct. Other stopwords, which contain only ASCII characters, are filtered correctly. Has anyone come across this before?
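
The checks described above (valid UTF-8, sane line endings) can be scripted. The following is a minimal sketch in Python, not RapidMiner code; the function name `inspect_stoplist` and the report keys are made up for illustration. It also checks Unicode normalization, on the assumption that a decomposed "a" plus combining ring would fail to match a precomposed "å".

```python
import unicodedata

def inspect_stoplist(raw: bytes) -> dict:
    """Report common encoding pitfalls in a stoplist given as raw bytes."""
    report = {
        "has_bom": raw.startswith(b"\xef\xbb\xbf"),  # UTF-8 byte order mark
        "has_crlf": b"\r\n" in raw,                  # Windows line endings
    }
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        report["valid_utf8"] = False
        return report
    report["valid_utf8"] = True
    # Decomposed "a" + U+030A (NFD) is not equal to precomposed "å" (NFC),
    # so a stoplist and a text in different forms would never match.
    report["is_nfc"] = unicodedata.normalize("NFC", text) == text
    return report

# Usage with a hypothetical file name:
#   with open("stopwords_da.txt", "rb") as f:
#       print(inspect_stoplist(f.read()))
```

If `is_nfc` comes back false for either the stoplist or the source texts, normalizing both to the same form before filtering would be worth trying.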

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,

    No, I haven't, but the number of Danish texts I have processed is close to zero :) Did you make sure that RapidMiner opens the text file with UTF-8 encoding?

    Anyway, if you have a good stopword file for Danish, would you like to contribute it? We could include it in the core...

    Greetings,
      Sebastian
  • pleonard Member Posts: 4 Contributor I
    OK, I've confirmed this is a bug, I think. Let's move to German because that is a more common language:

    Set these two things:

    1) rapidminer.general.encoding to UTF-8
    2) Process Documents from Files to UTF-8

    Ensure both your text and stoplist are in UTF-8.

    Text: schloß means castle.
    Stoplist: schloß castle

    Result: schloß means

    This is with RapidMiner 5.1.001 on Mac OS X 10.6. Surely there must be people from Germany working with this who have noticed this problem before, or who know a trick to get around it?

    Thanks!
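
The symptom in the test case above can be reproduced outside RapidMiner. This is a minimal sketch in Python under one assumption that the thread does not confirm: that the stoplist file is decoded with a legacy platform default (MacRoman here) rather than UTF-8. The helper `filter_stopwords` is hypothetical and uses a plain whitespace tokenizer.

```python
def filter_stopwords(text: str, stopwords: set) -> list:
    """Drop tokens found in the stopword set (whitespace tokenizer, strips '.')."""
    tokens = [t.strip(".") for t in text.split()]
    return [t for t in tokens if t and t not in stopwords]

text = "schloß means castle."
stoplist_bytes = "schloß\ncastle\n".encode("utf-8")

# Dictionary decoded with the encoding it was written in.
good = set(stoplist_bytes.decode("utf-8").split())
# Assumed failure mode: the same bytes decoded as MacRoman, which turns
# the two UTF-8 bytes of "ß" into two unrelated characters.
bad = set(stoplist_bytes.decode("mac_roman").split())

print(filter_stopwords(text, good))  # ['means']
print(filter_stopwords(text, bad))   # ['schloß', 'means']
```

With the mis-decoded dictionary, the ASCII entry "castle" still matches while "schloß" does not, which is the pattern reported above.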
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I have added a parameter for choosing the encoding of the dictionary. It will be available with the next Text Extension release, but it is not yet certain when that will be.

    Greetings,
      Sebastian
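
The idea of an explicit dictionary encoding can be illustrated in a few lines. This is only a sketch in Python, not the Text Extension's implementation; `load_stopwords` and its `encoding` parameter are hypothetical names. Stripping a leading byte order mark and normalizing to NFC are extra precautions assumed here, not something described in the thread.

```python
import unicodedata
from pathlib import Path

def load_stopwords(path, encoding="utf-8"):
    """Read a whitespace-separated stoplist using an explicit encoding."""
    # Never rely on the platform default; decode with the given encoding.
    text = Path(path).read_text(encoding=encoding)
    # Drop a leading byte order mark, if the editor wrote one.
    text = text.lstrip("\ufeff")
    # Normalize so precomposed and decomposed spellings compare equal.
    return {unicodedata.normalize("NFC", word) for word in text.split()}
```

Tokens from the source text would need the same normalization before being looked up in the returned set.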
  • pleonard Member Posts: 4 Contributor I
    Thanks! If you have any need of a beta-tester (I work with large Swedish, Danish and Norwegian texts) please let me know and I'd be glad to help out...
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    We are currently working on a completely new Text Extension that will go beyond everything the old one was able to do. We will document our progress in our Special Interest Group for Text Mining. If you want to participate, you are very welcome. I just need your email address in a PM to put you on the list.

    Greetings,
      Sebastian