Why is UTF-8 not working?

heron_oliveira · March 6

Today I converted a PDF to txt, and I'm trying to analyse the frequency of some terms in the text. Although the txt is in UTF-8 and I've already set the program's encoding to the default (SYSTEM) or to 'UTF-8' before tokenizing and generating n-grams, it keeps showing incorrect words. For example, the word should have been 'abrangência' instead of 'abrangãºncia'.
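
For readers hitting the same symptom, here is a minimal Python sketch of how this kind of garbling typically arises: UTF-8 bytes get decoded with a single-byte code page. The choice of cp1252 as the wrong decoder is an assumption (the thread does not say which system encoding was in use), and the result is similar to, not necessarily identical to, the string reported above.

```python
# Illustrative only: cp1252 is an assumed "wrong" encoding, not confirmed by the thread.
original = "abrangência"

# The file on disk holds UTF-8 bytes; "ê" is stored as two bytes.
utf8_bytes = original.encode("utf-8")

# Reading those bytes with a single-byte code page turns "ê" into two characters.
misread = utf8_bytes.decode("cp1252")

# A later lowercasing step (like Transform Cases) also lowercases the stray "Ã".
lowered = misread.lower()

# As long as nothing else has altered the text, the damage can be reversed.
repaired = misread.encode("cp1252").decode("utf-8")

print(misread, lowered, repaired)
```

The text file itself is fine in this scenario; only the decoding step is wrong, which is why the words look correct in an editor but not after processing.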

MartinLiebig · March 6

Hi there,

What operator do you use to read the text file? It should have an encoding setting as well.

Cheers,

Martin

heron_oliveira · March 6

My txt file has the correct words; it only happens when I run operators in RapidMiner. I'm using operators for tokenizing, Transform Cases, Generate n-Grams, Filter Tokens and Filter Stopwords. But the problem starts at the very first operator, which is Tokenize...

heron_oliveira · March 6

I would also like to know what format the stop words list must be in. Since there is no Portuguese stop words operator, I made a list document, but I don't know whether it accepts a list format or whether it should be a dictionary or something else.
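
The thread does not answer this question. As a hedged sketch only, the Python below assumes the common convention for dictionary-style stop word filters: a plain UTF-8 text file with one stop word per line. The file name `stopwords_pt.txt` and the short word list are hypothetical, and whether a given RapidMiner operator accepts exactly this layout should be checked in its documentation.

```python
import tempfile
from pathlib import Path

# Hypothetical, deliberately tiny Portuguese stop word list.
stop_words = ["a", "de", "do", "que", "não"]

with tempfile.TemporaryDirectory() as tmp:
    # Assumed format: one stop word per line, saved as UTF-8.
    path = Path(tmp) / "stopwords_pt.txt"
    path.write_text("\n".join(stop_words) + "\n", encoding="utf-8")

    # Read the list back with the same explicit encoding, skipping blank lines.
    lines = path.read_text(encoding="utf-8").splitlines()
    loaded = {line.strip() for line in lines if line.strip()}

# Filter an already tokenized, lowercased document against the list.
tokens = ["a", "abrangência", "do", "estudo", "não", "mudou"]
kept = [t for t in tokens if t not in loaded]
print(kept)
```

Saving the list in the same encoding that the reading step expects matters here too, otherwise accented stop words such as "não" will not match the tokens.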

heron_oliveira · March 6

Exactly, I was changing the encoding in Settings > Preferences, but in fact I should have done it in the operator settings. Thanks!
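
The same idea, setting the encoding where the file is read rather than relying on a global default, can be sketched outside RapidMiner. The Python below is only an analogy under stated assumptions: `read_text` and `term_frequencies` are hypothetical helper names, and the sample sentence is made up for illustration.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path


def read_text(path, encoding="utf-8"):
    """Read a file with an explicitly chosen encoding instead of the system default."""
    return Path(path).read_text(encoding=encoding)


def term_frequencies(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("A abrangência do estudo. Abrangência nacional.", encoding="utf-8")

    # The encoding is passed at the reading step, which is what fixed the issue here.
    counts = term_frequencies(read_text(sample, encoding="utf-8"))

print(counts.most_common(3))
```

Because the text is decoded correctly at the reading step, the accented word survives tokenizing and lowercasing as a single token.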


Altair RapidMiner Community
