Why is UTF-8 not working?

heron_oliveira Member Posts: 6 Newbie
Today I converted a PDF to TXT, and I'm trying to analyse the frequency of some terms in the text. Although the TXT file is in UTF-8 and I've already changed the program's encoding to the default (SYSTEM) or to 'UTF-8' before tokenizing and generating n-grams, it keeps showing incorrect words. For example, the word should have been 'abrangência' instead of 'abrangãºncia'.

Best Answer

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,512 RM Data Scientist
    Solution Accepted
    Hi there,
    Which operator do you use to read the text file? It should have an encoding setting as well.

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
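
To illustrate why the encoding chosen at read time matters, here is a minimal Python sketch (not RapidMiner code). It assumes the file's UTF-8 bytes were decoded as Windows-1252, which is only a guess at what the system default was; the variable names are made up for the example. Decoding the same bytes with the wrong encoding produces a garbled form much like the one in the question.

```python
# The correctly spelled word from the question.
word = "abrangência"

# A UTF-8 text file stores "ê" as two bytes (0xC3 0xAA).
utf8_bytes = word.encode("utf-8")

# Decoding with the matching encoding recovers the word.
correct = utf8_bytes.decode("utf-8")

# Assumption: the reader fell back to Windows-1252 (cp1252).
# Each of the two bytes then becomes its own character.
garbled = utf8_bytes.decode("cp1252")

print(correct)
print(garbled)
```

Once the text has been decoded incorrectly, later steps such as tokenizing or lowercasing only work on the already garbled characters, so the fix has to happen where the file is read.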

Answers

  • heron_oliveira Member Posts: 6 Newbie
    My TXT file has the correct words; the problem only happens when I run operators in RapidMiner. I'm using Tokenize, Transform Cases, Generate n-Grams, Filter Tokens and Filter Stopwords. But the problem starts with the very first operator, which is Tokenize...
  • heron_oliveira Member Posts: 6 Newbie
    I would also like to know what format the stop word list must have. Since there is no Portuguese stop words operator, I made a list document, but I don't know whether it accepts a plain list or whether it should be a dictionary or something else.
  • heron_oliveira Member Posts: 6 Newbie
    Exactly, I was changing the encoding under Settings > Preferences, but in fact I should have done it in the operator settings. Thanks!