
Why does the Tokenize component not show any list of words?

research_moghim Member Posts: 2 Learner III
edited January 2020 in Help

I am working on a Persian text classification project. Persian text is very similar to Arabic text. When I use Tokenize, it does not show any words on its WordList page, and the ExampleSet page shows the image below:

[Image: 01.png, output of Process Documents from Data]

I need to classify Persian text into categories, but I don't know why the word list is empty.

I follow these steps:

1- Read the dataset with the Read Excel component. It has 2 columns: col1: Persian text, col2: Category

2- Use the Set Role component to label the data

3- Use the Process Documents from Data component, with Tokenize (changing the mode makes no difference) and Filter Token (min: 5, max: 25) inside it

4- Use the Cross Validation component to train with SVM or Bayesian and, in test mode, to get the performance.
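
For readers who want to see what step 3 is expected to produce, here is a minimal Python sketch, outside RapidMiner, of splitting text on non-letter characters and keeping only tokens of 5 to 25 characters. The sample rows, category names, and function names are hypothetical and not taken from the original dataset, and this is only an approximation of what the RapidMiner operators do.

```python
import re
from collections import Counter

# Hypothetical sample rows (Persian text, category); not from the original dataset.
rows = [
    ("دانشگاه تهران در ایران است", "education"),
    ("کتابخانه دانشگاه بزرگ است", "education"),
]

def tokenize(text):
    """Split on anything that is not a Unicode letter (digits and '_' excluded)."""
    return re.findall(r"[^\W\d_]+", text)

def filter_by_length(tokens, min_chars=5, max_chars=25):
    """Keep tokens whose length is within [min_chars, max_chars]."""
    return [t for t in tokens if min_chars <= len(t) <= max_chars]

# Build a simple word list: token -> number of occurrences across all rows.
word_list = Counter()
for text, _category in rows:
    word_list.update(filter_by_length(tokenize(text)))

for word, count in word_list.most_common():
    print(word, count)
```

When the text is read correctly, the word list contains the longer Persian words, while short words such as "در" and "است" are dropped by the length filter.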

The process runs without errors and the performance is not bad (e.g. accuracy is 50%), but I think my approach is wrong.

Any help would be appreciated.

 

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hi @research_moghim - so I'm not sure many people here (including myself) have much experience with Persian text. :( I will say that, in general, this is often an encoding issue. Perhaps try that?


    Scott

     

  • rfuentealba RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn

    Hi @research_moghim,

     

    Would you mind posting the XML for your process? Although I don't have experience with Persian text, I do have experience with different encodings and language issues. Perhaps we can see what's happening there and whether there is anything we can help with.

     

  • neginz Member Posts: 17 Maven
    @research_moghim
    You should set the encoding to UTF-8.