Using Hindi Language in Tokenizer
Hi,
My documents to be analyzed are in Hindi, encoded as UTF-8. To create the word vector I used the WVplugin. The problem is that I am not getting all the tokens (I tried all the tokenizers in RapidMiner 4.6); in fact, I am getting far too few: 4, to be precise.
I changed the content language and encoding to Hindi and UTF-8, but without any success. Is there any additional setup needed to tokenize the text properly?
~alabiit
Answers
From RapidMiner 5.0 on, you can configure the tokenizer in more detail. You can enter arbitrary split characters, so it should work with any language that separates its words with some character.
Greetings,
Sebastian
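For readers who want to check their documents outside RapidMiner, here is a minimal Python sketch of the split-character idea described above. It is not RapidMiner code and does not explain why 4.6 returned only 4 tokens; the function name `tokenize`, the default set of split characters (including the Devanagari danda "।" and double danda "॥"), and the sample sentence are all assumptions chosen for illustration.

```python
import re

# Assumed default split characters: whitespace, common punctuation,
# and the Devanagari danda / double danda used as sentence terminators.
DEFAULT_SPLIT_CHARS = " \t\n\r.,;:!?\"'()-।॥"

def tokenize(text, split_chars=DEFAULT_SPLIT_CHARS):
    """Split text on any run of the given characters and drop empty tokens."""
    # re.escape makes characters such as '.', '?' and '-' safe inside [...]
    pattern = "[" + re.escape(split_chars) + "]+"
    return [token for token in re.split(pattern, text) if token]

# Illustrative Hindi sample (a Python str is already Unicode; when reading
# from disk, open the file with encoding="utf-8" so the bytes decode correctly).
sample = "मेरा नाम राम है। मैं दिल्ली में रहता हूँ।"
tokens = tokenize(sample)
print(len(tokens), tokens)
```

Because the splitter only looks for separator characters and never tests whether a character is a "letter", it makes no assumptions about the script of the words themselves. If the token count from a sketch like this looks reasonable for a document, the file encoding is probably fine and the tokenizer settings are the more likely place to look.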
Thanks. But upgrading is always a tough job.
I will check it out.
~alabiit