Arabic words recognition

shk721 Member Posts: 9 Contributor II
edited November 2018 in Help
I was wondering if someone could help solve an encoding problem with Arabic text. By choosing the right encoding format in the content_encoding parameter, the system displays the Arabic words correctly in the result view. However, three problems arose:

1. The message viewer displays the words as "?????" when I apply a model.
2. The word list produced also consists of question marks instead of words.
3. When I try to use StopWordFilter, the system isn't able to match the Arabic words in order to filter them.

Thanks in advance,
Hassan

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hello Hassan,

    Did you also try defining the encoding in the main process operator (root)? Maybe this helps.

    About the stop words: RM currently does not support a stop word filter for Arabic words, but you could simply create one with the file-based stop word filter (I don't remember the exact name right now).

    Cheers,
    Ingo
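
As a rough illustration of the file-based approach Ingo mentions, the sketch below loads a stop word list from a UTF-8 file and filters tokens against it. It is plain Java, not RapidMiner's operator; the class name `ArabicStopWords`, the file layout (one word per line) and the sample words are assumptions made for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ArabicStopWords {

    // Decodes the raw bytes explicitly as UTF-8 and keeps one stop word per non-empty line.
    public static Set<String> fromUtf8(byte[] data) {
        Set<String> words = new HashSet<>();
        for (String line : new String(data, StandardCharsets.UTF_8).split("\\R")) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Reads a stop word file without relying on the platform default encoding.
    public static Set<String> load(Path file) throws IOException {
        return fromUtf8(Files.readAllBytes(file));
    }

    // Returns the tokens that are not in the stop word set, in their original order.
    public static List<String> filter(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stop word file containing "fi" (in) and "ila" (to), written as UTF-8.
        Path file = Files.createTempFile("stopwords", ".txt");
        Files.write(file, "\u0641\u064a\n\u0625\u0644\u0649\n".getBytes(StandardCharsets.UTF_8));

        Set<String> stopWords = load(file);
        // Tokens: "al-rahn", "fi", "al-Bahrain"; only "fi" is a stop word.
        List<String> tokens = Arrays.asList(
                "\u0627\u0644\u0631\u0647\u0646", "\u0641\u064a", "\u0627\u0644\u0628\u062d\u0631\u064a\u0646");
        System.out.println(filter(tokens, stopWords).size()); // prints 2
        Files.delete(file);
    }
}
```

Matching only works if the stop word file and the tokens are decoded with the same, correct encoding; a list read with the wrong charset would never equal the Arabic tokens.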
  • shk721 Member Posts: 9 Contributor II
    Hi Ingo,

    Thanks for your prompt response.

    Actually, I have defined the encoding in the root process and in the preferences, and it didn't work. However, I want to know if there is an enhancement of the output encoding in RapidMiner because, as I said in the beginning, reading the input data worked perfectly.

    I am looking for your help to resolve this problem.

    Cheers,
    Hassan
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hello Hassan,

    Hmm, that's sort of weird. I must admit that we do not have any experience with Arabic characters, but we know that the output should also work for Chinese characters, so I assume there is no fundamental problem with this. Could you provide us with some sample texts so we can try to find out what's going on?

    Thanks and cheers,
    Ingo
  • shk721 Member Posts: 9 Contributor II
    Hi Ingo,

    I have been waiting for your response.

    Here is a sample of Arabic text:

    ان الرهن العقاري ذا الأصول الإسلامية، عُمل فيه بطرق موسعة وناجحة بكل المقاييس، في الدول الأجنبية، ونقل هذا النظام عن طريق باحثين تخصصوا في الرهن العقاري، إلى دول إسلامية مثل ماليزيا، وسنغافورة، وكذلك البحرين ودبي.

    I appreciate your response, and I really need to sort this out. Also, to keep you informed about the problem: it occurs when the program writes its output. It looks like the output is written directly with the default encoding, not with the specified encoding.

    I am eagerly waiting to hear from you, because I need to sort this out to start my dissertation.

    Cheers,
    Hassan
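
Hassan's guess that the output is written with the default encoding rather than the specified one would explain the symptom. The Java sketch below is not RapidMiner code; it only illustrates the general mechanism. Encoding Arabic text with a charset that cannot represent it (ISO-8859-1 is used here as a stand-in for an unsuitable platform default) silently replaces each letter with `?`, while UTF-8 round-trips intact. The class and method names are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingRoundTrip {

    // Encodes text to bytes with the given charset, then decodes those bytes with the same charset.
    public static String roundTrip(String text, Charset charset) {
        // Characters the charset cannot represent are replaced by its replacement byte ('?').
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "al-rahn", the second word of the sample text (five letters), written with Unicode escapes.
        String word = "\u0627\u0644\u0631\u0647\u0646";

        System.out.println(roundTrip(word, StandardCharsets.ISO_8859_1));         // prints ?????
        System.out.println(roundTrip(word, StandardCharsets.UTF_8).equals(word)); // prints true

        // The charset used whenever no encoding is passed explicitly.
        System.out.println("platform default: " + Charset.defaultCharset());
    }
}
```

Once the letters have been replaced, the original text cannot be recovered from the output, which is why the fix has to happen at the point where the text is written.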
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hello,

    I must admit that I was not even able to work properly with the test sample, since I had no program available that could display it. I wanted to create a small data file containing some of the words, together with an .aml file describing the data, in order to work with that, but I didn't manage to create those files - at least I was not able to see anything, and I assume I lost the information about the characters somewhere in this process.

    My suggestion: please create a data file together with an .aml file which I can load directly with the ExampleSource operator. Please also specify the encoding in the .aml file and attach both files here, together with the information about the correct encoding. Maybe then I will be able to sort out what's happening in the output.

    Cheers,
    Ingo
  • kochan Member Posts: 11 Contributor II
    I just came across this thread, which is one year old. I have had the same problem with languages written in non-Latin scripts. I resolved the problems that I had in the development version by fixing the text plugin code as I have described here:

    https://sourceforge.net/tracker/?func=detail&aid=2724678&group_id=131810&atid=722307

    If you need to save a model file as something other than binary, then changes also have to be made to the ModelWriter operator.

    Regards,

    Andreas
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Andreas,
    thank you for this hint. Since we do not deal with non-Latin text in our day-to-day work, we weren't fully aware of this. But we will keep it in mind while revising the text plugin for the next major version of RapidMiner.

    Greetings,
      Sebastian
  • drstevekramer Member Posts: 7 Contributor II
    Andreas and Sebastian, is it possible to use the getContentEncoding approach with the separate wvtool Java library, which I am using (rather than the Text plug-in)? I am having the same problem of Arabic text being displayed as question marks, even though I specified UTF-8 as the encoding when creating the WVTDocumentInfo.

    Thanks,
    Steve
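
Whether wvtool offers a hook comparable to getContentEncoding is not answered in this thread. One thing that can be checked independently of wvtool is the output side: in Java, text that was read correctly can still turn into question marks if the stream used to print or write it encodes with a charset that lacks Arabic. A minimal sketch using only the standard library follows; the class name `Utf8Output` and the helper `render` are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Output {

    // Prints text through a PrintStream that encodes with the named charset and returns the raw bytes.
    public static byte[] render(String text, String charsetName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(buffer, true, charsetName)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // "inna", the first word of the sample text (two letters), written with Unicode escapes.
        String word = "\u0627\u0646";

        System.out.println(render(word, "US-ASCII").length); // prints 2 (each letter became one '?' byte)
        System.out.println(render(word, "UTF-8").length);    // prints 4 (two bytes per Arabic letter)

        // Console output with an explicit encoding instead of the platform default.
        // Whether the glyphs show up correctly still depends on the terminal and its font.
        PrintStream console = new PrintStream(System.out, true, "UTF-8");
        console.println(word);
    }
}
```

If the bytes produced with an explicit UTF-8 stream are correct but the display still shows question marks, the remaining suspect is the viewer or console rather than the library that read the document.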