Arabic Light Stemming a CSV file

NoorKhalifa Member Posts: 7 Learner I
edited March 2023 in Help
I have a CSV file with around 4000 rows of text. I want to use the Arabic Light Stemmer to stem each record.

I have done the following but the text is not being stemmed. The output is the same as the input.


and inside the Process

Answers

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi!

    To stem words, first you need words. Use Tokenize before Stem to split the text into words.

    Regards,

    Balázs
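
The advice above (tokenize first, then stem) can be illustrated outside RapidMiner with a minimal Python sketch. The affix lists and the `light_stem`, `tokenize`, and `stem_text` helpers are hypothetical and deliberately tiny; this is not the algorithm behind RapidMiner's Arabic Light Stemmer, only a toy showing why a stemmer needs individual words as input.

```python
# Toy illustration only: a tiny, hypothetical affix list,
# not RapidMiner's Arabic Light Stemmer.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي")


def light_stem(word):
    """Strip at most one known prefix and one known suffix from one word."""
    for prefix in PREFIXES:
        # Keep at least three characters so short words are left alone.
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def tokenize(text):
    """Split on whitespace; real tokenizers also handle punctuation."""
    return text.split()


def stem_text(text):
    """Tokenize first, stem each token, then join the tokens back into a row."""
    return " ".join(light_stem(token) for token in tokenize(text))


sentence = "الكتاب والمدرسة"
print(stem_text(sentence))   # every word is stemmed
print(light_stem(sentence))  # without tokenizing, only the outer ends change
```

In this toy, passing the whole sentence to `light_stem` treats it as one long "word", so the words in the middle are never touched.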
  • NoorKhalifa Member Posts: 7 Learner I
    @BalazsBarany

    I did the following



    inside the Process, but the output is still exactly the same as the input.

    Is there a problem with reading Arabic text?

    I set the encoding to UTF-8 when I imported the CSV file. Is there anything else I should do?
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi,

    Put a breakpoint after Tokenize and experiment with its settings. If you see the words in different colors, the tokenization is working correctly.

    I don't know the conventions for Arabic text; a different word separator may be necessary, for example.

    If the text looks normal to you in RapidMiner, then the encoding is correct. With a wrong encoding, the text would look visibly broken.

    Regards,

    Balázs
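
What "broken" text looks like can be sketched in a few lines of Python, as an illustration independent of RapidMiner. The sample word and the choice of Latin-1 as the wrong encoding are assumptions made for the example.

```python
# Illustration: the same UTF-8 bytes decoded with the right and wrong encoding.
original = "كتاب"                       # sample Arabic word (assumed input)
utf8_bytes = original.encode("utf-8")   # each Arabic letter becomes two bytes

# Decoding with the wrong encoding produces unreadable Latin characters.
garbled = utf8_bytes.decode("latin-1")

# The bytes themselves are intact, so reversing the mistake restores the text.
restored = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(restored == original)
```

If the text shown in the tool is readable Arabic rather than a run of accented Latin characters, the decoding step worked.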
  • NoorKhalifa Member Posts: 7 Learner I
    @BalazsBarany

    I am facing this issue, what could be a possible reason?


  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi,

    You need to use Nominal to Text before Process Documents to mark your nominal attributes as text, which makes them suitable for the Text Processing operators.

    Regards,
    Balázs
  • NoorKhalifa Member Posts: 7 Learner I
    @BalazsBarany

    When I put a breakpoint after Stem, I can see the correctly stemmed sentence. But the final output in the Results looks like the following. What can I do to fix this? I want the output to be rows containing the stemmed sentences.


  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi!

    Use the "keep text" option that all "Process Documents" operators have. 

    The default operation mode of Process Documents is to create the wide table suitable for machine learning methods. 

    Tokenization can split your text into letters, words or sentences. Stemming works on words, at least in Western languages. 

    Regards,
    Balázs
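
The difference between the wide table and the kept text can be sketched in plain Python. The two sample rows are hypothetical, and simple term counts are used for brevity; the weighting that Process Documents actually applies is not covered in this thread.

```python
from collections import Counter

# Hypothetical rows that have already been tokenized and stemmed.
docs = ["كتاب مدرس", "كتاب جديد"]

# One column per distinct token: this is the "wide table" shape.
vocabulary = sorted({token for doc in docs for token in doc.split()})
wide_table = [
    [Counter(doc.split())[term] for term in vocabulary]
    for doc in docs
]

# With the text kept, each row also carries the processed sentence itself.
with_text = [
    {"text": doc, **dict(zip(vocabulary, row))}
    for doc, row in zip(docs, wide_table)
]

for row in with_text:
    print(row)
```

The wide table alone only records which tokens occur in each row; the `text` field is what holds one stemmed sentence per row.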
  • NoorKhalifa Member Posts: 7 Learner I
    @BalazsBarany

    Great, that solved it. But now, when I use Write CSV, I don't get Arabic text in the output CSV file.

    I set the encoding to UTF-8 for Read CSV, for Write CSV, and for the process itself (in the parameters shown when clicking on the white canvas).

    What can I do to solve that?


  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi!

    Try using software in which you can set the import encoding. Excel is not very smart when just opening a CSV file. Its import function should also work, because it shows a dialog for selecting the encoding.

    The encoding of text files is not obvious to most software. It often needs to be specified manually. You can use an advanced editor (GVim, Notepad++ etc.) to determine if the file itself is really in UTF-8.

    Regards,
    Balázs
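
Checking whether a file is really UTF-8 can also be done with a short script instead of an editor. The following Python sketch assumes a hypothetical file name and sample row. It writes the CSV with the `utf-8-sig` codec, which prepends a byte order mark (BOM) that Excel generally uses to recognize UTF-8 when a file is opened directly, and then verifies the raw bytes.

```python
import csv
import os
import tempfile

# Hypothetical output rows: a header plus one stemmed sentence.
rows = [["text"], ["كتاب مدرس"]]
path = os.path.join(tempfile.mkdtemp(), "stemmed.csv")  # hypothetical file name

# "utf-8-sig" writes the UTF-8 BOM before the data.
with open(path, "w", encoding="utf-8-sig", newline="") as handle:
    csv.writer(handle).writerows(rows)


def is_valid_utf8(data):
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


with open(path, "rb") as handle:
    raw = handle.read()

has_bom = raw.startswith(b"\xef\xbb\xbf")
print(has_bom, is_valid_utf8(raw))

# Reading back with the same codec strips the BOM again.
with open(path, encoding="utf-8-sig", newline="") as handle:
    read_back = list(csv.reader(handle))
```

A file that passes `is_valid_utf8` but still looks wrong in a spreadsheet points to the viewer's encoding detection rather than to the file itself.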
  • jwpfau Employee, Member Posts: 279 RM Engineering
    Hi Noor,

    Excel seems to have moved the CSV import to Data → From Text/CSV.



    Greetings,
    Jonas
  • NoorKhalifa Member Posts: 7 Learner I
    @jwpfau

    Hello!

    After clicking From Text/CSV, what should I do?


  • jwpfau Employee, Member Posts: 279 RM Engineering
    Hi Noor, 

    For me, the first dialog was the "Import Data" file selector, and the second one was the CSV table preview from my screenshot.

    I fear Excel's auto-detection completely failed for your file. Is there anything in the "Open As" menu that says CSV or UTF-8?

    Greetings,
    Jonas
  • NoorKhalifa Member Posts: 7 Learner I
    @jwpfau

    I didn't manage to do that in Excel, but opening the file in Notepad showed the Arabic text.

    Thanks!
  • jwpfau Employee, Member Posts: 279 RM Engineering
    Hi Noor,

    You can force CSV parsing here.



    But you will stay in the more cumbersome Power Query Editor flow afterwards.

    Greetings,
    Jonas