
Text mining in utf-8

i_anicka Member Posts: 2 Contributor I
edited November 2018 in Help

Hello all,

 

I need to use RapidMiner for text mining in Cyrillic.
I tried setting the encoding to UTF-8, but the results are displayed as garbled characters instead of Cyrillic words.

 

Thanks,

 

 

 

Best Answer

  • i_anicka Member Posts: 2 Contributor I
    Solution Accepted

    Hi guys,

     

    I have solved my problem.

    I had set the UTF-8 encoding everywhere except at the process level.

    I changed this as well, and now it works!

     

    Thank you all for your replies.

     

    Ana,
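
A minimal Python sketch (outside RapidMiner) of why a single step with the wrong encoding is enough to produce unreadable characters. Latin-1 is only an assumed stand-in for a non-UTF-8 default; the thread does not say which encoding the process level was actually using.

```python
# Illustration only: UTF-8 Cyrillic decoded by a step that assumes another charset.
text = "Для поиска нажмите Ввод"  # sample Cyrillic string

# The bytes as they would be stored in a UTF-8 file
utf8_bytes = text.encode("utf-8")

# Hypothetical step that wrongly assumes Latin-1: every byte becomes its own character
garbled = utf8_bytes.decode("latin-1")

# Step that correctly assumes UTF-8: the original text comes back
restored = utf8_bytes.decode("utf-8")

print(repr(garbled))
print(restored)
```

Each Cyrillic letter takes two bytes in UTF-8, so the wrongly decoded string has more characters than the original and contains no Cyrillic at all.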

     

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,528 RM Data Scientist

    Hi,

    Could you post an example?
    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

    It could be that your original document isn't in UTF-8, but in another encoding.

    One way to be absolutely sure is to create a loop that uses macros to change the encoding parameter of your Process Documents operator, and then to look at all the resulting outputs. The one that looks 'right' tells you which encoding the document is really in.
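
The same idea can be sketched outside RapidMiner in a few lines of Python: decode the raw bytes with several candidate encodings and compare the results by eye. The candidate list and the helper name `decode_with_candidates` are assumptions made for this illustration, not part of RapidMiner.

```python
# Candidate encodings are an assumption; use the ones plausible for your data.
CANDIDATES = ["utf-8", "cp1251", "koi8_r", "iso8859_5", "cp866"]


def decode_with_candidates(raw: bytes, encodings=CANDIDATES):
    """Return {encoding: decoded text, or None if decoding failed}."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding at all
            results[enc] = None
    return results


if __name__ == "__main__":
    # Stand-in for bytes read from a file, e.g. open(path, "rb").read()
    raw = "Для поиска нажмите Ввод".encode("cp1251")
    for enc, text in decode_with_candidates(raw).items():
        print(f"{enc:10} -> {text!r}")
```

Note that single-byte encodings such as KOI8-R rarely fail outright; they just produce the wrong letters, so the final check is still a human looking for the readable output.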

     

     

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Agreed. I just did a quick check, and there's no problem with Cyrillic in UTF-8.

     

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="7.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.3.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="7.3.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="34">
            <parameter key="text" value="Для поиска нажмите Ввод"/>
          </operator>
          <operator activated="true" class="text:documents_to_data" compatibility="7.3.000" expanded="true" height="82" name="Documents to Data" width="90" x="179" y="34">
            <parameter key="text_attribute" value="text"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    Scott

  • arunasethupathy Member Posts: 4 Contributor I

    I want to use the Tamil language for text mining.

    Where do I have to change the UTF-8 option for this?

    I have tried at the process level, but I am unable to get it to work.

    Could anybody please give me the answer?

  • arunasethupathy Member Posts: 4 Contributor I

    To change the encoding option to UTF-8 (for processing the Tamil language),

    I have changed the encoding in the RapidMiner Studio preferences to UTF-8.

    I have simply read the document using the Read Document operator from the Text Mining extension.

    But it is not working; the screenshot is attached (doc7.docx).

    Kindly help me to sort out this problem.

    Thank you

     

     

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @arunasethupathy, Tamil is not a language I have worked with before. Could you please post your XML process and your text document (in Tamil) so I can take a look?

     

    Thank you.

     

    Scott

     

     

  • arunasethupathy Member Posts: 4 Contributor I

    Sir,

    Please find attached the sample Tamil text document.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Thank you, @arunasethupathy. Can you please also post your XML process?

     

    Scott

     

     
