StopwordfilterFile

nguyenxuanhau Member Posts: 22 Contributor II
edited November 2018 in Help
I'm using the StopwordFilterFile operator, but it does not filter many stop words, such as: với, ới, tời, đỗ

My process XML is as follows:
<?xml version="1.0" encoding="windows-1252"?>
<process version="4.6">

 <operator name="Root" class="Process" expanded="yes">
     <description text="Text Hau"/>
     <parameter key="logverbosity" value="init"/>
     <parameter key="random_seed" value="2001"/>
     <parameter key="send_mail" value="never"/>
     <parameter key="process_duration_for_mail" value="30"/>
     <parameter key="encoding" value="SYSTEM"/>
     <operator name="TextInput" class="TextInput" expanded="yes">
         <list key="texts">
           <parameter key="graphics" value="../../data/dulieu"/>
         </list>
         <parameter key="default_content_type" value=""/>
         <parameter key="default_content_encoding" value="UTF-8"/>
         <parameter key="default_content_language" value=""/>
         <parameter key="prune_below" value="1"/>
         <parameter key="prune_above" value="-1"/>
         <parameter key="vector_creation" value="TFIDF"/>
         <parameter key="use_content_attributes" value="false"/>
         <parameter key="use_given_word_list" value="false"/>
         <parameter key="return_word_list" value="false"/>
         <parameter key="id_attribute_type" value="short"/>
         <list key="namespaces">
         </list>
         <parameter key="create_text_visualizer" value="false"/>
         <parameter key="on_the_fly_pruning" value="-1"/>
         <parameter key="extend_exampleset" value="false"/>
         <operator name="StringTokenizer" class="StringTokenizer">
         </operator>
         <operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
         </operator>
         <operator name="StopwordFilterFile" class="StopwordFilterFile">
             <parameter key="file" value="../../data/dulieu/stopword/stopword.dat"/>
             <parameter key="case_sensitive" value="false"/>
         </operator>
     </operator>
 </operator>

</process>
The stopword file contains one stop word per line.
What do I need to do to get the StopwordFilterFile operator to work?

Greetings!

Answers

  • haddock Member Posts: 849 Maven
    Hi there,

    Thanks for posting the process; however, most folks now use version 5 and will not be able to load it. Upgrade to commune!

    As to your problem, my guess is that it is about the characters in those words, and whether their encoding is set correctly, both in RapidMiner and in the stopword file (I notice you use both windows-1252 and UTF-8 in your RapidMiner XML). The page http://vietunicode.sourceforge.net/main.html also details problems specific to Vietnamese. Obviously, if the same letters are encoded differently in the texts and in the stopword file, the words will not match; but if they are encoded the same way throughout, then I'd need to look into the source.

    Which I don't have, because the Text plugin has also been updated!
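
To illustrate the encoding guess above, here is a minimal Python sketch, not RapidMiner code. The helper names `load_stopwords` and `filter_tokens` are hypothetical, and the stopword file is simulated as bytes in memory. It shows two ways a Vietnamese stop word can fail to match: the file being decoded with the wrong charset, and the same letter being stored in precomposed versus decomposed Unicode form.

```python
import unicodedata


def load_stopwords(raw, encoding):
    """Decode a one-word-per-line stopword file and normalize each word to NFC.

    Hypothetical helper for illustration; RapidMiner's operator may work differently.
    """
    text = raw.decode(encoding, errors="replace")
    return {
        unicodedata.normalize("NFC", line.strip()).lower()
        for line in text.splitlines()
        if line.strip()
    }


def filter_tokens(tokens, stopwords):
    """Drop every token whose NFC-normalized, lowercased form is a stop word."""
    return [
        t for t in tokens
        if unicodedata.normalize("NFC", t).lower() not in stopwords
    ]


# Build the words in a known (precomposed, NFC) form.
voi = unicodedata.normalize("NFC", "với")
do = unicodedata.normalize("NFC", "đỗ")
tokens = ["toi", voi, do, "ban"]

# Simulated stopword file, saved as UTF-8.
raw_file = (voi + "\n" + do + "\n").encode("utf-8")

# Case 1: file read with the matching charset, so the stop words are removed.
kept_ok = filter_tokens(tokens, load_stopwords(raw_file, "utf-8"))

# Case 2: the same bytes read as windows-1252 become mojibake and match nothing.
kept_bad = filter_tokens(tokens, load_stopwords(raw_file, "windows-1252"))

# Case 3: a decomposed (NFD) token looks identical on screen but is a
# different code point sequence; normalizing both sides makes it match again.
voi_nfd = unicodedata.normalize("NFD", voi)
kept_nfd = filter_tokens([voi_nfd], load_stopwords(raw_file, "utf-8"))

print(kept_ok)
print(kept_bad)
print(voi_nfd == voi, kept_nfd)
```

If the first two results differ for your own stopword file, the charset used to read the file is the likely culprit. If only the third case applies, the text and the file probably use different Unicode forms for the same letters.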
