
"Dictionary Spanish (text mining)"

ronel74 Member Posts: 2 Contributor I
edited June 2019 in Help
Hi, I recently started using RapidMiner and I am having trouble with some of the text processing operators, because the language I am working with is Spanish.

The operators that I would like to use are:

Stemming
tokenize linguistic
filter stopwords

Are these operators available for Spanish texts?

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,531 RM Data Scientist
    Snowball stemming supports Spanish.
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • ClaraCaba Member Posts: 9 Contributor II
    Still no Filter Stopwords available in Spanish though, right? :(

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Actually, there are Spanish stopword lists that you can download from the internet and add to your process using the Filter Stopwords (Dictionary) operator.
    Just follow the operator documentation: create a file with one Spanish word per line and use that file as the dictionary.

    Here's a short example using the stopwords listed here: http://www.ranks.nl/stopwords/spanish
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.0.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Root">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="1969"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="UTF-8"/>
        <parameter key="parallelize_main_process" value="false"/>
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="7.0.000" expanded="true" height="68" name="Spanish Stopwords" width="90" x="45" y="187">
            <parameter key="text" value="un&#10;una&#10;unas&#10;unos&#10;uno&#10;sobre&#10;todo&#10;también&#10;tras&#10;otro&#10;algún&#10;alguno&#10;alguna&#10;algunos&#10;algunas&#10;ser&#10;es&#10;soy&#10;eres&#10;somos&#10;sois&#10;estoy&#10;esta&#10;estamos&#10;estais&#10;estan&#10;como&#10;en&#10;para&#10;atras&#10;porque&#10;por qué&#10;estado&#10;estaba&#10;ante&#10;antes&#10;siendo&#10;ambos&#10;pero&#10;por&#10;poder&#10;puede&#10;puedo&#10;podemos&#10;podeis&#10;pueden&#10;fui&#10;fue&#10;fuimos&#10;fueron&#10;hacer&#10;hago&#10;hace&#10;hacemos&#10;haceis&#10;hacen&#10;cada&#10;fin&#10;incluso&#10;primero&#10;desde&#10;conseguir&#10;consigo&#10;consigue&#10;consigues&#10;conseguimos&#10;consiguen&#10;ir&#10;voy&#10;va&#10;vamos&#10;vais&#10;van&#10;vaya&#10;gueno&#10;ha&#10;tener&#10;tengo&#10;tiene&#10;tenemos&#10;teneis&#10;tienen&#10;el&#10;la&#10;lo&#10;las&#10;los&#10;su&#10;aqui&#10;mio&#10;tuyo&#10;ellos&#10;ellas&#10;nos&#10;nosotros&#10;vosotros&#10;vosotras&#10;si&#10;dentro&#10;solo&#10;solamente&#10;saber&#10;sabes&#10;sabe&#10;sabemos&#10;sabeis&#10;saben&#10;ultimo&#10;largo&#10;bastante&#10;haces&#10;muchos&#10;aquellos&#10;aquellas&#10;sus&#10;entonces&#10;tiempo&#10;verdad&#10;verdadero&#10;verdadera&#10;cierto&#10;ciertos&#10;cierta&#10;ciertas&#10;intentar&#10;intento&#10;intenta&#10;intentas&#10;intentamos&#10;intentais&#10;intentan&#10;dos&#10;bajo&#10;arriba&#10;encima&#10;usar&#10;uso&#10;usas&#10;usa&#10;usamos&#10;usais&#10;usan&#10;emplear&#10;empleo&#10;empleas&#10;emplean&#10;ampleamos&#10;empleais&#10;valor&#10;muy&#10;era&#10;eras&#10;eramos&#10;eran&#10;modo&#10;bien&#10;cual&#10;cuando&#10;donde&#10;mientras&#10;quien&#10;con&#10;entre&#10;sin&#10;trabajo&#10;trabajar&#10;trabajas&#10;trabaja&#10;trabajamos&#10;trabajais&#10;trabajan&#10;podria&#10;podrias&#10;podriamos&#10;podrian&#10;podriais&#10;yo&#10;aquel"/>
            <parameter key="add label" value="false"/>
            <parameter key="label_type" value="nominal"/>
          </operator>
          <operator activated="true" class="text:write_document" compatibility="7.0.000" expanded="true" height="82" name="Create a file of these words" width="90" x="179" y="187">
            <parameter key="overwrite" value="true"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <operator activated="true" class="text:read_document" compatibility="7.0.000" expanded="true" height="68" name="Read Document" width="90" x="45" y="34">
            <parameter key="file" value="myFile.txt"/>
            <parameter key="extract_text_only" value="true"/>
            <parameter key="use_file_extension_as_type" value="true"/>
            <parameter key="content_type" value="txt"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="7.0.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
            <parameter key="mode" value="non letters"/>
            <parameter key="characters" value=".:"/>
            <parameter key="language" value="English"/>
            <parameter key="max_token_length" value="3"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.0.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="380" y="34">
            <parameter key="case_sensitive" value="false"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <connect from_op="Spanish Stopwords" from_port="output" to_op="Create a file of these words" to_port="document"/>
          <connect from_op="Create a file of these words" from_port="file" to_op="Filter Stopwords (Dictionary)" to_port="file"/>
          <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
          <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • ClaraCaba Member Posts: 9 Contributor II
    Thank you very much, I did that and it worked perfectly.
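
For readers who already have a Spanish stopword file on disk, the two suggestions in this thread (a dictionary-based stopword filter and Snowball stemming) can probably be combined in one process. The following is only a minimal, untested sketch under several assumptions: `spanish_stopwords.txt` is a hypothetical file with one word per line, `myFile.txt` is the placeholder input reused from the example above, and the `file` parameter key, the `text:stem_snowball` operator class, and its `language` value `Spanish` are assumed from the Text Processing extension and should be checked against the operator documentation of your installed version.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.001">
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Root">
    <process expanded="true">
      <!-- Placeholder input file taken from the example above; replace with your own Spanish text. -->
      <operator activated="true" class="text:read_document" compatibility="7.0.000" expanded="true" name="Read Document">
        <parameter key="file" value="myFile.txt"/>
        <parameter key="encoding" value="UTF-8"/>
      </operator>
      <!-- Split the text on any non-letter character, as in the example above. -->
      <operator activated="true" class="text:tokenize" compatibility="7.0.000" expanded="true" name="Tokenize">
        <parameter key="mode" value="non letters"/>
      </operator>
      <!-- Assumption: the dictionary is supplied through a "file" parameter.
           spanish_stopwords.txt is a hypothetical file with one stopword per line. -->
      <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.0.000" expanded="true" name="Filter Stopwords (Dictionary)">
        <parameter key="file" value="spanish_stopwords.txt"/>
        <parameter key="case_sensitive" value="false"/>
        <parameter key="encoding" value="UTF-8"/>
      </operator>
      <!-- Assumption: operator class and language value as exposed by the Text Processing extension. -->
      <operator activated="true" class="text:stem_snowball" compatibility="7.0.000" expanded="true" name="Stem (Snowball)">
        <parameter key="language" value="Spanish"/>
      </operator>
      <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
      <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
      <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
      <connect from_op="Stem (Snowball)" from_port="document" to_port="result 1"/>
    </process>
  </operator>
</process>
```

The stopword filter is placed before the stemmer on purpose: the dictionary contains full word forms, so it has to see the tokens before stemming shortens them.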