Why is Stem (Dictionary) not working?

roger_rutishaus Member Posts: 8 Contributor I
edited December 2018 in Product Feedback - Resolved
Hi,

I use "Stem (Dictionary)", to which I connected an "Open File" operator that loads a .txt file.

The .txt file contains entries like these:

jugendlich:jugendlich jugendliche jugendlichem jugendlichen jugendlicher jugendliches 
jugendpflegerisch:jugendpflegerisch jugendpflegerische jugendpflegerischem jugendpflegerischen jugendpflegerischer jugendpflegerisches 
jugoslawisch:jugoslawisch jugoslawische jugoslawischem jugoslawischen jugoslawischer jugoslawisches 
jung:jung junge jungem jungen 

The stemmer does not work. The word list result still contains "jugendlichen" instead of "jugendlich".
What am I doing wrong? Thanks for your help!
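
A minimal Python sketch of the behaviour expected from a file in this format: every form listed after the colon should be replaced by the stem before the colon. This is only an illustration using plain exact-match lookup; it is not how RapidMiner implements "Stem (Dictionary)", and the function names are made up for this example.

```python
def parse_stem_dictionary(text):
    """Map every inflected form to its stem from 'stem:form1 form2 ...' lines."""
    form_to_stem = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        stem, forms = line.split(":", 1)
        for form in forms.split():
            form_to_stem[form] = stem.strip()
    return form_to_stem


def stem_tokens(tokens, form_to_stem):
    """Replace each token by its stem; unknown tokens pass through unchanged."""
    return [form_to_stem.get(token, token) for token in tokens]


# Two entries copied from the dictionary file shown above.
dictionary_text = """\
jugendlich:jugendlich jugendliche jugendlichem jugendlichen jugendlicher jugendliches
jung:jung junge jungem jungen
"""

mapping = parse_stem_dictionary(dictionary_text)
print(stem_tokens(["jugendlichen", "jungem", "haus"], mapping))
# prints ['jugendlich', 'jung', 'haus']
```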

Roger


complete settings:
<div class="Spoiler"><pre class="CodeBlock"><?xml version="1.0" encoding="UTF-8"?><process version="9.0.003"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process"> <process expanded="true"> <operator activated="true" class="concurrency:loop_files" compatibility="8.2.000" expanded="true" height="82" name="Loop Files (2)" width="90" x="45" y="34"> <parameter key="directory" value="D:\Dropbox\_BT\Textanalyse\_Quelle\Korpus\Multimediaproduktion\Web"/> <parameter key="recursive" value="true"/> <process expanded="true"> <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document" width="90" x="112" y="34"> <parameter key="extract_text_only" value="false"/> <parameter key="content_type" value="html"/> <parameter key="encoding" value="UTF-8"/> </operator> <connect from_port="file object" to_op="Read Document" to_port="file"/> <connect from_op="Read Document" from_port="output" to_port="output 1"/> <portSpacing port="source_file object" spacing="0"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> <portSpacing port="sink_output 2" spacing="0"/> </process> <description align="center" color="transparent" colored="false" width="126">HTML-Dateien</description> </operator> <operator activated="true" class="loop_collection" compatibility="9.0.003" expanded="true" height="82" name="Loop Collection" width="90" x="179" y="34"> <process expanded="true"> <operator activated="true" class="text:html_to_xml" compatibility="8.1.000" expanded="true" height="68" name="HTML to XML" width="90" x="45" y="34"/> <operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="179" y="34"> <parameter key="query_type" value="XPath"/> <list key="string_machting_queries"/> <list key="regular_expression_queries"> 
<parameter key="body" value="&lt;body.*[\s\S]+&lt;/body&gt;"/> </list> <list key="regular_region_queries"> <parameter key="body" value="&lt;body\.*&gt;.&lt;\\/body&gt;"/> </list> <list key="xpath_queries"> <parameter key="inhalt_html-dokumente" value="//h:div[@id=&quot;content_center&quot;]//h:div[@class=&quot;conttext&quot;][text()]"/> </list> <list key="namespaces"/> <list key="index_queries"/> <list key="jsonpath_queries"/> <process expanded="true"> <operator activated="true" class="web:extract_html_text_content" compatibility="9.0.000" expanded="true" height="68" name="Extract Content (2)" width="90" x="112" y="34"> <parameter key="minimum_text_block_length" value="6"/> </operator> <operator activated="true" class="text:filter_documents_by_content" compatibility="8.1.000" expanded="true" height="82" name="Filter Documents (by Content)" width="90" x="246" y="34"> <parameter key="condition" value="contains match"/> <parameter key="regular_expression" value="."/> </operator> <connect from_port="segment" to_op="Extract Content (2)" to_port="document"/> <connect from_op="Extract Content (2)" from_port="document" to_op="Filter Documents (by Content)" to_port="documents 1"/> <connect from_op="Filter Documents (by Content)" from_port="documents" to_port="document 1"/> <portSpacing port="source_segment" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> </process> </operator> <connect from_port="single" to_op="HTML to XML" to_port="document"/> <connect from_op="HTML to XML" from_port="document" to_op="Cut Document" to_port="document"/> <connect from_op="Cut Document" from_port="documents" to_port="output 1"/> <portSpacing port="source_single" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> <portSpacing port="sink_output 2" spacing="0"/> </process> <description align="center" color="transparent" colored="false" width="126">Nur relevanter Text behalten</description> </operator> <operator 
activated="true" class="concurrency:loop_files" compatibility="8.2.000" expanded="true" height="82" name="Loop Files (3)" width="90" x="45" y="187"> <parameter key="directory" value="D:\Dropbox\_BT\Textanalyse\_Quelle\Korpus\Multimediaproduktion\Projekt"/> <parameter key="recursive" value="true"/> <process expanded="true"> <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document (2)" width="90" x="112" y="34"> <parameter key="encoding" value="UTF-8"/> </operator> <connect from_port="file object" to_op="Read Document (2)" to_port="file"/> <connect from_op="Read Document (2)" from_port="output" to_port="output 1"/> <portSpacing port="source_file object" spacing="0"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> <portSpacing port="sink_output 2" spacing="0"/> </process> <description align="center" color="transparent" colored="false" width="126">TXT-Dateien</description> </operator> <operator activated="true" class="collect" compatibility="9.0.003" expanded="true" height="103" name="Collect (2)" width="90" x="313" y="136"> <description align="center" color="transparent" colored="false" width="126">Quelldokumente sammeln</description> </operator> <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents (2)" width="90" x="447" y="136"> <parameter key="keep_text" value="true"/> <parameter key="prune_method" value="absolute"/> <parameter key="prune_below_absolute" value="2"/> <parameter key="prune_above_absolute" value="99999"/> <process expanded="true"> <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="45" y="34"> <parameter key="mode" value="regular expression"/> <parameter key="characters" value=" "/> <parameter key="expression" 
value="((-[^a-zA-Z])+)|(([^a-zA-Z]{1,}-)+)|([^a-zA-Zäöü0-9-]+)"/> <parameter key="language" value="German"/> </operator> <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="179" y="34"/> <operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="313" y="34"> <parameter key="min_chars" value="3"/> <parameter key="max_chars" value="100"/> </operator> <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="8.1.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="179" y="136"> <parameter key="file" value="D:\Dropbox\_BT\Textanalyse\_RapidMiner Tools\stopwords-de-solariz-small.txt"/> <parameter key="encoding" value="UTF-8"/> </operator> <operator activated="true" class="text:filter_stopwords_german" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (German)" width="90" x="313" y="136"/> <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="447" y="136"/> <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="313" y="238"> <parameter key="condition" value="contains match"/> <parameter key="string" value="^[0-9]"/> <parameter key="regular_expression" value="^[^0-9].*"/> </operator> <operator activated="false" class="text:generate_n_grams_terms" compatibility="8.1.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="447" y="238"> <parameter key="max_length" value="3"/> </operator> <operator activated="false" class="text:filter_tokens_by_pos" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by POS Tags)" width="90" x="514" y="340"> <parameter 
key="language" value="German"/> <parameter key="expression" value="NE"/> <parameter key="invert_filter" value="true"/> </operator> <operator activated="false" class="text:stem_german" compatibility="8.1.000" expanded="true" height="68" name="Stem (German)" width="90" x="447" y="493"/> <operator activated="true" class="open_file" compatibility="9.0.003" expanded="true" height="68" name="Open File" width="90" x="112" y="544"> <parameter key="filename" value="D:\Dropbox\_BT\Textanalyse\_RapidMiner Tools\rogerwordlist3.txt"/> </operator> <operator activated="true" class="text:stem_dictionary" compatibility="8.1.000" expanded="true" height="82" name="Stem (Dictionary)" width="90" x="246" y="442"/> <operator activated="true" class="text:extract_token_number" compatibility="8.1.000" expanded="true" height="68" name="Extract Token Number" width="90" x="648" y="34"/> <connect from_port="document" to_op="Tokenize (2)" to_port="document"/> <connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases" to_port="document"/> <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (2)" to_port="document"/> <connect from_op="Filter Tokens (2)" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/> <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/> <connect from_op="Filter Stopwords (German)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/> <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/> <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Stem (Dictionary)" to_port="document"/> <connect from_op="Open File" from_port="file" to_op="Stem (Dictionary)" to_port="file"/> <connect from_op="Stem (Dictionary)" from_port="document" to_op="Extract Token Number" to_port="document"/> <connect from_op="Extract Token Number" 
from_port="document" to_port="document 1"/> <portSpacing port="source_document" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> </process> <description align="center" color="transparent" colored="false" width="126">Dokumente verarbeiten</description> </operator> <operator activated="true" class="write_excel" compatibility="9.0.003" expanded="true" height="82" name="Write Excel (2)" width="90" x="514" y="34"> <parameter key="excel_file" value="D:\Dropbox\_BT\Textanalyse\terms-multimediaprod.xlsx"/> <parameter key="number_format" value="#.000"/> </operator> <operator activated="false" class="text:process_documents" compatibility="8.1.000" expanded="true" height="82" name="Process Documents" width="90" x="246" y="595"> <process expanded="true"> <connect from_port="document" to_port="document 1"/> <portSpacing port="source_document" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> </process> </operator> <connect from_op="Loop Files (2)" from_port="output 1" to_op="Loop Collection" to_port="collection"/> <connect from_op="Loop Collection" from_port="output 1" to_op="Collect (2)" to_port="input 1"/> <connect from_op="Loop Files (3)" from_port="output 1" to_op="Collect (2)" to_port="input 2"/> <connect from_op="Collect (2)" from_port="collection" to_op="Process Documents (2)" to_port="documents 1"/> <connect from_op="Process Documents (2)" from_port="example set" to_op="Write Excel (2)" to_port="input"/> <connect from_op="Process Documents (2)" from_port="word list" to_port="result 2"/> <connect from_op="Write Excel (2)" from_port="through" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> <portSpacing port="sink_result 3" spacing="0"/> </process> </operator> </process></pre></div>

Declined · Last Updated

No activity or votes since December 2018. Please comment and cc sgenzer if this should be reopened. IC-1170

Comments

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    This may be a bug; I will let the developers speak to that.
    But in the meantime, if you want a workaround, you can try the Stem Tokens Using ExampleSet operator, which lets you define your desired stemming in a normal dataset. This operator is part of the free Operator Toolbox extension.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    hi @roger_rutishaus - can you please provide the txt file and maybe a simpler process so I can reproduce?

    Scott

  • roger_rutishaus Member Posts: 8 Contributor I
    Thank you both for your answers!

    @Telcontar120
    I don't know which operator you mean (I found nothing by the name of "stem tokens").

    @sgenzer
    The stemming file is attached.
    The new, simplified process is as follows:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="concurrency:loop_files" compatibility="8.2.000" expanded="true" height="82" name="Loop Files (3)" width="90" x="45" y="187">
            <parameter key="directory" value="D:\Dropbox\_BT\Textanalyse\_Quelle\Korpus\Multimediaproduktion\Projekt"/>
            <parameter key="recursive" value="true"/>
            <process expanded="true">
              <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document (2)" width="90" x="112" y="34">
                <parameter key="encoding" value="UTF-8"/>
              </operator>
              <connect from_port="file object" to_op="Read Document (2)" to_port="file"/>
              <connect from_op="Read Document (2)" from_port="output" to_port="output 1"/>
              <portSpacing port="source_file object" spacing="0"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
            <description align="center" color="transparent" colored="false" width="126">TXT-Dateien</description>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents (2)" width="90" x="447" y="136">
            <parameter key="keep_text" value="true"/>
            <parameter key="prune_method" value="absolute"/>
            <parameter key="prune_below_absolute" value="2"/>
            <parameter key="prune_above_absolute" value="99999"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="45" y="34">
                <parameter key="mode" value="regular expression"/>
                <parameter key="characters" value=" "/>
                <parameter key="expression" value="((-[^a-zA-Z])+)|(([^a-zA-Z]{1,}-)+)|([^a-zA-Zäöü0-9-]+)"/>
                <parameter key="language" value="German"/>
              </operator>
              <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="179" y="34"/>
              <operator activated="true" class="open_file" compatibility="9.0.003" expanded="true" height="68" name="Open File" width="90" x="112" y="544">
                <parameter key="filename" value="D:\Dropbox\_BT\Textanalyse\_RapidMiner Tools\rogerwordlist3.txt"/>
              </operator>
              <operator activated="true" class="text:stem_dictionary" compatibility="8.1.000" expanded="true" height="82" name="Stem (Dictionary)" width="90" x="246" y="442"/>
              <operator activated="true" class="text:extract_token_number" compatibility="8.1.000" expanded="true" height="68" name="Extract Token Number" width="90" x="648" y="34"/>
              <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
              <connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Stem (Dictionary)" to_port="document"/>
              <connect from_op="Open File" from_port="file" to_op="Stem (Dictionary)" to_port="file"/>
              <connect from_op="Stem (Dictionary)" from_port="document" to_op="Extract Token Number" to_port="document"/>
              <connect from_op="Extract Token Number" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
            <description align="center" color="transparent" colored="false" width="126">Dokumente verarbeiten</description>
          </operator>
          <connect from_op="Loop Files (3)" from_port="output 1" to_op="Process Documents (2)" to_port="documents 1"/>
          <connect from_op="Process Documents (2)" from_port="example set" to_port="result 1"/>
          <connect from_op="Process Documents (2)" from_port="word list" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    


  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @roger_rutishaus,

    To expand on @Telcontar120's suggestion, you have to:
     - Go to the Marketplace and install the Operator Toolbox extension.
     - Then follow the instructions in this screenshot:
     
    I hope it helps,

    Regards,

    Lionel
  • roger_rutishaus Member Posts: 8 Contributor I
    edited December 2018
    hi @lionelderkrikor and @Telcontar120 

    Thank you. I now have the "Operator Toolbox way" working.
    As far as I can see, it can be used to create custom stemming rules, but it doesn't look as if it can be used for dictionary-based stemming, right?

    @sgenzer have you had time to look at the issue yet?

    thanks again everyone involved for your time!

    regards, roger
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Hi @roger_rutishaus, this looks like a bug to me. I ran it a few different ways and am pushing this to the dev team. Thank you for the report. In the meantime, the Operator Toolbox workaround looks like the way to go.

    Scott

  • roger_rutishaus Member Posts: 8 Contributor I
    Thanks @sgenzer
    I don't think Operator Toolbox is the way to go, as I can't find a way to do dictionary-based stemming with that operator (only rule-based stemming).
    So I am looking forward to a fix for the "Stem (Dictionary)" operator :-)
    Roger
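
One possible way to reuse the dictionary file with the dataset-based workaround is to flatten it into one row per inflected form. The Python sketch below is only an illustration under assumptions: the column names `token` and `stem` are placeholders, and whether the Operator Toolbox operator matches such rows as exact words or as patterns is not confirmed anywhere in this thread, so check the operator's help text before relying on it.

```python
import csv
import io


def dictionary_to_rows(lines):
    """Yield (form, stem) pairs from 'stem:form1 form2 ...' lines."""
    for line in lines:
        stem, sep, forms = line.strip().partition(":")
        if not sep:
            continue  # blank line or line without a colon
        for form in forms.split():
            yield form, stem


def write_lookup_csv(lines, out, header=("token", "stem")):
    """Write the pairs as CSV; the header names are hypothetical placeholders."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(dictionary_to_rows(lines))


# Shortened sample entries in the same format as the dictionary file.
sample = [
    "jugoslawisch:jugoslawisch jugoslawische jugoslawischem",
    "jung:jung junge jungem jungen",
]
buffer = io.StringIO()
write_lookup_csv(sample, buffer)
print(buffer.getvalue())
```

To process the real file, the same function can be called with an opened file object in place of `sample` (for example one opened with `encoding="utf-8"`), and the resulting CSV can then be read into RapidMiner as a regular dataset.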