Token Replace

ema Member Posts: 33 Maven
edited November 2018 in Help
Hi,

can anybody give me an example of the parameters for Token Replace?

For example, to replace a word that ends with "s" with the same word without the "s":

dances -> dance

What would I put in the replace dictionary?

Thank you

Answers

  • ema Member Posts: 33 Maven
    Hi,

    I tried Token Replace. It does the replacement but does not remove the original word.

    For example, if "dancing" is to be replaced by "danc", the output contains both "dancing" and "danc".

    Thank you
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    did you use the operator TokenReplace before a tokenizer?

    Here is an example of the operator added to one of the example processes delivered with the Text plugin:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="TextInput" class="TextInput" expanded="yes">
            <list key="texts">
              <parameter key="graphics" value="../data/newsgroup/graphics"/>
              <parameter key="hardware" value="../data/newsgroup/hardware"/>
            </list>
            <parameter key="default_content_encoding" value="ISO-8859-1"/>
            <parameter key="prune_below" value="2"/>
            <list key="namespaces">
            </list>
            <parameter key="create_text_visualizer" value="true"/>
            <parameter key="on_the_fly_pruning" value="3"/>
            <operator name="TokenReplace" class="TokenReplace">
                <list key="replace_dictionary">
                  <parameter key="cantaloupe" value="cantaHORST"/>
                </list>
            </operator>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
            </operator>
            <operator name="TokenLengthFilter" class="TokenLengthFilter">
                <parameter key="min_chars" value="3"/>
            </operator>
            <operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
            </operator>
            <operator name="TermNGramGenerator" class="TermNGramGenerator">
            </operator>
        </operator>
    </operator>
    Cheers,
    Ingo
  • mskinner Member Posts: 10 Contributor I

    this does not seem to work

  • pschlunder Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 96 RM Research

    Here is an up-to-date version:

    <operator activated="true" class="process" compatibility="5.0.000" expanded="true" name="Root">
      <process expanded="true">
        <operator activated="true" class="text:create_document" compatibility="7.5.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="34">
          <parameter key="text" value="Some text about different kind of dances people might enjoy."/>
        </operator>
        <operator activated="true" class="text:process_documents" compatibility="7.5.000" expanded="true" height="103" name="Process Documents" width="90" x="313" y="34">
          <process expanded="true">
            <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
            <operator activated="true" class="text:replace_tokens" compatibility="7.5.000" expanded="true" height="68" name="Replace Tokens" width="90" x="380" y="34">
              <list key="replace_dictionary">
                <parameter key="([a-zA-Z]+)s" value="$1"/>
              </list>
            </operator>
            <connect from_port="document" to_op="Tokenize" to_port="document"/>
            <connect from_op="Tokenize" from_port="document" to_op="Replace Tokens" to_port="document"/>
            <connect from_op="Replace Tokens" from_port="document" to_port="document 1"/>
            <portSpacing port="source_document" spacing="0"/>
            <portSpacing port="sink_document 1" spacing="0"/>
            <portSpacing port="sink_document 2" spacing="0"/>
          </process>
        </operator>
        <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
        <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
        <portSpacing port="source_input 1" spacing="0"/>
        <portSpacing port="sink_result 1" spacing="0"/>
        <portSpacing port="sink_result 2" spacing="0"/>
      </process>
    </operator>

    Remark: Make sure to download the Text Processing Extension from the Marketplace in order for this solution to work.

     

    Key element:

    To extract the part of a token that matches a certain criterion, use the capture-group feature of regular expressions. Here we identify tokens ending in 's' with the expression ([a-zA-Z]+)s and refer to the targeted substring by the group identifier $1.
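
To see what the pattern and the $1 reference do outside of RapidMiner, here is a minimal sketch in plain Java. It assumes that the operator uses Java regular expression syntax and that a dictionary entry is applied only when the pattern matches the whole token. Both are assumptions, not something confirmed in this thread. The class name ReplaceTokensSketch and the method stripTrailingS are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceTokensSketch {
    // Same pattern as the replace_dictionary key above: letters followed by a final 's'.
    private static final Pattern TRAILING_S = Pattern.compile("([a-zA-Z]+)s");

    // Hypothetical helper: rewrites the token to capture group 1 ("$1"),
    // but only if the whole token matches the pattern (assumed behaviour).
    public static String stripTrailingS(String token) {
        Matcher m = TRAILING_S.matcher(token);
        return m.matches() ? m.replaceFirst("$1") : token;
    }

    public static void main(String[] args) {
        for (String token : new String[] {"dances", "people", "class"}) {
            System.out.println(token + " -> " + stripTrailingS(token));
        }
        // dances -> dance
        // people -> people
        // class -> clas
    }
}
```

Note that the pattern strips the final 's' from any all-letter token, so "class" becomes "clas". If that is not wanted, a stricter pattern or a stemming operator may be a better fit.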

     

    Hope it helps.
