[SOLVED] Text Processing - Tokenize: keep word order

CharlieFirpo Member Posts: 48 Contributor II
edited September 2019 in Help
Dear All!

Can anybody help me tokenize a text in a way that preserves the original word order?
I have a sample text like "delta gamma alpha beta". I use a Process Documents operator with a Tokenize operator inside it. I create a word vector, which becomes an example set after a WordList to Data operator. Unfortunately, the result is an alphabetically ordered list: 'alpha; beta; delta; gamma' [first, second, third, fourth rows]. I want the original word order, i.e. an example set where the first example is 'delta', the second is 'gamma', the third is 'alpha', and the fourth is 'beta'. Without the WordList to Data operator I get a WordList, which is also ordered alphabetically.
Of course this can be solved with a Loop operator, but that is cumbersome and not efficient.

So how can I tokenize in a way that preserves the original word order?

Thank you!!


    awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello CharlieFirpo

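    How about the following process? Cut Document splits the text with the regular expression (\S+), producing one document per word in the original order, and Documents to Data turns those documents into an example set.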
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.001">
      <operator activated="true" class="process" compatibility="6.0.001" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document" width="90" x="112" y="75">
            <parameter key="text" value="delta gamma beta alpha&#10;delta&#10;eta&#10;alpha&#10;"/>
          </operator>
          <operator activated="true" class="text:cut_document" compatibility="5.3.002" expanded="true" height="60" name="Cut Document" width="90" x="112" y="165">
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries">
              <parameter key="text" value="(\S+)"/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="179" y="30"/>
              <connect from_port="segment" to_op="Tokenize (2)" to_port="document"/>
              <connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:documents_to_data" compatibility="5.3.002" expanded="true" height="76" name="Documents to Data" width="90" x="246" y="75">
            <parameter key="text_attribute" value="text"/>
            <parameter key="add_meta_information" value="false"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    CharlieFirpo Member Posts: 48 Contributor II
    Thank you!

    It works perfectly! I changed the 'mode' parameter of the Tokenize operator inside Cut Document to 'specify characters' (characters = . ,;:) so that numbers in the input text are handled as well.

    Have a nice day!
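
    For anyone who wants to reproduce the change described above, the Tokenize operator nested in Cut Document might look roughly like this in the process XML. This is only a sketch: the thread does not show the modified operator, so the parameter keys "mode" and "characters" are assumptions based on the parameter names in the Tokenize operator's panel and should be checked against your RapidMiner version.

    ```xml
    <!-- Sketch only: replaces the self-closing "Tokenize (2)" operator inside Cut Document. -->
    <!-- Assumption: the keys "mode" and "characters" match the Tokenize parameter names. -->
    <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="179" y="30">
      <!-- Split only on the listed characters, so digits are not treated as separators -->
      <parameter key="mode" value="specify characters"/>
      <parameter key="characters" value=". ,;:"/>
    </operator>
    ```

    With the default mode, the operator splits on every non-letter character, so numbers would likely be dropped from the tokens. Listing the separator characters explicitly avoids that.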