[SOLVED] Text Processing - Tokenize: keep word order

Regular Contributor

Dear All!

Can anybody help me tokenize a text in a way that preserves the original word order?
I have a sample text like "delta gamma alpha beta". I use a Process Documents operator with a Tokenize operator inside it. This creates a word vector, which becomes an example set after a WordList to Data operator. Unfortunately, the result is an alphabetically ordered list: 'alpha; beta; delta; gamma' [first, second, third, fourth rows]. I want the original word order instead, i.e. an example set where the first example is 'delta', the second is 'gamma', the third is 'alpha', and the fourth is 'beta'. Without the WordList to Data operator I get a WordList, which is also alphabetically ordered.
Of course this can be solved with a Loop operator, but that is cumbersome and inefficient.

So how can I tokenize while keeping the original word order?

Thank you!

Super Contributor

Re: Text Processing - Tokenize: keep word order

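Hello CharlieFirpo,

How about the following process? Cut Document with the regular expression (\S+) emits each whitespace-separated word as its own document, in the order it appears in the text, and Documents to Data then turns those documents into one example per word: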

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.001">
  <operator activated="true" class="process" compatibility="6.0.001" expanded="true" name="Process">
    <process expanded="true">
      <!-- Sample input text -->
      <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document" width="90" x="112" y="75">
        <parameter key="text" value="delta gamma beta alpha&#10;delta&#10;eta&#10;alpha&#10;"/>
      </operator>
      <!-- (\S+) matches each run of non-whitespace characters, so every word becomes its own segment -->
      <operator activated="true" class="text:cut_document" compatibility="5.3.002" expanded="true" height="60" name="Cut Document" width="90" x="112" y="165">
        <parameter key="query_type" value="Regular Expression"/>
        <list key="string_machting_queries"/>
        <list key="regular_expression_queries">
          <parameter key="text" value="(\S+)"/>
        </list>
        <list key="regular_region_queries"/>
        <list key="xpath_queries"/>
        <list key="namespaces"/>
        <list key="index_queries"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="179" y="30"/>
          <connect from_port="segment" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_segment" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <!-- One example per segment, written to the "text" attribute -->
      <operator activated="true" class="text:documents_to_data" compatibility="5.3.002" expanded="true" height="76" name="Documents to Data" width="90" x="246" y="75">
        <parameter key="text_attribute" value="text"/>
        <parameter key="add_meta_information" value="false"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Cut Document" to_port="document"/>
      <connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>


Regular Contributor

Re: Text Processing - Tokenize: keep word order

Thank you!

It works perfectly! I changed the 'mode' parameter of the Tokenize operator inside Cut Document to 'specify characters' (with the characters '. ,;:') so that numbers in the input text are handled as well.
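
For anyone who wants to reproduce that change, the inner Tokenize operator might look roughly like the fragment below. This is only a sketch: the parameter keys 'mode' and 'characters' are assumed here to be the names the Text Processing extension writes into the process XML, so it is worth comparing against an export from your own installation.

```xml
<!-- Sketch only: the keys "mode" and "characters" are assumptions, check them against your own exported process -->
<operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="179" y="30">
  <!-- Split only on the listed characters instead of on every non-letter, so digits stay inside tokens -->
  <parameter key="mode" value="specify characters"/>
  <parameter key="characters" value=". ,;:"/>
</operator>
```

This would replace the self-closing 'Tokenize (2)' operator inside Cut Document; the connections and port spacing stay the same.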

Have a nice day!