
"Split a document in its sentences"

mahdilashkari Member Posts: 3 Contributor I
edited June 2019 in Help
It has bothered me because I think it should be simple, but I have not found a solution for it.
I have a document and I want to split it into its sentences. I used the Tokenize operator, but it only gives me a colored document. What I want is to extract the sentences and put each sentence into an example set. My process is listed below.
Thanks a lot!

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document" width="90" x="179" y="210">
        <parameter key="text" value="He went to USA. I went to Germany."/>
        <parameter key="add label" value="false"/>
        <parameter key="label_type" value="nominal"/>
      </operator>
      <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize" width="90" x="447" y="210">
        <parameter key="mode" value="linguistic sentences"/>
        <parameter key="characters" value=".:"/>
        <parameter key="language" value="English"/>
        <parameter key="max_token_length" value="3"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Tokenize" to_port="document"/>
      <connect from_op="Tokenize" from_port="document" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
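
To make clear what I am after, here is a rough plain-Python sketch of the result: one row per sentence (the naive period split is only an illustration, not how the Tokenize operator works):

text = "He went to USA. I went to Germany."

# naive sentence split on '.' (assumption: no abbreviations in the text)
sentences = [s.strip() for s in text.split(".") if s.strip()]

# each sentence should become one row of the example set
for i, sentence in enumerate(sentences, start=1):
    print(i, sentence)
# 1 He went to USA
# 2 I went to Germany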

Answers

  • Skirzynski Member Posts: 164 Maven
    The tokens are colored to show a preview of the tokenization. If you want to create an example set, you have to create a word list (which in your case is a sentence list); the "WordList to Data" operator then turns it into an example set. See the process below, followed by a rough analogy of that word-list step:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.009">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.009" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="5.3.001" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
            <parameter key="text" value="He went to USA. I went to Germany."/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.3.001" expanded="true" height="94" name="Process Documents" width="90" x="179" y="30">
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.001" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30">
                <parameter key="mode" value="linguistic sentences"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:wordlist_to_data" compatibility="5.3.001" expanded="true" height="76" name="WordList to Data" width="90" x="313" y="30"/>
          <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="word list" to_op="WordList to Data" to_port="word list"/>
          <connect from_op="WordList to Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
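
    Conceptually, the word list collects the distinct tokens (here: sentences) with their occurrence counts, and WordList to Data turns each entry into one row of a data table. A rough plain-Python analogy of that idea (a sketch only, not the operator's actual implementation; the period-based split stands in for sentence tokenization):

    from collections import Counter

    documents = ["He went to USA. I went to Germany."]

    # build the "word list" (here: a sentence list with counts)
    wordlist = Counter()
    for doc in documents:
        tokens = [s.strip() for s in doc.split(".") if s.strip()]
        wordlist.update(tokens)

    # "WordList to Data": one row per list entry plus its total occurrences
    for token, count in wordlist.items():
        print(token, count)
    # He went to USA 1
    # I went to Germany 1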
  • mahdilashkari Member Posts: 3 Contributor I
    Thanks, Marcin!
    Your solution has solved my problem, but I also want to keep the original text as an attribute for each split sentence. What should I do?
    Thanks a lot again.
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    In this case you have to:
    - use an example set that contains all documents
    - add an ID to the data
    - split the data using Cut Document*
    - activate keep_text in Process Documents from Data
    - join the original document data to the split data

    *) In Cut Document, you have to manually specify a regular expression that detects linguistic sentences. As you can see, the one I invented is not yet perfect (a rough refinement is sketched after the process below).
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="subprocess" compatibility="5.3.008" expanded="true" height="76" name="Generate Data" width="90" x="45" y="75">
            <process expanded="true">
              <operator activated="true" class="generate_data_user_specification" compatibility="5.3.008" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="45" y="30">
                <list key="attribute_values">
                  <parameter key="doc" value="&quot;He went to USA. I went to Germany.&quot;"/>
                </list>
                <list key="set_additional_roles"/>
              </operator>
              <operator activated="true" class="generate_data_user_specification" compatibility="5.3.008" expanded="true" height="60" name="Generate Data by User Specification (2)" width="90" x="45" y="120">
                <list key="attribute_values">
                  <parameter key="doc" value="&quot;Hanny did this. Hannah did that.&quot;"/>
                </list>
                <list key="set_additional_roles"/>
              </operator>
              <operator activated="true" class="append" compatibility="5.3.008" expanded="true" height="94" name="Append" width="90" x="179" y="30"/>
              <operator activated="true" class="nominal_to_text" compatibility="5.3.008" expanded="true" height="76" name="Nominal to Text" width="90" x="313" y="30"/>
              <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
              <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
              <connect from_op="Append" from_port="merged set" to_op="Nominal to Text" to_port="example set input"/>
              <connect from_op="Nominal to Text" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="generate_id" compatibility="5.3.008" expanded="true" height="76" name="Generate ID" width="90" x="179" y="75"/>
          <operator activated="true" class="multiply" compatibility="5.3.008" expanded="true" height="94" name="Multiply" width="90" x="313" y="75"/>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Data" width="90" x="447" y="30">
            <parameter key="create_word_vector" value="false"/>
            <parameter key="keep_text" value="true"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:cut_document" compatibility="5.3.000" expanded="true" height="60" name="Cut Document" width="90" x="246" y="30">
                <parameter key="query_type" value="Regular Expression"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries">
                  <parameter key="text" value="([^\.]+)"/>
                </list>
                <list key="regular_region_queries"/>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
                <list key="index_queries"/>
                <process expanded="true">
                  <connect from_port="segment" to_port="document 1"/>
                  <portSpacing port="source_segment" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <connect from_port="document" to_op="Cut Document" to_port="document"/>
              <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="join" compatibility="5.3.008" expanded="true" height="76" name="Join" width="90" x="581" y="120">
            <list key="key_attributes"/>
          </operator>
          <connect from_op="Generate Data" from_port="out 1" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Join" to_port="right"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Join" to_port="left"/>
          <connect from_op="Join" from_port="join" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
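
    The pattern ([^\.]+) above simply grabs everything between periods. For reference, a slightly more robust sentence-boundary regex, sketched in plain Python (still naive: abbreviations such as "e.g." would break it):

    import re

    text = "He went to USA. I went to Germany! Did Hannah go too?"

    # Split after '.', '!' or '?' followed by whitespace, keeping the
    # punctuation with its sentence.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    for sentence in sentences:
        print(sentence)
    # He went to USA.
    # I went to Germany!
    # Did Hannah go too?

    Inside Cut Document, which matches segments rather than split points, a matching pattern along the lines of [^.!?]+[.!?]? might serve the same purpose, but that is an untested assumption.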