How to create new examples by splitting at punctuation marks?

chrisniem Member Posts: 6 Contributor II
edited November 2018 in Help
Hi all!

I wonder if it is possible to split an example containing text at punctuation marks. I have an example set with some metadata attributes and one text attribute. The text attribute contains many sentences. Here are two examples as a demonstration:

2012-05-04          Source1          Speaker1          Context1          "The unsettling prospects come at a time of growing uncertainty for the country’s economy. With evidence mounting of a slowdown in the economic recovery, this new blow from the weather is particularly ill-timed."
2012-05-06          Source2          Speaker2          Context2          "Already some farmers are watching their cash crops burn to the point of no return. Others have been cutting their corn early to use for feed, a much less profitable venture."

What I want to do is to split the text attribute by e.g. "." while keeping the metadata for every sentence. The result would be 4 examples:

2012-05-04          Source1          Speaker1          Context1          "The unsettling prospects come at a time of growing uncertainty for the country’s economy."
2012-05-04          Source1          Speaker1          Context1          "With evidence mounting of a slowdown in the economic recovery, this new blow from the weather is particularly ill-timed."
2012-05-06          Source2          Speaker2          Context2          "Already some farmers are watching their cash crops burn to the point of no return."
2012-05-06          Source2          Speaker2          Context2          "Others have been cutting their corn early to use for feed, a much less profitable venture."
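
To make the intended transformation concrete, here is a minimal sketch in plain Python, outside RapidMiner. The function name `split_into_sentences` and the tuple layout are assumptions made for illustration only. The sketch splits after every period that is followed by whitespace, so abbreviations such as "e.g." would be split incorrectly.

```python
import re

def split_into_sentences(rows, text_index=-1):
    """Return one row per sentence, repeating the other fields unchanged.

    Hypothetical helper for illustration; not part of RapidMiner.
    """
    result = []
    for row in rows:
        text = row[text_index].strip()
        # Split on whitespace that follows a period; the period stays
        # attached to the sentence it ends.
        for sentence in re.split(r"(?<=\.)\s+", text):
            if sentence:
                new_row = list(row)
                new_row[text_index] = sentence
                result.append(tuple(new_row))
    return result

rows = [
    ("2012-05-04", "Source1", "Speaker1", "Context1",
     "The unsettling prospects come at a time of growing uncertainty "
     "for the country's economy. With evidence mounting of a slowdown "
     "in the economic recovery, this new blow from the weather is "
     "particularly ill-timed."),
    ("2012-05-06", "Source2", "Speaker2", "Context2",
     "Already some farmers are watching their cash crops burn to the "
     "point of no return. Others have been cutting their corn early to "
     "use for feed, a much less profitable venture."),
]

for row in split_into_sentences(rows):
    print(row)
```

Each input row yields two output rows here, and every output row carries the date, source, speaker, and context of the row it came from.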

Is there any way to do this? I tried tokenization, but it only delivers vectors (i.e. new attributes), not new examples. If I switch off vectorization, I cannot see any difference in the result set apart from the "." being deleted from the text attribute.

Any help is much appreciated!

Thanks

Chris

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Chris,

    You can use, e.g., Cut Document for this. You may have to tune the regular expression a bit, but the process below shows the general idea.

    Best,
      ~Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <process expanded="true" height="505" width="721">
          <operator activated="true" class="generate_data_user_specification" compatibility="5.2.008" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="45" y="120">
            <list key="attribute_values">
              <parameter key="meta" value="false"/>
              <parameter key="text" value="&quot;This is also a test. With two sentences.&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="5.2.008" expanded="true" height="60" name="Generate Data by User Specification (2)" width="90" x="45" y="30">
            <list key="attribute_values">
              <parameter key="meta" value="true"/>
              <parameter key="text" value="&quot;Test. Sentence. Blubb.&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="append" compatibility="5.2.008" expanded="true" height="94" name="Append" width="90" x="179" y="30"/>
          <operator activated="true" class="nominal_to_text" compatibility="5.2.008" expanded="true" height="76" name="Nominal to Text" width="90" x="313" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="text"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="447" y="30">
            <parameter key="keep_text" value="true"/>
            <list key="specify_weights"/>
            <process expanded="true" height="505" width="658">
              <operator activated="true" class="text:cut_document" compatibility="5.2.004" expanded="true" height="60" name="Cut Document" width="90" x="112" y="30">
                <parameter key="query_type" value="Regular Expression"/>
                <list key="string_machting_queries">
                  <parameter key="t" value="\..\."/>
                </list>
                <list key="regular_expression_queries">
                  <parameter key="t" value="([^\.]+)"/>
                </list>
                <list key="regular_region_queries"/>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
                <list key="index_queries"/>
                <process expanded="true" height="523" width="658">
                  <connect from_port="segment" to_port="document 1"/>
                  <portSpacing port="source_segment" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <connect from_port="document" to_op="Cut Document" to_port="document"/>
              <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="5.2.008" expanded="true" height="76" name="Select Attributes" width="90" x="581" y="30">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="|meta|text"/>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • chrisniem Member Posts: 6 Contributor II
    Hi Marius,

    great, that will do it!

    Thanks a lot!

    Chris
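
For readers wondering what "tune the regular expression a bit" might involve: the pattern `([^\.]+)` used in the Cut Document operator above captures every run of characters that contains no period, so the period itself is dropped and later segments keep their leading space. The sketch below reproduces this with Python's `re` module as a stand-in for the regex engine inside RapidMiner. The `TUNED` pattern is a hypothetical alternative and has not been verified inside Cut Document.

```python
import re

# Pattern taken from the process above.
ORIGINAL = r"([^\.]+)"

# Hypothetical tuned pattern (an assumption, not from the thread):
# start at a character that is neither whitespace nor a period, take
# everything up to the next period, and include that period if present.
TUNED = r"[^.\s][^.]*\.?"

def segments(pattern, text):
    """Return all non-overlapping matches of pattern in text."""
    return re.findall(pattern, text)

text = "Test. Sentence. Blubb."
print(segments(ORIGINAL, text))  # ['Test', ' Sentence', ' Blubb']
print(segments(TUNED, text))     # ['Test.', 'Sentence.', 'Blubb.']
```

Neither pattern handles abbreviations or decimal numbers, so real text will likely need further adjustment.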