RapidMiner 9.7 is Now Available

Lots of amazing new improvements including true version control! Learn more about what's new here.

CLICK HERE TO DOWNLOAD

How to create new examples by spliiting at punctuation marks?

chrisniemchrisniem Member Posts: 6 Contributor II
edited November 2018 in Help
Hi all!

I wonder if it is possible to split an example containing text by punctuation marks. I have an exampleset containing some metadata for a text attribute. The text attribute contains many sentences. Here are 2 examples as demonstration:

2012-05-04          Source1          Speaker1          Context1          "The unsettling prospects come at a time of growing uncertainty for the country’s    economy. With evidence mounting of a slowdown in the economic recovery, this new blow from the weather is particularly ill-timed."
2012-05-06          Source2          Speaker2          Context2          "Already some farmers are watching their cash crops burn to the point of no return. Others have been cutting their corn early to use for feed, a much less profitable venture."

What I want to do is to split the text attribute by e.g. "." while keeping the metadata for every sentence. The result would be 4 examples:

2012-05-04          Source1          Speaker1          Context1          "The unsettling prospects come at a time of growing uncertainty for the country’s    economy."
2012-05-04          Source1          Speaker1          Context1          "With evidence mounting of a slowdown in the economic recovery, this new blow from the weather is particularly ill-timed."
2012-05-06          Source2          Speaker2          Context2          "Already some farmers are watching their cash crops burn to the point of no return."
2012-05-06          Source2          Speaker2          Context2          "Others have been cutting their corn early to use for feed, a much less profitable venture."

Is there any way to do this? I tried to use tokenization, but it delivers only vectors (i.e. new attributes) but not new examples. If switch off vectorization I can not see any difference in the result set apart from "." beeing deleted in the text attribute.

Any help is very appreciated!

Thanks

Chris

Answers

  • MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869   Unicorn
    Hi Chris,

    you can use e.g. Cut Documents for this. You may have to tune the regular expression a bit, but the process below depicts the general idea.

    Best,
      ~Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <process expanded="true" height="505" width="721">
          <operator activated="true" class="generate_data_user_specification" compatibility="5.2.008" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="45" y="120">
            <list key="attribute_values">
              <parameter key="meta" value="false"/>
              <parameter key="text" value="&quot;This is also a test. With two sentences.&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="5.2.008" expanded="true" height="60" name="Generate Data by User Specification (2)" width="90" x="45" y="30">
            <list key="attribute_values">
              <parameter key="meta" value="true"/>
              <parameter key="text" value="&quot;Test. Sentence. Blubb.&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="append" compatibility="5.2.008" expanded="true" height="94" name="Append" width="90" x="179" y="30"/>
          <operator activated="true" class="nominal_to_text" compatibility="5.2.008" expanded="true" height="76" name="Nominal to Text" width="90" x="313" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="text"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="447" y="30">
            <parameter key="keep_text" value="true"/>
            <list key="specify_weights"/>
            <process expanded="true" height="505" width="658">
              <operator activated="true" class="text:cut_document" compatibility="5.2.004" expanded="true" height="60" name="Cut Document" width="90" x="112" y="30">
                <parameter key="query_type" value="Regular Expression"/>
                <list key="string_machting_queries">
                  <parameter key="t" value="\..\."/>
                </list>
                <list key="regular_expression_queries">
                  <parameter key="t" value="([^\.]+)"/>
                </list>
                <list key="regular_region_queries"/>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
                <list key="index_queries"/>
                <process expanded="true" height="523" width="658">
                  <connect from_port="segment" to_port="document 1"/>
                  <portSpacing port="source_segment" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <connect from_port="document" to_op="Cut Document" to_port="document"/>
              <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="5.2.008" expanded="true" height="76" name="Select Attributes" width="90" x="581" y="30">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="|meta|text"/>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • chrisniemchrisniem Member Posts: 6 Contributor II
    Hi Marius,

    great, that will do it!

    Thanks a lot!

    Chris
Sign In or Register to comment.