
"[SOLVED] How to Split an Attribute and Keep the Split Character?"

jan_kvacek Member Posts: 4 Contributor I
edited June 2019 in Help
Hello!

I am having trouble splitting attributes in RapidMiner Studio. My attribute looks like this:

"A002W0541G001"

I need to split it into several new attributes:

"A002"  "W0541"  "G001"  and so on.

But Split always drops the character I use to determine where to split the original attribute. Is there any way to keep it?

Thank you for your help!

Jan

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,529 RM Data Scientist
    If it's always 4 chars, 5 chars, 5 chars, you might simply use Generate Attributes with cut?
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • jan_kvacek Member Posts: 4 Contributor I
    Martin Schmitz wrote:

    If it's always 4 chars, 5 chars, 5 chars, you might simply use Generate Attributes with cut?
    Unfortunately it is not. I need to do something like "find a letter, take the letter and all the numbers behind it, and make that a new attribute".
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    It sounds like you just need the right RegEx.
    Assuming you have a pattern of [Letter+Numbers][Letter+Numbers] then this works: "(?<=[0-9]++)(.*?)(?=[A-Z])"
    Positive lookbehind to check there are numbers before, lookahead to check for the letter.  Anything in between is used to split.

    Sample process below:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.1.000-BETA">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.1.000-BETA" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <parameter key="parallelize_main_process" value="false"/>
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="7.1.000-BETA" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="85">
            <list key="attribute_values">
              <parameter key="myData" value="&quot;A002W0541G001&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="7.1.000-BETA" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="112" y="187">
            <list key="attribute_values">
              <parameter key="myData" value="&quot;A02202W0541G001G002231&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="append" compatibility="7.1.000-BETA" expanded="true" height="103" name="Append" width="90" x="313" y="85">
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="merge_type" value="all"/>
          </operator>
          <operator activated="true" class="split" compatibility="7.1.000-BETA" expanded="true" height="82" name="Split" width="90" x="447" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="myData"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="split_pattern" value="(?&lt;=[0-9]++)(.*?)(?=[A-Z])"/>
            <parameter key="split_mode" value="ordered_split"/>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Append" from_port="merged set" to_op="Split" to_port="example set input"/>
          <connect from_op="Split" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
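
For readers who want to try the lookaround idea outside RapidMiner, here is a minimal Python sketch. It is not the process above: it assumes a simplified pattern, `(?<=[0-9])(?=[A-Z])`, because Python's `re` module does not accept a variable-length lookbehind such as `[0-9]++`. The function name `split_keep_letter` is made up for illustration, and splitting on an empty match needs Python 3.7 or later.

```python
import re

# Zero-width boundary: a digit directly on the left, an uppercase letter
# directly on the right. Nothing is consumed, so nothing is dropped.
# (Simplified stand-in for the pattern in the thread, not the same regex.)
BOUNDARY = re.compile(r"(?<=[0-9])(?=[A-Z])")


def split_keep_letter(value):
    """Split at every digit-to-letter boundary, keeping all characters."""
    return BOUNDARY.split(value)


print(split_keep_letter("A002W0541G001"))
# ['A002', 'W0541', 'G001']
print(split_keep_letter("A02202W0541G001G002231"))
# ['A02202', 'W0541', 'G001', 'G002231']
```

The two inputs are the same sample values used in the process above. Because both parts of the pattern are lookarounds, the split point has zero width, which is why the leading letter stays with its numbers.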
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    On a side note... does anyone happen to know the right RegEx to split into n-grams?  I want one that splits a nominal value like "RapidMiner" into "Ra ap pi id dM Mi in ne er".  When I try it I always get "Ra pi dM in er", which isn't right.  I wrote a rather complex loop to do it instead, but would prefer to do it with one operator.
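
One possible approach to the n-gram question, sketched in Python rather than as a RapidMiner operator: put a capturing group inside a lookahead. An ordinary match consumes the characters it matches, which is why the attempt above yields "Ra pi dM in er"; a lookahead consumes nothing, so the next match can start one character later and overlap. The helper name `char_ngrams` is hypothetical, and whether the same pattern can be used inside the Split operator is not something this sketch checks.

```python
import re


def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of text, left to right."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # The lookahead (?=...) matches at a position without consuming it;
    # the inner group captures the next n characters from that position.
    pattern = re.compile(r"(?=(.{%d}))" % n)
    return pattern.findall(text)


print(" ".join(char_ngrams("RapidMiner")))
# Ra ap pi id dM Mi in ne er
```

This extracts matches rather than splitting, so in a tool that only offers a split pattern it may still need a different operator or a loop.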
  • jan_kvacek Member Posts: 4 Contributor I
    JEdward wrote:

    It sounds like you just need the right RegEx.
    Assuming you have a pattern of [Letter+Numbers][Letter+Numbers] then this works: "(?<=[0-9]++)(.*?)(?=[A-Z])"
    Positive lookbehind to check there are numbers before, lookahead to check for the letter.  Anything in between is used to split.
    This just does the thing! Thank you.
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Very nice.  Thanks.  This is something I face often.  Maybe a feature request to simply add a checkbox option to keep the split text instead of removing it?  ;)