How to split an attribute based on a condition on the split pattern?

lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 745   Unicorn
Hi,

I'm extracting the usernames from e-mail addresses, and I want to split each username at the
separator between the first name and the last name (the separator differs from one username to another).

For example, here is the initial dataset:

Username
john.doe
John_Doe

I want to obtain the following dataset:

Username_1    Username_2
john          doe
John          Doe


For this I tried to use the Branch operator, but I encountered an error.

Here is my process:
<?xml version="1.0" encoding="UTF-8"?><process version="9.3.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.3.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="utility:create_exampleset" compatibility="9.3.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="85">
        <parameter key="generator_type" value="comma separated text"/>
        <parameter key="number_of_examples" value="100"/>
        <parameter key="use_stepsize" value="false"/>
        <list key="function_descriptions"/>
        <parameter key="add_id_attribute" value="false"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="input_csv_text" value="Username&#10;john.doe&#10;John_Doe"/>
        <parameter key="column_separator" value=","/>
        <parameter key="parse_all_as_nominal" value="false"/>
        <parameter key="decimal_point_character" value="."/>
        <parameter key="trim_attribute_names" value="true"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.3.000" expanded="true" height="103" name="Multiply (2)" width="90" x="313" y="85"/>
      <operator activated="true" breakpoints="before" class="branch" compatibility="9.3.000" expanded="true" height="103" name="Branch" width="90" x="514" y="85">
        <parameter key="condition_type" value="expression"/>
        <parameter key="condition_value" value="[Username]==john.doe"/>
        <parameter key="expression" value="contains([Username],&quot;.&quot;)==TRUE"/>
        <parameter key="io_object" value="ANOVAMatrix"/>
        <parameter key="return_inner_output" value="true"/>
        <process expanded="true">
          <operator activated="true" class="multiply" compatibility="9.3.000" expanded="true" height="103" name="Multiply (3)" width="90" x="45" y="238"/>
          <operator activated="true" class="select_attributes" compatibility="9.3.000" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="238">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Username"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <operator activated="true" breakpoints="before" class="split" compatibility="9.3.000" expanded="true" height="82" name="Split (2)" width="90" x="179" y="136">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Username"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="split_pattern" value="[.]"/>
            <parameter key="split_mode" value="ordered_split"/>
          </operator>
          <operator activated="true" class="union" compatibility="9.3.000" expanded="true" height="82" name="Union" width="90" x="380" y="136"/>
          <connect from_port="condition" to_port="input 1"/>
          <connect from_port="input 1" to_op="Multiply (3)" to_port="input"/>
          <connect from_op="Multiply (3)" from_port="output 1" to_op="Split (2)" to_port="example set input"/>
          <connect from_op="Multiply (3)" from_port="output 2" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Union" to_port="example set 2"/>
          <connect from_op="Split (2)" from_port="example set output" to_op="Union" to_port="example set 1"/>
          <connect from_op="Union" from_port="union" to_port="input 2"/>
          <portSpacing port="source_condition" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_input 1" spacing="0"/>
          <portSpacing port="sink_input 2" spacing="0"/>
          <portSpacing port="sink_input 3" spacing="0"/>
        </process>
        <process expanded="true">
          <connect from_port="condition" to_port="input 1"/>
          <connect from_port="input 1" to_port="input 2"/>
          <portSpacing port="source_condition" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_input 1" spacing="0"/>
          <portSpacing port="sink_input 2" spacing="0"/>
          <portSpacing port="sink_input 3" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="branch" compatibility="9.3.000" expanded="true" height="103" name="Branch (2)" width="90" x="648" y="85">
        <parameter key="condition_type" value="expression"/>
        <parameter key="condition_value" value="Username==John_doe"/>
        <parameter key="expression" value="contains([Username],&quot;_&quot;)==TRUE"/>
        <parameter key="io_object" value="ANOVAMatrix"/>
        <parameter key="return_inner_output" value="true"/>
        <process expanded="true">
          <operator activated="true" breakpoints="after" class="split" compatibility="9.3.000" expanded="true" height="82" name="Split" width="90" x="179" y="136">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Username"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="split_pattern" value="[_]"/>
            <parameter key="split_mode" value="ordered_split"/>
          </operator>
          <connect from_port="condition" to_port="input 1"/>
          <connect from_port="input 1" to_op="Split" to_port="example set input"/>
          <connect from_op="Split" from_port="example set output" to_port="input 2"/>
          <portSpacing port="source_condition" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_input 1" spacing="0"/>
          <portSpacing port="sink_input 2" spacing="0"/>
          <portSpacing port="sink_input 3" spacing="0"/>
        </process>
        <process expanded="true">
          <connect from_port="condition" to_port="input 1"/>
          <connect from_port="input 1" to_port="input 2"/>
          <portSpacing port="source_condition" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_input 1" spacing="0"/>
          <portSpacing port="sink_input 2" spacing="0"/>
          <portSpacing port="sink_input 3" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Multiply (2)" to_port="input"/>
      <connect from_op="Multiply (2)" from_port="output 1" to_op="Branch" to_port="condition"/>
      <connect from_op="Multiply (2)" from_port="output 2" to_op="Branch" to_port="input 1"/>
      <connect from_op="Branch" from_port="input 1" to_op="Branch (2)" to_port="condition"/>
      <connect from_op="Branch" from_port="input 2" to_op="Branch (2)" to_port="input 1"/>
      <connect from_op="Branch (2)" from_port="input 2" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Can you help me?

Regards,

Lionel




Best Answers

  • kayman Posts: 357   Unicorn
    Solution Accepted
    Seems more like a bug with the branch operator, as it should recognize the attribute to start with.

    As for your issue, why don't you just replace all known separator symbols with an underscore using a regex? I'd assume there are not that many, apart from the dot, that are generally used in email addresses. The split would then be on the underscore for every username.
  • Telcontar120 Posts: 1,210   Unicorn
    Solution Accepted
    I think the solution from @kayman is the easiest: since there are only a few common email separators, such as ".", "-", and "_", they can easily be replaced by a single one, which is then used for the split.
     
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
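
The normalize-then-split idea from the two answers above can be sketched outside RapidMiner as follows. This is a minimal Python illustration under stated assumptions, not the RapidMiner process itself: the separator set (".", "-", "_") and the function name `split_username` are chosen for the example only.

```python
import re

# Assumed set of "other" separators to normalize to an underscore.
# Extend the character class if your data uses more of them.
OTHER_SEPARATORS = re.compile(r"[.\-]")

def split_username(username):
    """Replace every known separator with "_" and then split on "_"."""
    normalized = OTHER_SEPARATORS.sub("_", username)
    return normalized.split("_")

for name in ["john.doe", "John_Doe"]:
    print(name, "->", split_username(name))
```

In RapidMiner the same two steps would presumably be a regex replacement on Username followed by a single Split with the pattern "_", so no Branch operator is needed.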

Answers

  • jacobcybulski Member, University Professor Posts: 83   Unicorn
    edited June 11
    Try this, which is a bit simpler: a sequence of two attribute generators based on a regular expression, one matching the first component and one matching the second, i.e.
    • replaceAll(name,"^([a-z0-9]+)[-_+]([a-z0-9]+)$","$1")
    • replaceAll(name,"^([a-z0-9]+)[-_+]([a-z0-9]+)$","$2")
    You can adjust the regular expression to put any separators in the middle.
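
The capture-group approach above can be tried out in a few lines of Python. This is only a sketch: the pattern is the one quoted above with two adjustments that are assumptions for the john.doe / John_Doe examples (a "." added to the separator class and case-insensitive matching), and `name_parts` is a hypothetical helper name.

```python
import re

# Two name parts separated by one of "-", "_", "+" or ".".
# The "." and the IGNORECASE flag are additions for this example;
# they are not part of the pattern quoted in the answer.
PATTERN = re.compile(r"^([a-z0-9]+)[-_+.]([a-z0-9]+)$", re.IGNORECASE)

def name_parts(username):
    """Return (first, second), or None when the username does not match."""
    match = PATTERN.match(username)
    if match is None:
        return None
    return match.group(1), match.group(2)

for name in ["john.doe", "John_Doe", "johndoe"]:
    print(name, "->", name_parts(name))
```

A regex replacement normally leaves a non-matching string unchanged, so it is worth checking how usernames without a recognized separator should be handled; the sketch returns None for them.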


  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,383  Community Manager
    Yes, I would concur with @kayman and @Telcontar120; this is exactly how I would approach this problem: split using a regex.

    Scott
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 745   Unicorn
    Dear all,

    Thank you for your contributions.
    Indeed, @kayman's solution gives good results on my original dataset and solves this problem.
    Once again, thank you for spending time on this problem.

    Regards,

    Lionel

