
Splitting attributes & Getting specific records

pjm Member Posts: 3 Contributor I
edited November 2018 in Help

Hi

I'm trying to break an attribute down into two pieces: a main part and a remainder,

e.g. Fruit | 2KG | £2.00

So I want to break off "Fruit" from the rest and keep the remainder in another attribute.

 

I'm also working on a 50k-record dataset and want to extract 1k specific ID numbers I have in mind.

 

Thanks for the help. First-time user.

Best Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Solution Accepted
    Another approach is to use Generate Attributes where you take the prefix up to the first space:

    att2 = prefix(att1,index(att1," "))

    If you want the remainder in another attribute:

    att3 = suffix(att1,length(att1)-length(att2))

    Sometimes you can be off by one character so just add/subtract 1 as needed. I use this more than Split as it gives me a lot more customization.

    Scott
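
For anyone who wants to check the arithmetic outside RapidMiner, here is a minimal Python sketch of the same prefix/suffix idea. The helpers `prefix`, `suffix`, and `split_at_first_space` are hypothetical stand-ins written for this illustration (they assume a zero-based index and a character count, as the expressions above suggest); they are not RapidMiner functions.

```python
def prefix(text, n):
    """Return the first n characters of text."""
    return text[:n]


def suffix(text, n):
    """Return the last n characters of text."""
    return text[len(text) - n:] if n > 0 else ""


def split_at_first_space(att1):
    """Mimic att2 = prefix(...) and att3 = suffix(...) from the answer above."""
    idx = att1.find(" ")  # zero-based position of the first space, -1 if none
    if idx == -1:
        return att1, ""   # no space: everything is the main part
    att2 = prefix(att1, idx)
    att3 = suffix(att1, len(att1) - len(att2))
    return att2, att3


main, rest = split_at_first_space("Fruit | 2KG | £2.00")
```

In this sketch the remainder still starts with the space that was used as the cut point, which is the kind of off-by-one Scott mentions; asking `suffix` for one character fewer drops it.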
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Solution Accepted
    Oh that's much easier. Just use "Filter Examples", select "single" and your ID attribute, select the "include special attributes" checkbox, and under custom filter just make two entries: ID > 94000 and another that is ID < 149000. Make sure the "and" button at the bottom is selected.

    You can also use the Filter Example Range operator which is slightly easier but will only filter by example number which may or may not be the same as your IDs.

    Scott
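
The two-condition filter can also be sketched in plain Python, which may help when checking which boundary IDs survive. The column names and the sample rows below are made up for illustration, and `filter_by_id_range` is a hypothetical helper, not a RapidMiner operator.

```python
import csv
import io

# Illustrative stand-in for the real CSV file; column names are assumptions.
SAMPLE_CSV = """id,item
93999,apple
94000,banana
120500,cherry
149000,date
150001,elderberry
"""


def filter_by_id_range(rows, low, high):
    """Keep rows whose id satisfies low < id < high (both conditions, 'and')."""
    return [row for row in rows if low < int(row["id"]) < high]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
subset = filter_by_id_range(rows, 94000, 149000)
kept_ids = [int(row["id"]) for row in subset]
```

Both comparisons are strict here, mirroring ID > 94000 and ID < 149000, so rows sitting exactly on a bound are dropped. If the wanted IDs run slightly past 149,000, the upper bound has to be raised accordingly.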

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,508 RM Data Scientist

    Hi pjm,

     

    welcome to the community!

     

    The Split operator should do the job. I have attached a demo process for it.

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • pjm Member Posts: 3 Contributor I
    Thanks. It's looking for something in the XPath variables for Read XML. What I'm looking to do is have something like: Adam| Benji | Colin. Then set Adam as the main value after the split and put the other two in a separate variable (sub or something). I tried this pattern in the Split operator: .*| but it results in: vara: A, varb: d, varc: a, vard: m
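
A probable cause of the character-by-character result is that an unescaped `|` is the regex alternation operator, so `.*|` can match the empty string between characters instead of the literal pipe. Escaping it as `\|` makes the pattern match the delimiter itself. The Python sketch below shows the escaped pattern at work; `split_main_and_rest` is a hypothetical helper for this illustration, and the pattern would still need to be tried in the Split operator.

```python
import re


def split_main_and_rest(value, delimiter=r"\s*\|\s*"):
    """Split on the first literal pipe, trimming whitespace around it."""
    # maxsplit=1 keeps everything after the first pipe together as the remainder.
    parts = re.split(delimiter, value.strip(), maxsplit=1)
    main = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    return main, rest


main, rest = split_main_and_rest("Adam| Benji | Colin")
```

With this input the sketch yields `Adam` as the main value and `Benji | Colin` as the remainder.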
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Not sure what you mean by the ID numbers - are the 1k IDs randomized among the 50k examples? I use the Generate ID operator sometimes but not sure this is what you're looking for.

    Scott
  • pjm Member Posts: 3 Contributor I

    Thanks for the help on Generate Attributes; I think this could help me a lot with that problem.

    As for the IDs, the 1,000 I need run from 94,000 to just over 149,000,

    but none of the other IDs fall in that range,

    so I'm looking for a subset of the CSV file that contains only the records in that range.

    thx

  • jason_xie Member Posts: 4 Contributor I

    Scott, 

     

    Your answer was really helpful. But what would you do if you wanted to split at the third space?

     

    For example I have a column that has content like Nov 14 2016 12:50 AM, I want to split the date and time into 2 columns. 

     

    Thanks!

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hi @jason_xie - for that I would use a nice RegEx in the Split operator:

     

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="8.0.000-BETA">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="8.0.000-BETA" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="8.0.000-BETA" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="313" y="85">
            <list key="attribute_values">
              <parameter key="text" value="&quot;Nov 14 2016 12:50 AM&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="split" compatibility="8.0.000-BETA" expanded="true" height="82" name="Split" width="90" x="581" y="85">
            <parameter key="split_pattern" value="(?&lt;=20[0-9][0-9])\s"/>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Split" to_port="example set input"/>
          <connect from_op="Split" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    Scott
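
The split pattern in the process above, unescaped from XML, is `(?<=20[0-9][0-9])\s`: a whitespace character that is immediately preceded by a four-digit year starting with 20. The short Python sketch below applies the same pattern to the sample value; `split_date_time` is a hypothetical helper name, and the sketch only illustrates the regex, not the Split operator itself.

```python
import re

# Same pattern as split_pattern in the process, with &lt; written as <.
SPLIT_PATTERN = r"(?<=20[0-9][0-9])\s"


def split_date_time(text):
    """Split at the whitespace that directly follows a year of the form 20xx."""
    return re.split(SPLIT_PATTERN, text)


parts = split_date_time("Nov 14 2016 12:50 AM")
```

Because the lookbehind is tied to `20[0-9][0-9]`, the pattern only fires after years 2000 to 2099; dates outside that range would pass through unsplit.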

  • jason_xie Member Posts: 4 Contributor I

    Thanks! I ended up adding values to the index() output in the prefix() expression to adjust the space cutoffs. 
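
The exact expression was not posted, so the following is only a guess at the idea: locate the third space and cut there. It is a minimal Python sketch, and `split_at_nth_space` is a hypothetical helper, not a RapidMiner function.

```python
def split_at_nth_space(text, n):
    """Split text at its n-th space; return (text, "") if there are fewer spaces."""
    idx = -1
    for _ in range(n):
        idx = text.find(" ", idx + 1)  # search after the previous space
        if idx == -1:
            return text, ""
    # Everything before the n-th space, and everything after it.
    return text[:idx], text[idx + 1:]


date_part, time_part = split_at_nth_space("Nov 14 2016 12:50 AM", 3)
```

Unlike the year-based regex, this does not depend on the year starting with 20, but it does assume the date always contains exactly two spaces before the time.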
