Sampling or filtering transactions down to last X per customer

PaulMPaulM Member Posts: 12 Contributor II
edited March 25 in Help
I have a large exampleset with every single transaction completed by each customer over a long time frame. I want to filter this down to the last X transactions that each customer did during their lifetime (e.g. up to the last 5 - if they did fewer than this then at least the ones they did). 

I really don't want to have to loop through customer by customer as performance will be dire. Is there a smarter/simpler way of doing this? Each transaction has a timestamp and a client id.

Thanks in advance for any suggestions. 

Answers

  • PaulMPaulM Member Posts: 12 Contributor II
    Thanks @sgenzer but won't that just get me the last X transactions from all clients rather than the last X transactions done by each client?

    I have a process that loops through each client ID, filters the exampleset down to just the transactions for that client and does a filter example range for each one but as you can imagine performance is pretty darn bad.

    Any other possible workarounds? (e.g. a smart way of numbering the transaction sequence per client and then just filtering the whole lot for sequence numbers <= X)?
  • mschmitzmschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,321  RM Data Scientist
    Hi @PaulM ,
    this is actually the use case we've build Group Into Collection for. You can first Group by Client ID and then use Loop Collection + Sort and FIlterExamples Range to do this.
    If you need to do it in parallel, you can use Extract Macro (Format) to get the size of the collection and then the Normal Loop with Select to process it in parallel manner.

    Attached is an example showing how to get the 3 lowest passenger fairs for each Passenger Class of the titanic data set.

    Best,
    Martin


    <?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.6.000" expanded="true" height="68" name="Retrieve Titanic" width="90" x="112" y="34">
            <parameter key="repository_entry" value="//Samples/data/Titanic"/>
          </operator>
          <operator activated="true" class="operator_toolbox:group_into_collection" compatibility="2.4.000-SNAPSHOT" expanded="true" height="82" name="Group Into Collection" width="90" x="246" y="34">
            <parameter key="group_by_attribute" value="Passenger Class"/>
            <parameter key="group_by_attribute (numerical)" value=""/>
            <parameter key="sorting_order" value="none"/>
          </operator>
          <operator activated="true" class="loop_collection" compatibility="9.6.000" expanded="true" height="82" name="Loop Collection" width="90" x="380" y="34">
            <parameter key="set_iteration_macro" value="false"/>
            <parameter key="macro_name" value="iteration"/>
            <parameter key="macro_start_value" value="1"/>
            <parameter key="unfold" value="false"/>
            <process expanded="true">
              <operator activated="true" class="sort" compatibility="9.6.000" expanded="true" height="82" name="Sort" width="90" x="45" y="34">
                <parameter key="attribute_name" value="Passenger Fare"/>
                <parameter key="sorting_direction" value="decreasing"/>
              </operator>
              <operator activated="true" class="filter_example_range" compatibility="9.6.000" expanded="true" height="82" name="Filter Example Range" width="90" x="179" y="34">
                <parameter key="first_example" value="1"/>
                <parameter key="last_example" value="3"/>
                <parameter key="invert_filter" value="false"/>
              </operator>
              <connect from_port="single" to_op="Sort" to_port="example set input"/>
              <connect from_op="Sort" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
              <connect from_op="Filter Example Range" from_port="example set output" to_port="output 1"/>
              <portSpacing port="source_single" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve Titanic" from_port="output" to_op="Group Into Collection" to_port="exa"/>
          <connect from_op="Group Into Collection" from_port="col" to_op="Loop Collection" to_port="collection"/>
          <connect from_op="Loop Collection" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>


    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
    PaulMsgenzer
  • PaulMPaulM Member Posts: 12 Contributor II
    Many thanks @mschmitz, sounds like a good solution. My client does not have 9.6 installed yet so I can't test the group_into_collection operator against their datasets just yet, but as soon as we get that sorted out I'll give it a go. 

  • mschmitzmschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,321  RM Data Scientist
    Hi @PaulM ,
    group into collection does not requiere 9.6. It is part of Operator Toolbox extension. I think the current requierement is 8.2.
    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • PaulMPaulM Member Posts: 12 Contributor II
    Hi @mschmitz,

    Ah ok, good to know - I misinterpreted your process XML. Client is rather strict about installation of extensions as well and that one is not currently installed - will have to go through their process to get it so unfortunately there will be a lag before I can test it.

    Thanks again.
    Paul

  • sgenzersgenzer 12Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,873  Community Manager
    hi @PaulM yes sorry I posted a quick solution, realized that it did not answer your question, and then deleted my post. :smile: And of course @mschmitz came up with a better solution anyway.
Sign In or Register to comment.