Looking for Guidance on Pattern Analysis

jasonaz480 Member Posts: 4 Contributor I
edited November 2018 in Help

I'm just getting my feet wet with RM. I was previously doing basic analysis by writing DB queries and scripts, but it is taxing on me and the server to do it that way. I will admit that it is also a little taxing to collate all the data from relational tables into big flat files with a ton of columns, but I'm sure part of that is just my inexperience with the best way to set up the data for the processes.

 

What I am trying to do is an analysis of N attributes that essentially shows which patterns of behavior result in the highest client spending. What I did was create a CSV with all binomials (0 = no / 1 = yes) for various actions the client took and one column for the client's lifetime spend with us, e.g.:

Took Action 1?, Took Action 2?, Took Action 3?, ....., Took Action 15?, Total Spent

 

I was able to use the Correlation Matrix to find the correlation between individual actions and the amount spent. I could already do that with some DB queries, but it was nice to see them all compared against each other. What I was really after, though, was combinations of, say, at least 2 or 3 attributes, and ideally not just a correlation weight but an actual average of the target attribute (amount spent), e.g.: Action 1 + Action 3 + Action 12 = avg $
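
To give an idea, this is roughly what I have been doing with queries and scripts, translated into a minimal pandas sketch (the file name is just a placeholder and the column names follow the CSV layout above):

import pandas as pd

df = pd.read_csv("client_actions.csv")  # placeholder name for the flat CSV described above

# Correlation of each "Took Action N?" flag with lifetime spend
print(df.corr(numeric_only=True)["Total Spent"].sort_values(ascending=False))

# Average spend for clients who took Action 1, Action 3 and Action 12 (other actions ignored)
mask = (df["Took Action 1?"] == 1) & (df["Took Action 3?"] == 1) & (df["Took Action 12?"] == 1)
print(df.loc[mask, "Total Spent"].mean())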

 

Even that is not too bad to do my old way by writing code, but RM seems much faster, and I'd also like to incorporate non-binomial attributes, like number of purchases and client age, and eventually create segments on those to add to the model. I don't mean to digress there...
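
For those segments I'm picturing something like this rough pandas sketch ("Age" and "Number of Purchases" are hypothetical columns here; I gather RapidMiner's Discretize operators do the same job inside a process):

import pandas as pd

df = pd.read_csv("client_actions.csv")  # placeholder file name

# Bucket hypothetical numeric attributes into segments
df["Age Segment"] = pd.cut(df["Age"], bins=[0, 25, 40, 60, 120],
                           labels=["<25", "25-39", "40-59", "60+"])
df["Purchase Segment"] = pd.qcut(df["Number of Purchases"], q=4,
                                 labels=["low", "mid-low", "mid-high", "high"])

# Average lifetime spend per age segment
print(df.groupby("Age Segment", observed=True)["Total Spent"].mean())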

 

So I guess my question is: what operator(s) should I be looking at that combine the attributes to find the highest-value patterns? Also, I've read the help on the various operators and browsed the manual, a couple of other resources, and YouTube, but I feel like I don't know how to unleash the really powerful analysis that's possible, because I'm at that spot in the journey where you don't know what you don't know. Is there a good resource on the operators that walks through many different use cases with specific examples, especially geared towards client scoring/prediction/forecasting?

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi jasonaz480,

     

    Nice to have you in the community.

     

    Did you have a look at our getting started page at https://rapidminer.com/getting-started-central/? In general you want to build a model predicting your amount spent. Afterwards you can either use this model directly or do some feature weighting to figure out how important each attribute is.

    I recently wrote a longer article in our KB - http://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/Feature-Weighting-Tutorial/ta-p/35281 - just have a look.
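
    If it helps to see the idea outside of Studio first, here is a minimal scikit-learn sketch of "predict the spend, then look at the attribute weights" (this is not the RapidMiner operators themselves, and the file and column names are placeholders):

    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    df = pd.read_csv("client_actions.csv")          # placeholder for your flat file
    X = df.drop(columns=["Total Spent"])             # the action flags
    y = df["Total Spent"]                            # the target

    model = GradientBoostingRegressor().fit(X, y)

    # Feature importances play the role of the attribute weights discussed in the KB article
    weights = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
    print(weights)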

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • jasonaz480 Member Posts: 4 Contributor I

    Thank you for the reply. I have gone through most of the in-program tutorials and have been looking at other one-off tutorials that seem relevant to my needs. I have a couple of in-program ones left that I haven't finished, but at first glance they don't seem to cover what I'm after. The software is amazing, albeit overwhelming for someone like myself who, while a developer, doesn't have a background in stats/analysis beyond basic business logic. I guess I should say the tough part is more the maths and knowing which models and algorithms to select for different purposes; the software makes it way easier to dig in, so I'd give it two thumbs up. I couldn't imagine getting as far as I have by writing code or using a CLI.

     

    I will definitely check out the weighting tutorial you referenced. I was able to get similar weights from the Correlation Matrix I ran on my data, and that was useful.

     

    But all of the tutorials, including the matrix results, are driven by single attributes. What I am trying to do is take combinations of all attributes and determine an average spend amount for each (a weight or correlation value in addition to that would be nice for reference). After some more searching it seemed like "De-Pivot" might be the operator I am looking for, but upon trying it I wasn't able to figure out how to apply it to what I am trying to do.

     

    I will try to explain better by giving a sample of my data/schema:

    action 1, action 2, action 3, ..., action 10, spent

    0,1,1,...,0,756.50

    0,0,1,...,0,0

    1,1,0,...,1,28.12

     

    My schema has about 10 attributes, and I want to find the average spent for every possible combination of them (with 10 attributes that would be 2^10 = 1,024 resulting rows, counting the "none" case). For the sake of simplicity, though, I will show the output I'm looking for if there were just 3 attributes (3 binomial fields representing various customer actions).

    Actions Taken,Average Spent

    Action 1,38.75

    Action 1 & Action 2,28.54

    Action 1 & Action 3,24.14

    Action 1 & Action 2 & Action 3,28.54

    Action 2,18.25

    Action 2 & Action 3,24.14

    Action 3,18.25

    None,3.87

     

    I know how to write plain Python code to parse the results, but it is time-consuming to do one-off reports that way, and I figure this is probably an elementary task for RM.

     

    The only thing I should clarify is that I typically write these reports in two ways... "Any" and "Only".

    Using just a 3-attribute set (well, 4 including the "spent" target), here is what I mean:

    Header: Action 1, Action 2, Action 3, Spent

    Row 1: 0,1,1,20

    Row 2: 0,1,0,30

    Row 3: 1,1,1,10

     

    Let's say I am trying to find "Action 2"...

    In an "only" report it would only calculate row 2 for an average of $30.

    In an "any" report it would calculate all 3 rows for an average of $20

    Likewise when calculating "Action 2 & Action 3"...

    In an "only" report it would only calculate row 1 for an average of $20

    In an "any" report it would calculate row 1 and row 3 for an average of $15

     

    I apologize for the long post, but I was hoping that being very specific would help explain what I am after. I'm sure there must be a way to combine the attributes in all possible combos and get an average from a target column, but I can't seem to figure it out.
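
    To make it concrete, this is roughly the Python I would otherwise write by hand for the tiny 3-row sample above (just a sketch of the "any" vs. "only" logic, not a RapidMiner process):

    from itertools import combinations

    import pandas as pd

    # The 3-row sample from above
    df = pd.DataFrame({
        "Action 1": [0, 0, 1],
        "Action 2": [1, 1, 1],
        "Action 3": [1, 0, 1],
        "Spent":    [20, 30, 10],
    })

    actions = ["Action 1", "Action 2", "Action 3"]
    rows = []
    for k in range(len(actions) + 1):               # subset sizes 0..n; the empty subset is "None"
        for combo in combinations(actions, k):
            taken = list(combo)
            others = [a for a in actions if a not in combo]

            # "any": every action in the combo was taken; other actions do not matter
            any_mask = (df[taken] == 1).all(axis=1) if taken else pd.Series(True, index=df.index)
            # "only": the combo was taken and nothing else
            only_mask = any_mask & (df[others] == 0).all(axis=1) if others else any_mask

            rows.append({
                "Actions Taken": " & ".join(taken) or "None",
                "Avg Spent (any)": df.loc[any_mask, "Spent"].mean(),
                "Avg Spent (only)": df.loc[only_mask, "Spent"].mean(),
            })

    print(pd.DataFrame(rows))

    For that sample this reproduces the figures above: Action 2 averages $20 ("any") and $30 ("only"), and Action 2 & Action 3 averages $15 ("any") and $20 ("only").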

     

    Thanks again,

    Jay

     

  • jasonaz480 Member Posts: 4 Contributor I

    A little more Googling and it looks like Loop Attribute Subsets is the operator I'm looking for.

     

    I apply it after my dataset, but it does not modify the output data (as confirmed by the operator description). Instead, I can find the resulting attribute combination names in an extra result window called "Log", but without these coming through in the output I'm not sure how to chain further operators to manipulate the data the way I'm trying to.

     

    It seems like it should be a very trivial task to find the average value of a column over the rows that match the various combinations, but I'm kind of stumped. I'll keep plugging away though. :)

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi Jason,

     

    I've built a process using Cartesian Product that does the job. Not the nicest solution of all, but it works. Not sure if there is a more elegant way to do this.

     

    ~Martin

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="subprocess" compatibility="7.3.001" expanded="true" height="82" name="Subprocess" width="90" x="45" y="34">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.3.001" expanded="true" height="68" name="Retrieve Sonar" width="90" x="45" y="34">
    <parameter key="repository_entry" value="//Samples/data/Sonar"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.3.001" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="attribute_14|attribute_13|attribute_12|attribute_11|attribute_10"/>
    </operator>
    <connect from_op="Retrieve Sonar" from_port="output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="aggregate" compatibility="7.3.001" expanded="true" height="82" name="Aggregate" width="90" x="179" y="34">
    <parameter key="use_default_aggregation" value="true"/>
    <list key="aggregation_attributes"/>
    </operator>
    <operator activated="true" class="transpose" compatibility="7.3.001" expanded="true" height="82" name="Transpose" width="90" x="313" y="34"/>
    <operator activated="true" class="set_role" compatibility="7.3.001" expanded="true" height="82" name="Set Role" width="90" x="447" y="34">
    <parameter key="attribute_name" value="id"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.3.001" expanded="true" height="145" name="Multiply" width="90" x="581" y="238"/>
    <operator activated="true" class="subprocess" compatibility="7.3.001" expanded="true" height="103" name="Subprocess (2)" width="90" x="715" y="34">
    <process expanded="true">
    <operator activated="true" class="cartesian_product" compatibility="7.3.001" expanded="true" height="82" name="Cartesian" width="90" x="45" y="34">
    <parameter key="remove_double_attributes" value="false"/>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="7.3.001" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="34">
    <parameter key="parameter_expression" value="!contains(id,id_from_ES2)"/>
    <parameter key="condition_class" value="expression"/>
    <list key="filters_list"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.3.001" expanded="true" height="82" name="Generate Attributes" width="90" x="447" y="34">
    <list key="function_descriptions">
    <parameter key="id" value="concat(id,&quot;+&quot;,id_from_ES2)"/>
    <parameter key="att_1" value="att_1+att_1_from_ES2"/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.3.001" expanded="true" height="82" name="Select Attributes (2)" width="90" x="581" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="att_1|id"/>
    </operator>
    <connect from_port="in 1" to_op="Cartesian" to_port="left"/>
    <connect from_port="in 2" to_op="Cartesian" to_port="right"/>
    <connect from_op="Cartesian" from_port="join" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
    <connect from_op="Select Attributes (2)" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="source_in 3" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.3.001" expanded="true" height="103" name="Multiply (2)" width="90" x="849" y="34"/>
    <operator activated="true" class="subprocess" compatibility="7.3.001" expanded="true" height="103" name="Subprocess (3)" width="90" x="983" y="136">
    <process expanded="true">
    <operator activated="true" class="cartesian_product" compatibility="7.3.001" expanded="true" height="82" name="Cartesian (2)" width="90" x="45" y="34">
    <parameter key="remove_double_attributes" value="false"/>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="7.3.001" expanded="true" height="103" name="Filter Examples (2)" width="90" x="313" y="34">
    <parameter key="parameter_expression" value="!contains(id,id_from_ES2)"/>
    <parameter key="condition_class" value="expression"/>
    <list key="filters_list"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.3.001" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="447" y="34">
    <list key="function_descriptions">
    <parameter key="id" value="concat(id,&quot;+&quot;,id_from_ES2)"/>
    <parameter key="att_1" value="att_1+att_1_from_ES2"/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.3.001" expanded="true" height="82" name="Select Attributes (3)" width="90" x="581" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="att_1|id"/>
    </operator>
    <connect from_port="in 1" to_op="Cartesian (2)" to_port="left"/>
    <connect from_port="in 2" to_op="Cartesian (2)" to_port="right"/>
    <connect from_op="Cartesian (2)" from_port="join" to_op="Filter Examples (2)" to_port="example set input"/>
    <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Generate Attributes (2)" to_port="example set input"/>
    <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Select Attributes (3)" to_port="example set input"/>
    <connect from_op="Select Attributes (3)" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="source_in 3" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="append" compatibility="7.3.001" expanded="true" height="124" name="Append" width="90" x="1184" y="136"/>
    <connect from_op="Subprocess" from_port="out 1" to_op="Aggregate" to_port="example set input"/>
    <connect from_op="Aggregate" from_port="example set output" to_op="Transpose" to_port="example set input"/>
    <connect from_op="Transpose" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Subprocess (2)" to_port="in 1"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Subprocess (2)" to_port="in 2"/>
    <connect from_op="Multiply" from_port="output 3" to_op="Subprocess (3)" to_port="in 2"/>
    <connect from_op="Multiply" from_port="output 4" to_op="Append" to_port="example set 3"/>
    <connect from_op="Subprocess (2)" from_port="out 1" to_op="Multiply (2)" to_port="input"/>
    <connect from_op="Multiply (2)" from_port="output 1" to_op="Append" to_port="example set 1"/>
    <connect from_op="Multiply (2)" from_port="output 2" to_op="Subprocess (3)" to_port="in 1"/>
    <connect from_op="Subprocess (3)" from_port="out 1" to_op="Append" to_port="example set 2"/>
    <connect from_op="Append" from_port="merged set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • jasonaz480 Member Posts: 4 Contributor I

    That is very kind of you to provide that example. I am going to give it a try shortly. I'm starting to learn just enough to become dangerous over the last couple of days. Initially my goals were not so much about creating predictive models but more about static (historical) reporting with a high degree of segmentation and views. After reading a couple of books and taking more tutorials I am getting drawn in deeper. The unfortunate part is that the data we have available to mine does not present strong enough correlations to create predictions with any confidence. I have been trying different models to get a yes/no answer on whether a client will buy a particular product, but there don't seem to be clear trends in the models with the data I have available at this time. I'm going to go back to zooming out, as was the original intention, and try to get marketing segments sorted by average spend based on combinations of up to 3 attributes. Any deeper than that and many of the resulting combos have such a small sample size that it would be hard to give them much credence anyway.

     

    The amount of analysis that is possible, though, is exciting. The challenge on this particular project is whether I will be able to acquire data that is actually meaningful when analyzed.

     

    Thank you.

    Jay

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi Again,

     

    Great to hear that you got infected by the data-mining virus.

     

    Keep in mind that the challenge is usually to build the right predictors so that the algorithms are able to find the patterns. A lot of the work in modelling is actually creating the right features.

     

    What learners did you try? I would recommend Gradient Boosted Trees, a (radial) SVM and Deep Learning. The SVM might be very strong if you have a lot of attributes. The Gradient Boosted Trees (or a Random Forest for a quick look) work well on nominal attributes. Deep Learning is generally hyped, but I got some good results on chemical data with heavy non-linear interactions.
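
    If you want to compare those learner families quickly outside of Studio, a rough scikit-learn sketch (not the RapidMiner operators, and with synthetic data standing in for yours) would be:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Synthetic stand-in for the "will they buy?" data set
    X, y = make_classification(n_samples=500, n_features=15, random_state=0)

    learners = {
        "Gradient Boosted Trees": GradientBoostingClassifier(),
        "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
        "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        "Neural net (simple Deep Learning stand-in)": make_pipeline(
            StandardScaler(), MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0)),
    }

    # 5-fold cross-validated accuracy for each learner
    for name, model in learners.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")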

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany