Market Basket Analysis: First Timer

WindsAloft · March 2010

Okay, so first of all, the tutorials are nice. I watched them all but still cannot figure out how to do a Market Basket Analysis.

I got so frustrated with the error messages that I deleted everything I created and I'm starting over and typing this step by step so maybe someone can point out my mistake.

1. Open RapidMiner
2. File, Import Data, Import CSV file
3. I selected a .csv file which I am using as a sample. It has 3 headers (CustomerID, itemID, itemCount) and sample data.
4. Wizard suggests CustomerID to be Nominal, itemID to be Nominal, itemCount to be integer.
   Here is a sample row of my data: CustomerID, D21953; itemID, E3; itemCount, 1
5. Wizard suggests I set all roles as Regular
6. I choose my Local Repository as the location and name it DATA
7. I go to File, Open Template, Market Basket Analysis, Next
8. I leave the Values the same, since I made my example headers to match perfectly.
9. For Retrieve.repository_entry, I manually type in //My Repository/DATA because when I click the little folder and select DATA in my repository, the field stays blank.

I show 3 red errors.
"The Attribute customerIDAttributeName is missing in the input example set" - from Pivot
"The Attribute itemIDAttributeName is missing in the input example set" - from Pivot
"The Attribute customerIDAttributeName is missing in the example set" - from Set Role (quickfix)

Now what?

WindsAloft · March 2010

Um, did I perhaps post this in the wrong forum?

haddock · March 2010

Greetings Windsaloft!

Looks like the lights are on but nobody is at home, so let me confuse you further...

I've used the same template, and it needs some attention. Specifically, it uses macros (the RM equivalent of variables, which show as %{XXXX} in parameters) but does not assign values to them, so no wonder it confuses you! I've butchered the template by replacing the data call with a generator, like this...


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="431" width="915">
      <operator activated="false" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Samples/data/Transactions"/>
      </operator>
      <operator activated="false" class="set_macro" expanded="true" height="60" name="Define Item Count" width="90" x="179" y="30">
        <parameter key="macro" value="itemCountAttributeName"/>
        <parameter key="value" value="itemCount"/>
      </operator>
      <operator activated="false" class="set_macro" expanded="true" height="60" name="Define Customer" width="90" x="313" y="30">
        <parameter key="macro" value="customerIdAttributeName"/>
        <parameter key="value" value="customerId"/>
      </operator>
      <operator activated="false" breakpoints="after" class="set_macro" expanded="true" height="60" name="Define Item" width="90" x="447" y="30">
        <parameter key="macro" value="itemIdAttributeName"/>
        <parameter key="value" value="itemId"/>
      </operator>
      <operator activated="false" class="aggregate" expanded="true" height="76" name="Aggregate" width="90" x="45" y="255">
        <list key="aggregation_attributes">
          <parameter key="amount" value="sum"/>
        </list>
        <parameter key="group_by_attributes" value="customer_id|product_id"/>
      </operator>
      <operator activated="true" class="generate_transaction_data" expanded="true" height="60" name="Generate Transaction Data" width="90" x="4" y="113">
        <parameter key="number_clusters" value="1"/>
      </operator>
      <operator activated="true" class="pivot" expanded="true" height="76" name="Pivot" width="90" x="179" y="210">
        <parameter key="group_attribute" value="Id"/>
        <parameter key="index_attribute" value="Item"/>
      </operator>
      <operator activated="true" class="replace_missing_values" expanded="true" height="94" name="Replace Missing Values" width="90" x="246" y="75">
        <parameter key="default" value="zero"/>
        <list key="columns"/>
      </operator>
      <operator activated="true" class="numerical_to_binominal" expanded="true" height="76" name="Numerical to Binominal" width="90" x="380" y="75"/>
      <operator activated="false" class="set_role" expanded="true" height="76" name="Set Role" width="90" x="313" y="300">
        <parameter key="name" value="%{customerIdAttributeName}"/>
        <parameter key="target_role" value="id"/>
      </operator>
      <operator activated="true" class="fp_growth" expanded="true" height="76" name="FP-Growth" width="90" x="447" y="210"/>
      <operator activated="true" class="create_association_rules" expanded="true" height="60" name="Create Association Rules" width="90" x="581" y="210">
        <parameter key="min_confidence" value="0.1"/>
      </operator>
      <connect from_op="Generate Transaction Data" from_port="output" to_op="Pivot" to_port="example set input"/>
      <connect from_op="Pivot" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
      <connect from_op="Replace Missing Values" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
      <connect from_op="Numerical to Binominal" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
      <connect from_op="FP-Growth" from_port="frequent sets" to_op="Create Association Rules" to_port="item sets"/>
      <connect from_op="Create Association Rules" from_port="rules" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Actually there are also relevant samples (1-25 and 2-23), and this subject has raised its ugly head before, as a quick search for "Market Basket" shows.

Pip Pip ;D

WindsAloft · March 2010

Thanks for the reply -- I did see a lot of XML when I was doing preliminary searches, but it was so far over my head that I couldn't understand what was going on. I had no idea you could just paste the code, go back to design view, and visually see what was going on!

I tried your process but I didn't necessarily get any results that I could see.... but I am going to mess around with this.

Thanks for the reply!

land · March 2010

Hi,
still any problems?

Greetings,
Sebastian

WindsAloft · March 2010

Yes actually, with the above process I keep getting an error: "Regular attributes must be of type binomial." The preview shows the ID field as nominal (I assume that's my problem).

What's weird is, I actually *get* results with the process above (it generates its own recordset). My OWN recordset has an ID field which is text, so when I replace the first operator with a Retrieve, everything transitions fine except I don't get any results. And I'm betting the nominal field is the problem.

I've tried adding a Type Conversion process in between: Nominal to Binomial. But that didn't work either.

haddock · March 2010

Ooops,

Mea maxima culpa :-[ I pasted in completely the wrong code.. this is what should have been there...

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="391" width="915">
      <operator activated="false" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Samples/data/Transactions"/>
      </operator>
      <operator activated="false" class="set_macro" expanded="true" height="60" name="Define Item Count" width="90" x="179" y="30">
        <parameter key="macro" value="itemCountAttributeName"/>
        <parameter key="value" value="itemCount"/>
      </operator>
      <operator activated="false" class="set_macro" expanded="true" height="60" name="Define Customer" width="90" x="313" y="30">
        <parameter key="macro" value="customerIdAttributeName"/>
        <parameter key="value" value="customerId"/>
      </operator>
      <operator activated="false" breakpoints="after" class="set_macro" expanded="true" height="60" name="Define Item" width="90" x="447" y="30">
        <parameter key="macro" value="itemIdAttributeName"/>
        <parameter key="value" value="itemId"/>
      </operator>
      <operator activated="false" class="aggregate" expanded="true" height="76" name="Aggregate" width="90" x="45" y="255">
        <list key="aggregation_attributes">
          <parameter key="amount" value="sum"/>
        </list>
        <parameter key="group_by_attributes" value="customer_id|product_id"/>
      </operator>
      <operator activated="true" class="generate_transaction_data" expanded="true" height="60" name="Generate Transaction Data" width="90" x="4" y="113">
        <parameter key="number_clusters" value="1"/>
      </operator>
      <operator activated="true" class="pivot" expanded="true" height="76" name="Pivot" width="90" x="179" y="210">
        <parameter key="group_attribute" value="Id"/>
        <parameter key="index_attribute" value="Item"/>
      </operator>
      <operator activated="true" class="replace_missing_values" expanded="true" height="94" name="Replace Missing Values" width="90" x="246" y="75">
        <parameter key="default" value="zero"/>
        <list key="columns"/>
      </operator>
      <operator activated="true" class="numerical_to_binominal" expanded="true" height="76" name="Numerical to Binominal" width="90" x="380" y="75"/>
      <operator activated="false" class="set_role" expanded="true" height="76" name="Set Role" width="90" x="313" y="300">
        <parameter key="name" value="%{customerIdAttributeName}"/>
        <parameter key="target_role" value="id"/>
      </operator>
      <operator activated="true" class="fp_growth" expanded="true" height="76" name="FP-Growth" width="90" x="447" y="210"/>
      <operator activated="true" class="create_association_rules" expanded="true" height="60" name="Create Association Rules" width="90" x="581" y="210">
        <parameter key="min_confidence" value="0.1"/>
      </operator>
      <connect from_op="Generate Transaction Data" from_port="output" to_op="Pivot" to_port="example set input"/>
      <connect from_op="Pivot" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
      <connect from_op="Replace Missing Values" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
      <connect from_op="Numerical to Binominal" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
      <connect from_op="Numerical to Binominal" from_port="original" to_port="result 2"/>
      <connect from_op="FP-Growth" from_port="frequent sets" to_op="Create Association Rules" to_port="item sets"/>
      <connect from_op="Create Association Rules" from_port="rules" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Hope that goes a bit better!

WindsAloft · March 2010

The grey boxes can be deleted, correct?

And is it okay if I still get the caution for FP-Growth that regular attributes must be binomial? I will experiment with this and see if I can put my own dataset into the input, and see if it works.

WindsAloft · March 2010

I think maybe the problem is that there are descriptive statistics such as Range, etc. for my ID field, which happens to be all text. My dataset is OrderID and ProductID, and they are both text. Like, E30098A E230843F E230289D; Product0001, Product0002, Product00003

Perhaps that is my problem. I get the warning with the process you posted, but it actually is successful despite the warning, probably because the IDs are numbers?

haddock · March 2010

Hi,

Yep, you can bin the gray jobs, and you can ignore the warning, especially as it all runs OK. So all you need to do is replace the example generator, and all should be well....

Make sure that your meta-data matches on attribute Name and Content:

Role     Name    Content
id       Id      nominal
regular  Item    nominal
regular  Amount  integer

WindsAloft · March 2010

If my example set differs from the generator's in content, i.e. my IDs are text, should I be applying the to-binomial conversion at the beginning?

haddock · March 2010

Hi there,

Don't think so, concentrate first on loading the data and seeing ( from the meta-data ) that RM thinks it has data as I described.

WindsAloft · March 2010

Ok. I had been modifying the processes in the graphical view as I switched to my example set, because the column names were slightly different.

Instead of doing that, I'll simply create a new set of data which has the names and content you describe above. That should eliminate the possibility that I was making mistakes while reconfiguring the different processes.

haddock · March 2010

Hi there,

I've imported this CSV into the repository and substituted that repository entry for the generator:

Id,Item,Amount
E30098AE,Product0001,1
E230843F,Product0001,1
E230289D,Product0002,2
E30098AE,Product0002,1
E230843F,Product0001,1
E230289D,Product0002,2

And it works ( in the sense it doesn't fall over ).
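
For readers who want to see what the Pivot, Replace Missing Values and Numerical to Binominal steps are doing to this CSV, here is a rough sketch in plain Python (not RapidMiner code). It assumes that repeated Id/Item rows are summed before pivoting, and the function name `pivot_to_baskets` is made up for illustration.

```python
import csv
import io

CSV_TEXT = """Id,Item,Amount
E30098AE,Product0001,1
E230843F,Product0001,1
E230289D,Product0002,2
E30098AE,Product0002,1
E230843F,Product0001,1
E230289D,Product0002,2
"""

def pivot_to_baskets(csv_text):
    """Return (items, table); table maps each Id to a {item: bool} row."""
    totals = {}  # (Id, Item) -> summed Amount (assumption: duplicates are summed)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["Id"], row["Item"])
        totals[key] = totals.get(key, 0) + int(row["Amount"])
    ids = sorted({id_ for id_, _ in totals})
    items = sorted({item for _, item in totals})
    # A missing (Id, Item) pair counts as 0; any positive amount becomes True.
    table = {
        id_: {item: totals.get((id_, item), 0) > 0 for item in items}
        for id_ in ids
    }
    return items, table

items, table = pivot_to_baskets(CSV_TEXT)
for id_, row in table.items():
    print(id_, [item for item in items if row[item]])
```

Each Id ends up as one row with a true/false column per item, which is the shape a frequent item set miner expects.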

;D

WindsAloft · March 2010

Okay, I'm moving forward. I stopped trying to be smart and I just renamed my headers to match the process so that I didn't have that problem.

Now it doesn't break. But my association rules are blank. However, this might mean I'm filtering out rules that exist in my data but didn't meet a criterion.

To get the maximum number of results, I set

FP-Growth
min number of items = 0;
positive value = [blank]
min support = 0
max items = -1
must contain = [blank]

Create Association Rules
Min Conf = 0
Gain theta = 0
laplace k = 0

But still can't see rules.

Here are some real rows from my data for which I would expect some sort of rule:
Id Item Amount
D11131 E1 1
D11131 E5 1
D11124 E5 1
D11125 E5 1

I should see a rule appearing for E1 -- E5, right?
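
As a sanity check on the four rows above, the usual textbook definitions of support and confidence can be worked out by hand. The sketch below is plain Python, not RapidMiner, and it treats each Id as one basket and ignores Amount. What RapidMiner actually reports also depends on the operator parameters and on the rest of the data.

```python
rows = [
    ("D11131", "E1", 1),
    ("D11131", "E5", 1),
    ("D11124", "E5", 1),
    ("D11125", "E5", 1),
]

# Group items into one basket (set) per Id.
baskets = {}
for id_, item, _amount in rows:
    baskets.setdefault(id_, set()).add(item)

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets.values() if itemset <= basket)
    return hits / len(baskets)

def confidence(premise, conclusion):
    """support(premise and conclusion together) / support(premise)."""
    return support(premise | conclusion) / support(premise)

print("support {E1, E5}:", support({"E1", "E5"}))
print("confidence E1 -> E5:", confidence({"E1"}, {"E5"}))
print("confidence E5 -> E1:", confidence({"E5"}, {"E1"}))
```

On these rows alone there are three baskets, {E1, E5} occurs in one of them, and every basket containing E1 also contains E5, so E1 -> E5 has confidence 1 while E5 -> E1 has confidence 1/3. In a full dataset the support of {E1, E5} is measured against all baskets, so it can be much smaller than it looks here.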

Now we're on the right track. I'm thinking my example data isn't very good.

haddock · March 2010

All good, now get sensible data and lower the criterion constraints until rules emerge.

Happy dredging!

WindsAloft · March 2010

I really appreciate your help so much, I hope I can learn how to use this tool!

Could you help me find the criteria that would give the maximum number of results? Or did I have it right in my previous post?

haddock · March 2010

Take it step by step. First thing is to understand about frequent item sets, and the parameters for their generation. If in doubt, as always, check out Wikipedia. Then do the rule building end.
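
To make the idea of frequent item sets and a minimum support threshold concrete, here is a small brute-force sketch in plain Python. It is not FP-Growth and not RapidMiner code, just the definition applied directly. The baskets and the name `frequent_itemsets` are made up for illustration.

```python
from itertools import combinations

def frequent_itemsets(baskets, min_support):
    """Map each item set (as a sorted tuple) to its support if >= min_support."""
    n = len(baskets)
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            count = sum(1 for basket in baskets if set(candidate) <= basket)
            if count / n >= min_support:
                result[candidate] = count / n
                found = True
        if not found:
            break  # no frequent set of this size, so no larger one can be frequent
    return result

# Hypothetical baskets, one set of items per customer.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

for itemset, supp in frequent_itemsets(baskets, 0.5).items():
    print(itemset, supp)
```

Lowering `min_support` lets more (and larger) item sets through, which is the knob to relax first when no rules appear. Rules are then built only from the item sets that survive this step.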

pip pip

steve0 · April 2010

Hi

I am just reading this thread, and it is very good. I have a question: how would I modify the code to include a zip code, so that it provides association rules for each zip code?

Thank you

steve0 · April 2010

Just on my previous question: is a clustering method needed for something like this? All the zip codes (an attribute) are there already. I just want to see how the market basket analysis can be done by zip code, so that the association rules appear per zip code.

Thanks

land · April 2010

Hi,
do you want a rule set per zip code? Then you would have to split your data according to the zip codes and perform the process on each of these subsets. You could do this with a Filter Examples and a Loop Values operator.
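
The split-then-mine idea can be sketched outside RapidMiner in a few lines of plain Python. The rows, the zip codes and the helper names below are hypothetical, and the rule builder only looks at single-item rules to keep the example short.

```python
from collections import defaultdict

# Hypothetical rows: (zip_code, Id, Item)
rows = [
    ("10001", "A", "E1"), ("10001", "A", "E5"), ("10001", "B", "E5"),
    ("94105", "C", "E1"), ("94105", "C", "E2"), ("94105", "D", "E2"),
]

def baskets_by_group(rows):
    """Split rows by zip code, then collect one basket (set of items) per Id."""
    groups = defaultdict(lambda: defaultdict(set))
    for group, id_, item in rows:
        groups[group][id_].add(item)
    return groups

def pair_rules(baskets, min_confidence):
    """Single-item rules (a, b, confidence) with confidence >= min_confidence."""
    items = sorted(set().union(*baskets.values()))
    rules = []
    for a in items:
        with_a = [basket for basket in baskets.values() if a in basket]
        for b in items:
            if a == b:
                continue
            conf = sum(1 for basket in with_a if b in basket) / len(with_a)
            if conf >= min_confidence:
                rules.append((a, b, conf))
    return rules

# Run the same rule building once per zip code.
rules_per_zip = {
    zip_code: pair_rules(baskets, 0.9)
    for zip_code, baskets in baskets_by_group(rows).items()
}
for zip_code, rules in rules_per_zip.items():
    print(zip_code, rules)
```

The grouping attribute has to survive until the split happens, and the Id is needed afterwards to build the baskets, so neither should be dropped before the loop.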

Greetings,
Sebastian

steve0 · April 2010

Hi Sebastian

Yes, it is per zip code.

I am using the code as shown:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="391" width="915">
      <operator activated="false" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Samples/data/Transactions"/>
      </operator>
      <operator activated="false" class="set_macro" expanded="true" height="60" name="Define Item Count" width="90" x="179" y="30">
        <parameter key="macro" value="itemCountAttributeName"/>
        <parameter key="value" value="itemCount"/>
      </operator>
      <operator activated="false" class="set_macro" expanded="true" height="60" name="Define Customer" width="90" x="313" y="30">
        <parameter key="macro" value="customerIdAttributeName"/>
        <parameter key="value" value="customerId"/>
      </operator>
      <operator activated="false" breakpoints="after" class="set_macro" expanded="true" height="60" name="Define Item" width="90" x="447" y="30">
        <parameter key="macro" value="itemIdAttributeName"/>
        <parameter key="value" value="itemId"/>
      </operator>
      <operator activated="false" class="aggregate" expanded="true" height="76" name="Aggregate" width="90" x="45" y="255">
        <list key="aggregation_attributes">
          <parameter key="amount" value="sum"/>
        </list>
        <parameter key="group_by_attributes" value="customer_id|product_id"/>
      </operator>
      <operator activated="true" class="generate_transaction_data" expanded="true" height="60" name="Generate Transaction Data" width="90" x="4" y="113">
        <parameter key="number_clusters" value="1"/>
      </operator>
      <operator activated="true" class="pivot" expanded="true" height="76" name="Pivot" width="90" x="179" y="210">
        <parameter key="group_attribute" value="Id"/>
        <parameter key="index_attribute" value="Item"/>
      </operator>
      <operator activated="true" class="replace_missing_values" expanded="true" height="94" name="Replace Missing Values" width="90" x="246" y="75">
        <parameter key="default" value="zero"/>
        <list key="columns"/>
      </operator>
      <operator activated="true" class="numerical_to_binominal" expanded="true" height="76" name="Numerical to Binominal" width="90" x="380" y="75"/>
      <operator activated="false" class="set_role" expanded="true" height="76" name="Set Role" width="90" x="313" y="300">
        <parameter key="name" value="%{customerIdAttributeName}"/>
        <parameter key="target_role" value="id"/>
      </operator>
      <operator activated="true" class="fp_growth" expanded="true" height="76" name="FP-Growth" width="90" x="447" y="210"/>
      <operator activated="true" class="create_association_rules" expanded="true" height="60" name="Create Association Rules" width="90" x="581" y="210">
        <parameter key="min_confidence" value="0.1"/>
      </operator>
      <connect from_op="Generate Transaction Data" from_port="output" to_op="Pivot" to_port="example set input"/>
      <connect from_op="Pivot" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
      <connect from_op="Replace Missing Values" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
      <connect from_op="Numerical to Binominal" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
      <connect from_op="Numerical to Binominal" from_port="original" to_port="result 2"/>
      <connect from_op="FP-Growth" from_port="frequent sets" to_op="Create Association Rules" to_port="item sets"/>
      <connect from_op="Create Association Rules" from_port="rules" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Where exactly can I put these operators in this process? Rather than zip code, I am looking at State.

Thank you

steve0 · April 2010

This is what I have tried:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="566" width="915">
      <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve (2)" width="90" x="45" y="30">
        <parameter key="repository_entry" value="Total Sales by State"/>
      </operator>
      <operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes" width="90" x="45" y="165">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Item|Products_Sold"/>
      </operator>
      <operator activated="true" class="rename" expanded="true" height="76" name="Rename (2)" width="90" x="45" y="300">
        <parameter key="old_name" value="Products_Sold"/>
        <parameter key="new_name" value="Customer Buys"/>
      </operator>
      <operator activated="true" class="loop_values" expanded="true" height="76" name="Loop Values (2)" width="90" x="246" y="300">
        <parameter key="attribute" value="State"/>
        <process expanded="true" height="415" width="689">
          <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="45" y="30">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="State=%{loop_value}"/>
          </operator>
          <operator activated="true" class="pivot" expanded="true" height="76" name="Pivot" width="90" x="179" y="30">
            <parameter key="group_attribute" value="Id"/>
            <parameter key="index_attribute" value="Item"/>
          </operator>
          <operator activated="true" class="numerical_to_binominal" expanded="true" height="76" name="Numerical to Binominal" width="90" x="313" y="30"/>
          <operator activated="true" class="fp_growth" expanded="true" height="76" name="FP-Growth (2)" width="90" x="447" y="30"/>
          <operator activated="true" class="create_association_rules" expanded="true" height="60" name="Create Association Rules (2)" width="90" x="581" y="120">
            <parameter key="min_confidence" value="0.95"/>
          </operator>
          <connect from_port="example set" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Pivot" to_port="example set input"/>
          <connect from_op="Pivot" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
          <connect from_op="Numerical to Binominal" from_port="example set output" to_op="FP-Growth (2)" to_port="example set"/>
          <connect from_op="FP-Growth (2)" from_port="example set" to_op="Create Association Rules (2)" to_port="item sets"/>
          <connect from_op="Create Association Rules (2)" from_port="rules" to_port="out 1"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve (2)" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Rename (2)" to_port="example set input"/>
      <connect from_op="Rename (2)" from_port="example set output" to_op="Loop Values (2)" to_port="example set"/>
      <connect from_op="Loop Values (2)" from_port="out 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

but I keep getting: Process failed. Reason: com.rapidminer.example.set.NonSpecialAttributesExampleSet cannot be cast to com.rapidminer.operator.learner.associations.FrequentItemSets

I want to show the associations by State as results one after another.
