Simple Market Basket Analysis

rosesroses Member Posts: 2 Contributor I
edited November 2018 in Help
Hi Community,

I am a German student and I have been given the task of doing a market basket analysis (I hope that is translated correctly). It sounds very simple (and maybe it is :) ), so I will start by explaining:

My Data:
The BonID is unimportant, the BonNr is the number of a Bon (I think that is a bill) and the ArtikelBez is the name of the article on that bill.
For example, the first bill contains Feinwaschmittel (delicates detergent), Kaugummi (chewing gum) and Vollwaschmittel (heavy-duty detergent).

My task is now to "see" "association purchases", i.e. articles that are bought together. For example, 'Feinwaschmittel' is always bought together with 'Vollwaschmittel'.

I've done tutorials and tested RM for multiple hours, but I don't get it. Maybe because of my bad English :) Can someone please explain to me which RM components I need? Of course Apriori and/or FPGrowth, but in which order, with which settings, and why? ^^ And which other components?

Thanks a lot and excuse my bad English.

Best Regards

Marianne Rose


  • jlojlo Member Posts: 10 Contributor II
    Hi Marianne:

    [I'm using version 4.6. These things might be different in version 5. I haven't made the transition yet].

    You have the classical case of MBA. There are 2 ways in which your data might be formatted:

    1) A binary matrix: one variable that uniquely identifies the transaction, plus n columns that represent the different products available at the store.
    Each row is a transaction. You identify the products a customer buys by entering 1s in the corresponding columns. Example:

    tid, bananas, apples, pears, grapes
    1, 1, 0, 0, 1
    2, 0, 1, 0, 1
    3, 1, 1, 1, 0

    The first transaction includes bananas and grapes. The second apples and grapes. The third bananas, apples and pears.

    (To obtain rules from this format, you would read the data into Rapid-I, transform the 1s and 0s into true and false values with a Numeric2Binomial operator and then apply FPGrowth + Rules Generator.)

    2) Two columns: one for the unique transaction ID, the other for the product (there may be further columns with other info: items bought, date, discounts, etc.). This is obviously the most efficient way to store the information. You would represent the example above in the following way:

    tid, product
    1, bananas
    1, grapes
    2, apples
    2, grapes
    3, bananas
    3, apples
    3, pears

    Your data are formatted in the second way. The nice people at Rapid-I have written the code to read and process that info in Rapid-I. Take a look at the sample code Transaction2Basket.xml that you can find in the folder \samples\Preprocessing\.
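
For readers who want to see the idea outside RapidMiner, the step from the two-column format to the binary matrix can be sketched in plain Python. This is only an illustration of the transformation, not the contents of Transaction2Basket.xml; the function name `to_binary_matrix` is made up for this sketch, and the rows are the fruit example from this post.

```python
from collections import defaultdict

# Long-format rows (tid, product), copied from the example in this post.
rows = [
    (1, "bananas"), (1, "grapes"),
    (2, "apples"), (2, "grapes"),
    (3, "bananas"), (3, "apples"), (3, "pears"),
]


def to_binary_matrix(rows):
    """Return (sorted product names, {tid: [bool per product]})."""
    # Collect the set of products seen for each transaction ID.
    baskets = defaultdict(set)
    for tid, product in rows:
        baskets[tid].add(product)

    # One column per distinct product, in alphabetical order.
    products = sorted({product for _, product in rows})

    # One row per transaction: True where the product is in the basket.
    matrix = {
        tid: [product in items for product in products]
        for tid, items in sorted(baskets.items())
    }
    return products, matrix


products, matrix = to_binary_matrix(rows)
print("tid", *products)
for tid, flags in matrix.items():
    print(tid, *flags)
```

The booleans play the role of the true/false values that FPGrowth expects in place of 1s and 0s.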

    Apriori and FPGrowth are two different algorithms for finding frequent itemsets. From these you can construct association rules. Take a look also at the sample code \samples\Learner\AssociationRules.xml.
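
To make the two stages concrete, here is a rough sketch in plain Python: first a level-wise (Apriori-style) search for frequent itemsets, then rule generation from those itemsets. It is a toy illustration under stated assumptions, not RapidMiner's Apriori or FPGrowth implementation; the function names and the thresholds `min_support` and `min_confidence` are chosen for this example only.

```python
from itertools import combinations

# The three baskets from the fruit example in this post.
baskets = [
    {"bananas", "grapes"},
    {"apples", "grapes"},
    {"bananas", "apples", "pears"},
]


def frequent_itemsets(baskets, min_support):
    """Level-wise search: keep itemsets whose support is >= min_support."""
    n = len(baskets)
    items = sorted({item for basket in baskets for item in basket})
    result = {}
    candidates = [frozenset([item]) for item in items]
    while candidates:
        level = {}
        for candidate in candidates:
            # Support = fraction of baskets that contain the whole candidate.
            support = sum(candidate <= basket for basket in baskets) / n
            if support >= min_support:
                level[candidate] = support
        result.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates.
        candidates = list({
            a | b for a, b in combinations(list(level), 2)
            if len(a | b) == len(a) + 1
        })
    return result


def association_rules(itemsets, min_confidence):
    """Return (premise, conclusion, support, confidence) tuples."""
    rules = []
    for itemset, support in itemsets.items():
        for size in range(1, len(itemset)):
            for premise in map(frozenset, combinations(sorted(itemset), size)):
                # Confidence = support(itemset) / support(premise).
                confidence = support / itemsets[premise]
                if confidence >= min_confidence:
                    rules.append(
                        (set(premise), set(itemset - premise), support, confidence)
                    )
    return rules


itemsets = frequent_itemsets(baskets, min_support=0.3)
rules = association_rules(itemsets, min_confidence=1.0)
for premise, conclusion, support, confidence in rules:
    print(sorted(premise), "->", sorted(conclusion), round(support, 2), confidence)
```

With only three baskets the thresholds have to be very low to produce any rules; on real data you would raise them until the rule list becomes manageable.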

    The best way to learn this program is to go thru the examples provided by Rapid-I in the samples folder.

    This should get you started.

  • rosesroses Member Posts: 2 Contributor I
    Thanks a lot for this detailed answer. With the Transaction2Basket.xml most of the way is done! :) And it is so simple...

    If I have other questions for the next tasks, I know where to ask, once I have made it through the examples.

    But I think the examples will tell me everything I want to know :)

    If anyone needs a simple tutorial to get started: (ger) (eng)

    Thanks a lot jlo.

    Until next time :)

    Greetings Marianne
  • rapexorapexo Member Posts: 4 Contributor I
    hi to all,

    I have a similar MBA problem. I am using RapidMiner 5. My data is structured in the second way (four columns in total, but the only interesting ones should be bonID and articleNR). I selected bonID as "ID" and articleNR is just "regular" (or should it be "label"?).

    After loading the data, I added the "Nominal to Binominal" operator and connected it to the "FP-Growth" operator (pre to exa). RM complains that "Meta data is underspecified. Cannot check precondition". Which attribute role must be set for "articleNR"? Or is something else wrong? Also, I am not sure about the "Nominal to Binominal" operator. What has to be set there?
  • rapexorapexo Member Posts: 4 Contributor I
    Hi, still working on the same problem! In the meantime I realized that the attempts I described in my previous post could not work. Now my problem is that I (still) have data as described (similar to the example data "Market-Data" of RM) and I do not know how to join or aggregate it so that I get complete baskets. The example shown in the tutorial has a different kind of data, so it is pretty useless for me.

    To set the record straight, my aim is just a simple MBA (Apriori or FP-Growth). This is what my data looks like:
    attrib_1  attrib_2  transactionNR  articleNR
    abc       yxz       1              321
    kdd       dms       1              654
    bic       fsf       1              789
    osi       fpg       2              258
    ais       mss       2              159

    => two baskets with three and two articles ([321,654,789] and [258,159])

    thanks in advance!
  • jlojlo Member Posts: 10 Contributor II

    Sorry, but I don't see any difference between the format of your data and this format:

    tid, item
    1, bananas
    1, grapes
    2, apples
    2, grapes
    3, bananas
    3, apples
    3, pears

    You should be able to use the sample program to transform the data into a binary matrix and then find itemsets and association rules.
    Filter out the variables you don't need (like attrib_1 and attrib_2) and declare your variable transactionNR as TID and articleNR as ITEM. I don't know whether your problems are related to the version of the program you are using, but I truly doubt it. I use version 4.6.
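
As a sketch of what "filter the other variables and treat transactionNR as TID and articleNR as ITEM" amounts to, here is the same idea in plain Python. It is not the RapidMiner process itself; the helper `read_baskets` is hypothetical and assumes whitespace-separated columns with a header line.

```python
# The sample rows from the question, with a whitespace-separated header.
raw = """\
attrib_1 attrib_2 transactionNR articleNR
abc yxz 1 321
kdd dms 1 654
bic fsf 1 789
osi fpg 2 258
ais mss 2 159
"""


def read_baskets(text, tid_column, item_column):
    """Group the item column by the transaction column, ignoring all others."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    tid_idx = header.index(tid_column)
    item_idx = header.index(item_column)

    baskets = {}
    for line in lines[1:]:
        fields = line.split()
        # Only the TID and ITEM columns are used; attrib_1/attrib_2 are dropped.
        baskets.setdefault(fields[tid_idx], []).append(fields[item_idx])
    return baskets


baskets = read_baskets(raw, "transactionNR", "articleNR")
for tid, items in baskets.items():
    print(tid, items)
```

Each basket can then be turned into one row of the binary matrix, which is the form the frequent itemset operators work on.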

  • rapexorapexo Member Posts: 4 Contributor I
    Indeed, the data I use is of the same type. That is not my actual problem. What I want to know is how data in this format has to be processed so that frequent patterns can be created (for example with FP-Growth). In the tutorial example (is there another one that uses my type of data?) they use data in a different format. It looks like this:

    id   label      a1   a2   a3   a4
    id1  iris-seto  5.1  3.5  1.4  0.2
    id2  iris-seto  4.9  3.0  1.4  0.2
    id3  iris-vers  4.7  3.2  1.3  0.2

    There, the preprocessing operator chain consists of the frequency discretization operator, which discretizes numerical attributes by putting the values into bins of equal frequency, and the conversion of those bins into true and false. The output of that preprocessing is of a different type; it looks something like this:

    id   label    a1range1  a1range2  a1range3  a1range4  a2range1  a2range2  a2range3  a2range4  etc....
    id1  iris-se  false     true      false     false     false     false     false     false
    id2  iris-se  true      false     false     false     false     false     true      false
    That means that you have every basket in one row. The type of data I am using (similar to the type you described) has the baskets spread over several rows. On top of that, I have a few thousand different articles. Which preprocessing steps are needed before frequent patterns can be created?

    thanks again!
  • nmarknmark Member Posts: 1 Contributor I
    What if we have up to 100 000 items (goods) and several million tids (transactions)?
    It's hard to transpose this data into a binary matrix.
    Any help?
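
One possible direction, sketched here in plain Python rather than as a RapidMiner process, is to skip the dense tid-by-item matrix entirely and count directly on the transaction lists. The sketch below makes two passes: the first counts single items, the second counts only pairs of items that were frequent on their own. The function name `frequent_pairs`, the threshold `min_count`, and the fourth transaction in the toy data are assumptions made up for this illustration; whether this scales to millions of transactions depends on how many items survive the first pass.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_count):
    """Count item pairs without building a tid-by-item binary matrix."""
    # Pass 1: count how many transactions contain each item.
    item_counts = Counter()
    for items in transactions:
        item_counts.update(set(items))
    frequent_items = {i for i, c in item_counts.items() if c >= min_count}

    # Pass 2: count pairs, but only among items that are frequent themselves.
    pair_counts = Counter()
    for items in transactions:
        kept = sorted(set(items) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {pair: c for pair, c in pair_counts.items() if c >= min_count}


# Toy data: the three baskets from earlier in the thread plus one
# made-up transaction so that some pairs reach the threshold.
transactions = [
    ["bananas", "grapes"],
    ["apples", "grapes"],
    ["bananas", "apples", "pears"],
    ["bananas", "grapes", "apples"],
]

pairs = frequent_pairs(transactions, min_count=2)
for pair, count in sorted(pairs.items()):
    print(pair, count)
```

For real data each pass would stream the transactions from a file or database instead of holding them in a list, so memory use is driven by the counters rather than by the number of transactions.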