incorporating control groups

jamesbeerbower Member Posts: 3 Contributor I

I'm a newbie to data mining and I'm trying to figure out how to use a control group in the analysis. For example, I have 120,000 customers who are candidates to receive a mailing, of which only 100,000 (randomly selected) actually receive it. The effect that we want to maximize is the difference between the control group's response and the target group's response. I did find a paper exploring the issue, by Victor Lo.


He has some extensive suggestions -- the essence of which is quoted below.  Can anyone comment on this?  Are there better or different ways? Are there tools available in RapidMiner explicitly to help with control group analysis?
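To make the target quantity concrete: the effect described above is the treatment response rate minus the control response rate. A minimal Python sketch follows. The group sizes come from the mailing described above, but the responder counts and the function names are invented purely for illustration.

```python
def response_rate(responders: int, group_size: int) -> float:
    """Share of a group that responded."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return responders / group_size


def observed_uplift(treated_resp, treated_n, control_resp, control_n):
    """Treatment response rate minus control response rate."""
    return response_rate(treated_resp, treated_n) - response_rate(control_resp, control_n)


# Group sizes from the post; the responder counts are invented for illustration.
TREATED_N, CONTROL_N = 100_000, 20_000
uplift = observed_uplift(3_000, TREATED_N, 400, CONTROL_N)
print(f"uplift = {uplift:.3%}")
```

Comparing rates rather than raw counts matters here because the two groups differ in size.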

Jamie Beerbower
"1. Include data, {Yi,Xi} from both the treatment and control
groups in the analysis data set;
2. Assign a dummy variable Ti to 1 for the treatment group and
0 for the control group;
3. Divide the data set into training and hold-out samples;
4. Further divide the training sample into two sub-samples by
Ti, i.e. one is treatment and the other is control;
5. Choose a variable selection method (or called feature
extraction). In each sub-sample (treatment and control), use
the method to narrow down your list of independent
variables, Xi (often an essential step in data mining as there
are normally hundreds of independent variables);
6. Take the union of the two reduced sets of independent
variables from 5 and thus, the new Xi has only q elements,
where q<original number of independent variables, p;
7. Multiply all independent variables, Xi, (from step 6) by Ti to
form the interaction effects, Xi*Ti;
8. Choose a data mining or statistical technique for supervised
9. Fit a model using Yi as the dependent variable and Xi, Ti, and
Xi*Ti as independent variables;
10. Use stepwise procedure (or similar model selection
procedure) to determine the best parsimonious model.
After the best model is selected, we propose the following
procedure for validation using the holdout sample:
1. For each individual in the hold-out sample, compute the
predicted values of expected Yi for both the treatment and
control, i.e. predict E(Yi|Xi;treatment) and E(Yi|Xi;control);
2. Subtract the control value from the treatment value to
estimate the treatment and control difference (in order to
achieve objective (3));
3. Rank and decile the entire hold-out sample by the predicted
4. In each decile, compute the observed mean value of Yi’s in
the treatment group and the observed mean value of Yi’s in
the control group and then take the observed difference;
5. Plot the observed difference between treatment and control
by decile to validate the model;
6. The expected true lift can be measured by how much the top
decile(s) perform better than random using the observed
treatment and control difference from step 6."

From The True Lift Model - A Novel Data Mining Approach to
Response Modeling in Database Marketing
Victor S.Y. Lo
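
As a rough illustration of the quoted modelling steps 2, 7 and 9 and of validation steps 1 and 2, here is a minimal Python sketch. It is not Dr. Lo's implementation: logistic regression as the technique, the hand-rolled gradient descent, the function names and the toy data are all assumptions, and the train/hold-out split, variable selection and stepwise selection are left out.

```python
import math


def design_row(x, t):
    """Steps 2 and 7: the features Xi, the dummy Ti, and the interactions Xi*Ti."""
    return list(x) + [t] + [xi * t for xi in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, ys, epochs=5000, lr=1.0):
    """Step 9, with logistic regression as the (assumed) technique.

    Plain batch gradient descent on the log-loss; returns (bias, weights).
    """
    n, k = len(rows), len(rows[0])
    bias, weights = 0.0, [0.0] * k
    for _ in range(epochs):
        grad_b, grad_w = 0.0, [0.0] * k
        for row, y in zip(rows, ys):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, row))) - y
            grad_b += err
            for j, v in enumerate(row):
                grad_w[j] += err * v
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return bias, weights


def predict(model, x, t):
    """Predicted response probability for features x under treatment flag t."""
    bias, weights = model
    row = design_row(x, t)
    return sigmoid(bias + sum(w * v for w, v in zip(weights, row)))


def predicted_uplift(model, x):
    """Validation steps 1-2: E(Y|X; treatment) minus E(Y|X; control)."""
    return predict(model, x, 1) - predict(model, x, 0)


# Invented toy data: one binary feature, 10 customers per (x, t) cell.
# Only the x=1 customers react to the treatment (3 of 10 vs. 1 of 10).
cells = {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 3}
rows, ys = [], []
for (x, t), responders in cells.items():
    for i in range(10):
        rows.append(design_row([x], t))
        ys.append(1 if i < responders else 0)

model = fit_logistic(rows, ys)
uplift_x0 = predicted_uplift(model, [0])
uplift_x1 = predicted_uplift(model, [1])
print(f"x=0: {uplift_x0:+.3f}   x=1: {uplift_x1:+.3f}")
```

The interaction term is what lets the predicted treatment effect differ between customers; without Xi*Ti the model could only shift every customer's log-odds by the same amount.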


    jamesbeerbower Member Posts: 3 Contributor I

    Looks like I will have to try to answer my own question ;).  There is no general agreement on what to call the analysis of the difference between control group and target group in data mining.  Some of the terms are:

        * uplift modelling
        * differential response analysis
        * incremental modelling
        * incremental impact modelling
        * true response modelling
        * true lift modelling
        * proportional hazards modelling
        * net modelling.

    There is an FAQ on uplift modelling at http://scientificmarketer.com/2007/09/uplift-modelling-faq.html

    "Using Control Groups to Target on Predicted Lift:" (2007) gives an overview of the techniques.


    Jamie Beerbower
    Hochheim am Main
    IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder

    I have scanned through the description in your first post and, as far as I can see, you can set up all the steps in RapidMiner, if that is any help to you. You will, however, have to use some FeatureGeneration operators, several example filters, and one or two merges. For the average calculation in the deciles, you would have to use a discretization together with an aggregation operator.
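
Outside RapidMiner, the "discretization together with aggregation" idea for the decile averages can be sketched in a few lines of Python. This is only an illustration and not a RapidMiner process: the function name, the record layout `(predicted_uplift, treated_flag, response)` and the toy data are assumptions.

```python
def uplift_by_decile(records, n_bins=10):
    """records: iterable of (predicted_uplift, treated_flag, response).

    Returns one observed treatment-minus-control difference per bin,
    best-ranked bin first; None where a bin lacks one of the two groups.
    """
    # "Discretization": rank by predicted uplift and cut into equal-sized bins.
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    n = len(ranked)
    results = []
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        # "Aggregation": observed mean response per group within the bin.
        treated = [y for _, t, y in chunk if t == 1]
        control = [y for _, t, y in chunk if t == 0]
        if treated and control:
            results.append(sum(treated) / len(treated) - sum(control) / len(control))
        else:
            results.append(None)
    return results


# Invented toy hold-out sample, split into 2 bins to keep it readable.
toy = []
for i in range(20):
    score = 1.0 - i / 20          # hypothetical model scores, descending
    treated = i % 2               # alternate treatment and control
    # Invented outcomes: only treated customers in the top half respond.
    response = 1 if (treated and i < 10) else 0
    toy.append((score, treated, response))

diffs = uplift_by_decile(toy, n_bins=2)
print(diffs)
```

A model that ranks well should show the observed difference falling from the first bin to the last.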

    Maybe this encourages you to set up the whole process but be aware that it will get, well, a bit complex  ;)

    steffen Member Posts: 347 Maven
    Hello Jamie

    Thank you for the links  :D. This is a really interesting topic.
    I want to add this one: http://en.wikipedia.org/wiki/Uplift_modelling, primarily for the terms and groups in customer segmentation.

    So little time, so much to learn...


    jamesbeerbower Member Posts: 3 Contributor I
    Hi all,

    Ingo, thanks for taking the trouble to check whether Dr. Lo's battle plan is feasible.  Before I go ahead and implement it, I need to take a look at the other strategies and (most importantly) get my head around the quality measurement strategy.

    Practically, I doubt any new interesting hypotheses can come from the process -- we (and almost everyone else in the world) simply don't have the quantity of data to look at more than one dimension (one factor) at a time.  The difference between "treated" and "untreated" is simply too small in the advertising world.
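
To put a rough number on the data-quantity concern, the standard error of a difference between two independent proportions can be worked out for the whole sample and for a single decile. The sketch below is only illustrative: the response rates (3% treated, 2% control) are invented, and the function name is an assumption; the group sizes follow the mailing in the first post.

```python
import math


def uplift_standard_error(p_treated, n_treated, p_control, n_control):
    """Standard error of the difference between two independent proportions."""
    return math.sqrt(
        p_treated * (1 - p_treated) / n_treated
        + p_control * (1 - p_control) / n_control
    )


# Response rates are invented; group sizes follow the mailing in the first post.
P_TREATED, P_CONTROL = 0.03, 0.02
se_full = uplift_standard_error(P_TREATED, 100_000, P_CONTROL, 20_000)
# One decile holds a tenth of each group.
se_decile = uplift_standard_error(P_TREATED, 10_000, P_CONTROL, 2_000)

print(f"whole sample: uplift 1.0% +/- {1.96 * se_full:.2%}")
print(f"one decile:   uplift 1.0% +/- {1.96 * se_decile:.2%}")
```

With these made-up rates, the approximate 95% margin grows from about 0.2 to about 0.7 percentage points when going from the whole sample to one decile, so a one-point uplift is already hard to pin down after a single split. The smaller control group contributes most of that uncertainty.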

    Best regards

    Jamie Beerbower
    Hochheim am Main
    Nisa Member Posts: 1 Contributor I
    Hi steffen
    I found the link which you posted very useful.
    Thanks for sharing....