# incorporating control groups

jamesbeerbower

Hi,

I'm a newbie to data mining and I'm trying to figure out how to use a control group in the analysis. For example, I have 120000 customers who are candidates to receive a mailing, of which only 100000 (randomly selected) actually receive it. The effect that we want to maximize is the difference in response between the control group and the target group. I did find a paper by Victor Lo exploring the issue:

www.sigkdd.org/explorations/issues/4-2-2002-12/lo.pdf

He has some extensive suggestions -- the essence of which is quoted below. Can anyone comment on this? Are there better or different ways? Are there tools available in RapidMiner explicitly to help with control group analysis?

Thanks!

Jamie Beerbower

"1. Include data, {Yi,Xi} from both the treatment and control groups in the analysis data set;

2. Assign a dummy variable Ti to 1 for the treatment group and 0 for the control group;

3. Divide the data set into training and hold-out samples;

4. Further divide the training sample into two sub-samples by Ti, i.e. one is treatment and the other is control;

5. Choose a variable selection method (or called feature extraction). In each sub-sample (treatment and control), use the method to narrow down your list of independent variables, Xi (often an essential step in data mining as there are normally hundreds of independent variables);

6. Take the union of the two reduced sets of independent variables from 5 and thus, the new Xi has only q elements, where q<original number of independent variables, p;

7. Multiply all independent variables, Xi, (from step 6) by Ti to form the interaction effects, Xi*Ti;

8. Choose a data mining or statistical technique for supervised learning;

9. Fit a model using Yi as the dependent variable and Xi, Ti, and Xi*Ti as independent variables;

10. Use stepwise procedure (or similar model selection procedure) to determine the best parsimonious model.

After the best model is selected, we propose the following procedure for validation using the holdout sample:

1. For each individual in the hold-out sample, compute the predicted values of expected Yi for both the treatment and control, i.e. predict E(Yi|Xi;treatment) and E(Yi|Xi;control);

2. Subtract the control value from the treatment value to estimate the treatment and control difference (in order to achieve objective (3));

3. Rank and decile the entire hold-out sample by the predicted difference;

4. In each decile, compute the observed mean value of Yi’s in the treatment group and the observed mean value of Yi’s in the control group and then take the observed difference;

5. Plot the observed difference between treatment and control by decile to validate the model;

6. The expected true lift can be measured by how much the top decile(s) perform better than random using the observed treatment and control difference from step 6."

From The True Lift Model - A Novel Data Mining Approach to Response Modeling in Database Marketing

Victor S.Y. Lo
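
To make modelling steps 2 and 7 and validation steps 1 and 2 of the quoted procedure more concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the function names (`design_row`, `predict`, `predicted_uplift`) are made up for this example, a plain linear model stands in for whatever technique is chosen in step 8, and the coefficients are hypothetical rather than fitted to real data.

```python
def design_row(x, t):
    """Build [x_1..x_q, T, x_1*T..x_q*T] for one customer (steps 2 and 7)."""
    return list(x) + [t] + [xi * t for xi in x]


def predict(intercept, weights, row):
    """Linear prediction; an assumed stand-in for the model chosen in step 8."""
    return intercept + sum(w * v for w, v in zip(weights, row))


def predicted_uplift(intercept, weights, x):
    """E(Y|X;treatment) minus E(Y|X;control), i.e. validation steps 1 and 2."""
    treated = predict(intercept, weights, design_row(x, 1))
    control = predict(intercept, weights, design_row(x, 0))
    return treated - control


# Hypothetical coefficients for q = 2 variables, ordered as
# [x1, x2, T, x1*T, x2*T]. A real model would be fitted in step 9.
INTERCEPT = 0.02
WEIGHTS = [0.01, 0.00, 0.005, 0.02, -0.01]

uplift = predicted_uplift(INTERCEPT, WEIGHTS, [1.0, 2.0])
print(uplift)
```

With a linear model like this one, the predicted difference reduces to the coefficient of Ti plus the interaction coefficients weighted by Xi, which is why the Xi*Ti terms are what let the predicted lift vary from customer to customer.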


## Answers

Looks like I will have to try to answer my own question. There is no general agreement on what to call the analysis of the difference between control group and target group in data mining. Some of the terms are:

* uplift modelling

* differential response analysis

* incremental modelling

* incremental impact modelling

* true response modelling

* true lift modelling

* proportional hazards modelling

* net modelling.

There is an FAQ on uplift modelling at http://scientificmarketer.com/2007/09/uplift-modelling-faq.html

"Using Control Groups to Target on Predicted Lift:" (2007) gives an overview of the techniques.

http://www.portraitsoftware.com/?a=10399

Jamie Beerbower

Hochheim am Main

I have scanned through the description in your first post, and as far as I can see you can set up all the steps with RapidMiner, if this is of any help to you. You will, however, have to use some FeatureGeneration operators, several example filters, and one or two merges. For the average calculation in the deciles, you would have to use a discretization together with an aggregation operator.
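
For readers who want to see the discretize-then-aggregate idea outside RapidMiner, the following plain Python sketch ranks a hold-out sample by predicted difference, cuts it into equal-sized bins, and takes the observed treatment-minus-control mean in each bin. The function name `uplift_by_decile`, the record layout, and the demo data are assumptions made for this illustration. It is not a RapidMiner process.

```python
def uplift_by_decile(records, n_bins=10):
    """records: list of (predicted_uplift, treated_flag, outcome) tuples.

    Returns one observed treatment-minus-control difference per bin,
    highest-ranked bin first; None where a bin lacks one of the groups.
    """
    # Rank the whole hold-out sample by predicted difference, best first.
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    n = len(ranked)
    results = []
    for b in range(n_bins):
        # Equal-frequency binning (the "discretization" step).
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        treated = [y for _, t, y in chunk if t == 1]
        control = [y for _, t, y in chunk if t == 0]
        # Per-bin means and their difference (the "aggregation" step).
        if treated and control:
            diff = sum(treated) / len(treated) - sum(control) / len(control)
        else:
            diff = None
        results.append(diff)
    return results


# Tiny made-up hold-out sample, split into 2 bins instead of 10.
demo = [
    (0.9, 1, 1), (0.8, 0, 0), (0.7, 1, 1), (0.6, 0, 0),
    (0.4, 1, 0), (0.3, 0, 0), (0.2, 1, 1), (0.1, 0, 1),
]
demo_result = uplift_by_decile(demo, n_bins=2)
print(demo_result)
```

Returning `None` for a bin that contains only treated or only control customers is a design choice of this sketch; with a 100000 versus 20000 split as in the first post, that should be rare as long as each bin is reasonably large.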

Maybe this encourages you to set up the whole process, but be aware that it will get, well, a bit complex.

Cheers,

Ingo

Thank you for the links. This is a really interesting topic.

I want to add this one: http://en.wikipedia.org/wiki/Uplift_modelling, primarily for the terms and groups in customer segmentation.

So little time, so much to learn...

greetings

Steffen

Ingo, thanks for taking the trouble to check whether Dr. Lo's battle plan is feasible. Before I go ahead and implement it, I need to take a look at the other strategies and (most importantly) get my head around the quality measurement strategy.

Practically, I doubt any new interesting hypothesis can come from the process -- we (and almost everyone else in the world) simply don't have the quantity of data to look at more than one dimension (one factor) at a time. The difference between "treated" and "untreated" is simply too small in the advertising world.

Best regards,

Jamie Beerbower

Hochheim am Main

I found the link which you posted very useful.

Thanks for sharing....