How to make comparisons between treatment and control groups using "Data to Similarity"?

bshockley Member Posts: 4 Contributor I
Hi Rapidminers,

I don't have an implementation with XML code yet, but I'm working on applying models that deal with treatment effects, for example the effect of drinking wine on lifespan, or the effect of higher ticket prices on attendance at a concert. To do this I'd like to create a kind of matched sample where I match each example from the control group to its most similar example from the treatment group, and compare their outcomes.

Assume computation time is not a concern for the solution.

My plan is to use Data to Similarity to identify the most similar examples across the treatment and control groups, and then apply Generate Attributes to calculate the difference between the two outcome values in each pair (lifespan, sales, etc.). Ultimately, the objective is to estimate a treatment effect for each individual.

I know Data to Similarity will provide me with the most similar examples in the example set, but from there I'm trying to determine:
  1. How can I find the most similar match from another group? (i.e., if I take one example from the control group, what is the most similar example from the treatment group)
  2. Once I have each pair of IDs how can I put this back into the process to make comparisons within each pair?
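To make the two questions above concrete, here is a minimal sketch of the matching idea outside RapidMiner, in Python with NumPy. It is not a RapidMiner process and not the method from any particular paper. It assumes numeric attributes, Euclidean distance on standardized values, and matching with replacement (one treatment example can be matched to several control examples). The function name `match_controls_to_treated` and the toy data are made up for illustration.

```python
import numpy as np

def match_controls_to_treated(X_control, X_treated, y_control, y_treated):
    """For each control row, find the nearest treated row and return
    (index of matched treated row, outcome difference treated - control)."""
    X_control = np.asarray(X_control, dtype=float)
    X_treated = np.asarray(X_treated, dtype=float)

    # Standardize with pooled mean/std so no single attribute dominates the distance.
    pooled = np.vstack([X_control, X_treated])
    mu = pooled.mean(axis=0)
    sigma = pooled.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant attributes
    Zc = (X_control - mu) / sigma
    Zt = (X_treated - mu) / sigma

    # Cross-group distance matrix of shape (n_control, n_treated).
    dist = np.linalg.norm(Zc[:, None, :] - Zt[None, :, :], axis=2)

    # Question 1: most similar treated example for each control example.
    nearest = dist.argmin(axis=1)

    # Question 2: compare outcomes within each matched pair.
    diff = np.asarray(y_treated, dtype=float)[nearest] - np.asarray(y_control, dtype=float)
    return nearest, diff

# Toy data (hypothetical): attributes are [age, smoker flag], outcome is lifespan.
X_control = [[30, 1], [60, 0]]
X_treated = [[31, 1], [58, 0], [45, 1]]
y_control = [70, 80]
y_treated = [75, 78, 72]

nearest, diff = match_controls_to_treated(X_control, X_treated, y_control, y_treated)
print(nearest)  # matched treated row per control row
print(diff)     # per-pair outcome difference (treated minus control)
```

The key design choice is that distances are computed only between the two groups, never within a group, which is what restricts each match to "the other group". In a RapidMiner process the analogous step would presumably be to keep only cross-group rows of the similarity output before picking the closest pair, but that part is an assumption and untested.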

This is my first post. Sorry if anything is unclear; I'm happy to provide more detail if helpful.



    Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    According to classical statistical principles, a matched-pair structure in design of experiments is typically used only when you have two observations of the same subject at two different points in time, or in similar circumstances such as twin studies. So why do you need to pair individual records if you are simply trying to estimate the overall effect of a treatment? As long as you have sufficient numbers in both groups and the sample is constructed in a random, representative way, classical statistical theory says you can estimate the effects in question directly through traditional methods such as ANOVA or multivariate models such as regression analysis. All of this can be done directly in RapidMiner without the additional step of creating matched pairs.
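As a rough illustration of the regression route mentioned above, the following Python/NumPy sketch fits an ordinary least squares model with a treatment indicator and one covariate, and reads the average treatment effect off the indicator's coefficient. It is a sketch under assumptions, not a RapidMiner process: the function name `ols_treatment_effect` and the toy data are invented, and the outcome is constructed to be exactly linear so the result is easy to check.

```python
import numpy as np

def ols_treatment_effect(treated, covariates, outcome):
    """Estimate an average treatment effect as the OLS coefficient on a 0/1
    treatment indicator, adjusting linearly for the given covariates."""
    treated = np.asarray(treated, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    if covariates.ndim == 1:
        covariates = covariates[:, None]
    # Design matrix: intercept, treatment indicator, covariates.
    X = np.column_stack([np.ones(len(treated)), treated, covariates])
    coef, *_ = np.linalg.lstsq(X, np.asarray(outcome, dtype=float), rcond=None)
    return coef[1]

# Toy data (hypothetical): outcome = 2 + 3 * treated + 0.5 * age, with no noise.
treated = np.array([0, 1, 0, 1, 0, 1])
age = np.array([20, 25, 30, 35, 40, 45])
outcome = 2 + 3 * treated + 0.5 * age

effect = ols_treatment_effect(treated, age, outcome)
print(round(effect, 6))
```

This gives a single overall effect rather than one estimate per individual, which is the main difference from the matched-pair approach asked about in the original question.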
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
    bshockley Member Posts: 4 Contributor I
    edited May 2020
    Hi Telcontar,

    Thanks for your reply. I'm working on a causal inference problem to estimate individual treatment effects and this requires me to calculate a "counterfactual" for each treatment observation. The method I'm using is explained in Athey, Imbens, and Ramachandra (2015); there's an implementation that uses matching for the estimate of the "what if" scenario.

    There's also a decision tree method that's closer to the approach you described, but I'm interested in comparing both to see how they differ.
