  • Determining the adequacy of a propensity score for matching participants

    I received the following private question about matching; I’m posting the question and my reply here so that others may weigh in.


    Dear Melissa,

    I’m writing a private message rather than a public post, as the focus of my question is on matching itself rather than on Stata commands for matching. I was wondering whether you have some free time to help me clarify a few of the (for me, rather complicated) issues I describe below.

    I’m trying to test two hypotheses (H1 and H2 below) to see whether two spatially separated participatory programs had an effect on the attitudes and knowledge of local people. I’m measuring the ATT (average treatment effect on the treated), and my outcomes are both categorical (attitudes, on a scale from 1 to 5) and continuous (knowledge, 0 to 8 points).

    Here are the brief details of the hypotheses.
    H1: compare the effects of participation in either of the two programs versus non-participation.
    H2: compare the effects of participation in program A versus participation in program B.

    Therefore, I have two treatment variables (one for each hypothesis). I also have two different probit models to estimate two different propensity scores (one for each hypothesis). These probit models have different sample sizes, as for the second hypothesis I use only a sub-sample of participants and look at differences within that subsample.

    I’m using the user-written commands psmatch2 and pscore in Stata 12.

    Here are my doubts:

    1) For testing H1, I have 93 non-participants and 210 participants. Do you think matching is the way to go here, given that I have a comparatively small number of controls?
    *** Matching may or may not work for you; it will depend on how many of your non-participants and participants are within the range of common support. If you end up with a very small fraction of your non-participants being used as matches for your participants, you might want to consider other methods of analysis.
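
    As a rough sketch of how I would check this (abbreviating your covariate list for illustration; psmatch2 creates the _support indicator when common support is imposed, and psgraph ships with the same package):

    * Match with common support imposed, then see who falls outside it
    psmatch2 treat_V VfordistN tr_wakillN2 sc_viltrustN1, common n(3)
    tab treat_V _support
    * Histogram of the propensity scores by treatment status and support
    psgraph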

    2) If yes, which type of matching would be theoretically best in this situation? I’m using kernel matching and nearest-neighbour matching with replacement and three neighbours, as these two have the best matching quality (evaluated with the pstest command). Here is the command line:
    psmatch2 treat_V VfordistN tr_wakillN2 sc_viltrustN1 hh_headEDUY hh_waterN tr_compN district pw1 satmobtv, outcome(trknowledge tigerlake bd_otherwildlifelikeN bd_forestlikeY) com n(3)
    *** There is no one “best” type of matching. I tend to evaluate a few methods and choose the one that minimizes bias without sacrificing sample size.
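
    As an illustration of that workflow (covariate list abbreviated here), run each candidate algorithm and then pstest, comparing the standardized bias and the matched sample sizes:

    * Nearest-neighbour matching with three neighbours, then a balance check
    psmatch2 treat_V VfordistN tr_wakillN2 sc_viltrustN1, common n(3)
    pstest VfordistN tr_wakillN2 sc_viltrustN1, both
    * Kernel matching on the same specification, then the same balance check
    psmatch2 treat_V VfordistN tr_wakillN2 sc_viltrustN1, common kernel
    pstest VfordistN tr_wakillN2 sc_viltrustN1, both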

    3) If not, what would be an alternative for removing selection bias and supporting causal inference?
    *** If you have multiple time points of data, you might consider a difference-in-difference model.
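
    A minimal sketch of what that would look like, assuming hypothetical panel data with an outcome y, a pre/post indicator post, and a cluster identifier id (none of which exist in your data as described):

    * Difference-in-differences via a treatment-by-period interaction;
    * the coefficient on 1.treat_V#1.post is the DiD estimate
    regress y i.treat_V##i.post, vce(cluster id)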

    4) With the current propensity score specification, I have 48 treated cases off support. Is that too many relative to my overall sample? Should I try to refit my probit model?
    *** I would try a different specification of your probit model. Your goal is to get a treatment and comparison group that are roughly equivalent on observed covariates so that you can isolate the effect of your treatment. If you have many off support cases, you will have a very limited population to which you may generalize your results.
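
    A sketch of one way to iterate on the specification, supplying a re-estimated score to psmatch2 (the squared education term is purely illustrative, not a recommendation for these covariates):

    * Re-estimate the propensity score with a different functional form
    probit treat_V VfordistN tr_wakillN2 c.hh_headEDUY##c.hh_headEDUY
    predict ps_alt, pr
    * Match on the supplied score and count treated cases off support
    psmatch2 treat_V, pscore(ps_alt) common n(3)
    count if treat_V == 1 & _support == 0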

    5) Since I have never come across categorical outcomes in papers that use matching, I’m wondering whether it makes sense to measure the ATT on categorical outcomes.
    *** Yes, you can use categorical outcomes – the propensity score matching process just makes your treatment and comparison groups more similar to each other. You can evaluate a variety of outcomes after matching observations.
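
    One way to do this, sketched with the _weight variable that psmatch2 creates for matched observations (I am assuming tigerlake is one of your 1-to-5 attitude items; an ordered logit is just one option, and the simple mean difference psmatch2 reports is also defensible):

    * Ordered logit on the matched sample, weighted by the match weights
    ologit tigerlake treat_V if _support == 1 [pweight = _weight]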

    6) For testing H2: is it OK to use matching as I’m comparing two treatments rather than a treatment and a control?
    *** Yes, that is fine. An alternative would be to evaluate H1 and H2 within a single model, but you would need Stata 13 for that. See the manual entry on -teffects multivalued-; you can use inverse probability of treatment weighting to do both comparisons within a single model.
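
    A sketch of that single-model approach, assuming a hypothetical three-level variable program (0 = non-participant, 1 = program A, 2 = program B) recoded from your two treatment indicators:

    * Stata 13: IPW with a multivalued treatment (multinomial logit by default);
    * the default output contrasts each program with non-participation (H1-style)
    teffects ipw (trknowledge) (program VfordistN tr_wakillN2 sc_viltrustN1)
    * ATT of program A versus program B (H2), with program B as the control level
    teffects ipw (trknowledge) (program VfordistN tr_wakillN2 sc_viltrustN1), atet control(2) tlevel(1)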

  • #2
    Dear Melissa,

    many thanks for the quick reply and for posting this publicly for me! Here are some clarifications (in caps):


    1) For testing H1, I have 93 non-participants and 210 participants. Do you think matching is the way to go here, given that I have a comparatively small number of controls?
    *** Matching may or may not work for you; it will depend on how many of your non-participants and participants are within the range of common support. If you end up with a very small fraction of your non-participants being used as matches for your participants, you might want to consider other methods of analysis.
    - ALL NON-PARTICIPANTS ARE IN THE COMMON SUPPORT REGION; ONLY SOME OF THE PARTICIPANTS ARE OFF SUPPORT (DETAILS UNDER POINT 4), SO I GUESS MATCHING IS APPLICABLE IN THIS CASE

    2) If yes, which type of matching would be theoretically best in this situation? I’m using kernel matching and nearest-neighbour matching with replacement and three neighbours, as these two have the best matching quality (evaluated with the pstest command). Here is the command line:
    psmatch2 treat_V VfordistN tr_wakillN2 sc_viltrustN1 hh_headEDUY hh_waterN tr_compN district pw1 satmobtv, outcome(trknowledge tigerlake bd_otherwildlifelikeN bd_forestlikeY) com n(3)
    *** There is no one “best” type of matching. I tend to evaluate a few methods and choose the one that minimizes bias without sacrificing sample size.
    - THAT IS EXACTLY WHAT I DID: I ESTIMATED THE MATCHING QUALITY AND SELECTED THE MATCHING ALGORITHM WITH THE SMALLEST VARIANCE, SMALLEST BIAS, AND SMALLEST NUMBER OF CASES OUTSIDE THE SUPPORT REGION.

    3) If not, what would be an alternative for removing selection bias and supporting causal inference?
    *** If you have multiple time points of data, you might consider a difference-in-difference model.
    - I ONLY HAVE THE DIRECTION OF CHANGE (INCREASE, DECREASE, REMAINED THE SAME), NOT THE EXACT VALUES, SO PERHAPS APPLYING D-IN-D WOULD NOT MAKE MUCH SENSE IN MY CASE?

    4) With the current propensity score specification, I have 48 treated cases off support. Is that too many relative to my overall sample? Should I try to refit my probit model?
    *** I would try a different specification of your probit model. Your goal is to get a treatment and comparison group that are roughly equivalent on observed covariates so that you can isolate the effect of your treatment. If you have many off support cases, you will have a very limited population to which you may generalize your results.
    - I SPECIFIED THE PROBIT MODEL DIFFERENTLY AND NOW HAVE ONLY(?) 29 TREATED CASES (OUT OF 213) AND 0 CONTROL CASES OUTSIDE THE COMMON SUPPORT. THESE OFF-SUPPORT CASES ARE THE ONES WITH THE HIGHEST PROPENSITY SCORES... DO I HAVE TO BE WORRIED HERE (OR, IN OTHER WORDS: WHAT EXACTLY DOES “MANY OFF-SUPPORT CASES” MEAN)?

    5) Since I have never come across categorical outcomes in papers that use matching, I’m wondering whether it makes sense to measure the ATT on categorical outcomes.
    *** Yes, you can use categorical outcomes – the propensity score matching process just makes your treatment and comparison groups more similar to each other. You can evaluate a variety of outcomes after matching observations.
    - THANKS!

    6) For testing H2: is it OK to use matching as I’m comparing two treatments rather than a treatment and a control?
    *** Yes, that is fine. An alternative would be to evaluate H1 and H2 within a single model, but you would need Stata 13 for that. See the manual entry on -teffects multivalued-; you can use inverse probability of treatment weighting to do both comparisons within a single model.
    - THANKS. I’LL CHECK THE MANUAL


    Biljana Macura, University of Padova, Italy

    • #3
      Hi,
      Regarding points 1 and 4, I'd be concerned about having so many of your treated individuals off support. With the different propensity score specifications you've tried, you end up with several treated individuals with a very high propensity score (likelihood of treatment given the observed confounders). When so many are off support, this means that there are not adequate matches in the comparison group from which to estimate the counterfactual.

      Are any of the variables in your propensity score not potential confounders (that is, did you accidentally include variables that are hypothesized to predict treatment but not outcome, or variables that occur after the treatment)? If so, you could try removing those variables (which shouldn't be in the propensity score model anyway) to see if you get a larger proportion of your sample within the range of common support. There is a distinct possibility, however, that you just have a dataset where you are not able to compare similar individuals with and without the treatment.
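
      A sketch of that check: drop the suspect variable, re-estimate the score, and compare the score distributions by treatment status (which variable to drop is a substantive call; the covariates kept here are placeholders):

      * Re-estimate the propensity score without the suspect predictor
      probit treat_V VfordistN sc_viltrustN1 hh_headEDUY
      predict ps_trim, pr
      * Overlay the score densities for treated and comparison groups
      twoway (kdensity ps_trim if treat_V == 1) (kdensity ps_trim if treat_V == 0), legend(order(1 "Treated" 2 "Comparison"))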

      Hope this helps,
      Melissa

      • #4
        Thanks, Melissa, for your very helpful feedback! I'll go back to my model and re-check my variables according to your suggestion... Best, Biljana
        Biljana Macura, University of Padova, Italy
