  • Propensity Score Matching

    Hi,

    I'm doing propensity score matching using the psmatch2 command in Stata.
    My cohort consists of 17,435 patients, of whom 8,474 (49%) received treatment and 8,961 (51%) did not.

    After using the psmatch2 command and nearest-neighbor matching (caliper 0.2), I end up with a cohort consisting of only 4,584 patients, i.e. only 26% of my total cohort.
    Does anyone know what the problem might be? How many patients do I minimally need after propensity score matching? I've lost 19 patients due to very high age predicting success perfectly.

    Thanks!

  • #2
    Reinout,

    After using the psmatch2 command and nearest-neighbor matching (caliper 0.2), I end up with a cohort consisting of only 4,584 patients, i.e. only 26% of my total cohort.
    Does anyone know what the problem might be?
    Without further knowledge about your data, it is hard to tell what exactly the problem is. One likely problem, however, is that you do not have good matches for all of your treatment patients. A way to find out whether this is the problem is to simply omit the caliper option.

    How many patients do I minimally need after propensity score matching?
    From a purely statistical perspective, 4,584 observations should be more than enough to identify a potential treatment effect. However, the reduction in your sample size is a problem if there is a systematic reason, linked to the treatment, that excludes individuals from your sample. When doing nearest-neighbor propensity score matching, it is not unusual for the sample size to shrink, because one typically does not need all control observations, only those that serve as good matches to the treated observations (at least if you're trying to estimate the ATT). Often, researchers doing PSM have a large control group and only a (relatively) small treatment group, so they can find a good match for every treatment observation. This is clearly not the case with your data. To check whether this is a real problem, I would start by checking whether your result changes when you try different calipers (or no caliper at all).
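    To see mechanically why a tight caliper shrinks the matched sample, here is a hypothetical sketch in Python of greedy 1:1 nearest-neighbor matching (not psmatch2's exact algorithm; the scores are made up):

```python
# Greedy 1:1 nearest-neighbor matching on the propensity score,
# without replacement. Illustrates why a caliper drops treated
# units that have no control within the allowed distance.

def greedy_match(treated, controls, caliper=None):
    """Return (treated_score, control_score) pairs; each control used once."""
    available = sorted(controls)
    pairs = []
    for t in treated:
        if not available:
            break
        nearest = min(available, key=lambda c: abs(c - t))
        if caliper is None or abs(nearest - t) <= caliper:
            pairs.append((t, nearest))
            available.remove(nearest)
    return pairs

treated = [0.30, 0.50, 0.90]          # propensity scores, treated group
controls = [0.28, 0.55, 0.60, 0.61]   # propensity scores, control group

print(len(greedy_match(treated, controls)))               # 3: all treated matched
print(len(greedy_match(treated, controls, caliper=0.1)))  # 2: 0.90 has no close control
```

    Omitting the caliper keeps every treated unit matched to its nearest control, however far away; a caliper of 0.1 discards the treated unit at 0.90 because its nearest available control (0.61) is too distant.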



    • #3
      Thank you so much for taking the time to answer my question! After checking my data once again, I found that one variable was not formatted correctly. This made a huge difference in the number of observations!

      Now if I set my caliper to 0.01 or 0.2, my cohort drops to 10,292 (59%) patients. With caliper = 0.5 I get 12,408 (71%), and by omitting the caliper I end up with 16,914 (97%) patients.

      With the caliper set to 0.5 or by omitting the caliper, my difference-in-difference estimation graph does not line up well (graph as shown in this thread: http://www.statalist.org/forums/forum/general-stata-discussion/general/1145219-psmatch2-graph-for-propensity-score-matching). With the caliper set to 0.5, the bar chart produced by psgraph does look OK.

      It does not feel right to set the caliper to an arbitrary number. The literature states that a caliper of 0.2 is usually ideal. Might 0.01 be better in this case?



      • #4
        There is no consensus on setting the size of the caliper (or even whether to use one). Another convention, which might seem less arbitrary to you, is to use a caliper of one-quarter of a standard deviation of the propensity score.


        Stuart, E.A., and Rubin, D.B. (2008). Best Practices in Quasi-Experimental Designs: Matching Methods for Causal Inference. In J.W. Osborne (Ed.), Best Practices in Quantitative Methods (pp. 155–176). Thousand Oaks, CA: Sage.
        David Radwin
        Senior Researcher, California Competes
        californiacompetes.org
        Pronouns: He/Him
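
        As a made-up numerical illustration of this convention (one-quarter of the standard deviation of the estimated propensity scores; the scores below are hypothetical):

```python
# Caliper = one-quarter of a standard deviation of the
# (hypothetical) estimated propensity scores.
import statistics

pscores = [0.42, 0.51, 0.33, 0.47, 0.60, 0.38, 0.55, 0.44]
sd = statistics.stdev(pscores)   # sample standard deviation
caliper = 0.25 * sd
print(round(caliper, 4))
```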



        • #5
          I have a different problem than Reinout. I ran psmatch2 on a population of 1,215 observations (people) and obtained matches for 96% of the cases. Almost all the dropped cases appeared to violate the overlap assumption. Here's the command:

          psmatch2 jewish age marital raised yrslived reduc paeduc maeduc roccstat hhoccstat faminc classid chclassid sex race region areatype areasize5 , noreplacement

          I suspect most people would kill for that kind of success, but I'm wondering if it's too good to be true. My treatment variable is quite skewed, with only 14% of cases (n = 174) in the treatment group. I had 17 predictors. I experimented with different forms of matching, turned replacement on and off, and also varied the caliper. Not much changed. I also set the random-number seed. I would try to replicate this with the teffects or CEM routines, but I couldn't get either to run.

          The balance improved dramatically after matching, with the mean bias dropping from 32 to 5 and with none of the predictors exhibiting a statistically significant difference between the treatment and control groups.

          That said, I would like to match my treatment group with a matched control group of the same size. I can't find anything in the documentation that tells me how to do this. I'd appreciate any suggestions.

          Thanks.



          • #6
            You might not want to be too pleased with the diagnostic that no differences after matching were statistically significant. As Imai, King, and Stuart (2008) point out, a weakness of the t-test as a measure of balance for matching is that the t distribution is sensitive to sample size. Even randomly dropping enough observations will eventually give you an insignificant t-test.
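
            The sample-size dependence is visible directly in the two-sample t formula: holding the group means and standard deviations fixed, the statistic shrinks as observations are dropped. A hypothetical illustration in Python:

```python
# With the same mean difference and SDs, the two-sample t statistic
# scales roughly with sqrt(n): dropping observations alone can turn
# a fixed imbalance "insignificant".
import math

def t_stat(diff, sd1, sd2, n1, n2):
    """t statistic for a fixed mean difference (unequal-variance form)."""
    return diff / math.sqrt(sd1**2 / n1 + sd2**2 / n2)

# Same imbalance (mean difference 0.2, SD 1 in each group)...
t_large = t_stat(0.2, 1.0, 1.0, 1000, 1000)  # ~4.47
t_small = t_stat(0.2, 1.0, 1.0, 100, 100)    # ~1.41
print(round(t_large, 2), round(t_small, 2))
```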

            Imai, K., King, G., and Stuart, E.A. (2008). Misunderstandings Between Experimentalists and Observationalists About Causal Inference. Journal of the Royal Statistical Society: Series A, 171(2): 481–502.
            Last edited by David Radwin; 14 Mar 2017, 23:56.
            David Radwin



            • #7
              That's why I'd like to compare the treatment group to a control group of the same size, but I can't find the command to do that. If it's in the documentation, it's not clear to me, so help is appreciated.



              • #8
                Again, I'm not sure that this is the right approach, but I think psmatch2 (Leuven and Sianesi, available from SSC) applies 1:1 matching when you specify noreplacement, so you already have a control group of the same size. Here is a simple example.

                Code:
                . sysuse nlsw88, clear
                (NLSW, 1988 extract)
                
                . psmatch2 never_marr age grade race, outcome(south) noreplace
                
                [output suppressed]
                
                . tab _weight _treated
                
                 psmatch2: |
                 weight of |  psmatch2: Treatment
                   matched |      assignment
                  controls | Untreated    Treated |     Total
                -----------+----------------------+----------
                         1 |       234        234 |       468
                -----------+----------------------+----------
                     Total |       234        234 |       468
                David Radwin



                • #9
                  Originally posted by David Radwin View Post
                  Another convention, which might seem less arbitrary to you, is to use a caliper of one-quarter of a standard deviation of the propensity score.
                  How do I estimate the SD of the propensity score? By using logit treatment independentvariables?



                  • #10
                    Yes. Another silly example:
                    Code:
                    . sysuse nlsw88
                    (NLSW, 1988 extract)
                    
                    . logit south never_married age grade race
                    
                    Iteration 0:   log likelihood = -1526.0955  
                    Iteration 1:   log likelihood = -1466.6225  
                    Iteration 2:   log likelihood = -1466.5559  
                    Iteration 3:   log likelihood = -1466.5559  
                    
                    Logistic regression                             Number of obs     =      2,244
                                                                    LR chi2(4)        =     119.08
                                                                    Prob > chi2       =     0.0000
                    Log likelihood = -1466.5559                     Pseudo R2         =     0.0390
                    
                    -------------------------------------------------------------------------------
                            south |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                    --------------+----------------------------------------------------------------
                    never_married |  -.3487969   .1495241    -2.33   0.020    -.6418588   -.0557351
                              age |   .0096247    .014428     0.67   0.505    -.0186536     .037903
                            grade |  -.0507879   .0178795    -2.84   0.005    -.0858311   -.0157446
                             race |   .9283734   .0954632     9.72   0.000      .741269    1.115478
                            _cons |  -1.200636   .6449389    -1.86   0.063    -2.464693    .0634208
                    -------------------------------------------------------------------------------
                    
                    . predict pscore
                    (option pr assumed; Pr(south))
                    (2 missing values generated)
                    
                    . summarize pscore
                    
                        Variable |        Obs        Mean    Std. Dev.       Min        Max
                    -------------+---------------------------------------------------------
                          pscore |      2,244    .4193405    .1135183    .231764   .8806233
                    
                    . local caliper = sqrt(r(sd))/4
                    
                    . display `caliper'
                    .08423118
                    David Radwin



                    • #11
                      Incredible, thank you so much!



                      • #12
                        Originally posted by David Radwin View Post
                        Yes. Another silly example:
                        Code:
                        . local caliper = sqrt(r(sd))/4
                        
                        . display `caliper'
                        .08423118
                        I understand Austin (2011) recommends a caliper that equals 0.2 (and Rosenbaum and Rubin (1985) recommend a caliper of 0.25) of the standard deviation of the LOGIT of the propensity score.

                        Garrido et al. (2014) calculate the caliper as follows (see footnote 1):
                        gen logitpscore = ln(mypscore/(1-mypscore))
                        However, when I calculate the logit of the propensity score as suggested by Garrido et al. (2014), the standard deviation is 1.56663, and the caliper is 0.2 SD = 0.313326 or 0.25 SD = 0.3916575, both of which are very high, and the results from the matched sample show a large bias (judged by Rubin's B (%) and R).
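
                        The logit-based caliper described above can be sketched as follows (Python, hypothetical propensity scores):

```python
# Caliper = 0.2 SD of the logit of the propensity score
# (the Austin 2011 convention). Scores are hypothetical.
import math
import statistics

pscores = [0.42, 0.51, 0.33, 0.47, 0.60, 0.38, 0.55, 0.44]
logits = [math.log(p / (1 - p)) for p in pscores]  # logit transform
caliper = 0.2 * statistics.stdev(logits)
print(round(caliper, 4))
```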

                        Looking at the numbers from your example, 0.08423118 doesn't seem to be 20% of the estimated standard deviation (0.1135183); it's more like 74% of it. I'm confused as to why you are calculating the caliper as "sqrt(r(sd))/4". Could you please elaborate a little?

                        Also, should this caliper be calculated before or after dropping observations outside of the region of common support as identified by the user-written Stata command -pscore-?

                        Thank you in advance for your time!
                        Last edited by Cesar Augusto; 30 May 2017, 15:44.



                        • #13
                          The caliper value calculated in that example is not supposed to be 20% of the standard deviation of the propensity scores. It's supposed to be 25% of the square root of the standard deviation of the propensity scores.

                          As I wrote earlier, "There is no consensus on setting the size of the caliper. . . ." So it is not surprising that a different formula from a different source yields a different caliper value. I don't think this particular method is right or wrong, but merely that it is another convention.

                          I'll break down the steps of this particular formula (without worrying about significant digits):

                          1. SD = .1135183
                          2. sqrt(.1135183) = 0.3369247690508966
                          3. 0.3369247690508966/4 = 0.0842311922627241
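
                          The same arithmetic, checked in Python:

```python
# Reproduce the caliper calculation from the steps above.
import math

sd = 0.1135183            # step 1: SD of the propensity score
step2 = math.sqrt(sd)     # step 2
caliper = step2 / 4       # step 3
print(round(caliper, 8))  # 0.08423119
```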

                          I don't know how pscore determines the area of common support or whether to calculate the caliper using observations outside common support or not. I don't recall any discussion of this in the literature. The silly example I presented does include observations outside common support (if there are any).
                          David Radwin



                          • #14
                            Originally posted by David Radwin View Post
                            The caliper value calculated in that example is not supposed to be 20% of the standard deviation of the propensity scores. It's supposed to be 25% of the square root of the standard deviation of the propensity scores. [...]

                            Originally posted by Reinout Heijboer View Post
                            How do I estimate the SD of the propensity score? By using logit treatment independentvariables?
                            Thank you for the response, David!

                            It makes sense now. I was confused about what you were doing in that example, since you mentioned "one-quarter of a standard deviation of the propensity score" in a previous post (also, I meant to write 25%, not 20%), but it is clear now.

                            With regards to estimating the caliper only with those (or not) observations in the region of common support, I have also not been able to find any discussion of this in the literature. It seems like an important step though. I'll reply to this post if I'm able to find an answer to this question.

                            Thank you again for the response!

                            Best,

                            Cesar Augusto
                            Last edited by Cesar Augusto; 31 May 2017, 13:33.



                            • #15
                              Given that you have equal numbers of cases and controls to start with, I would suggest that you use inverse propensity score weighting instead of matching, if your analysis allows. Matching may be a good strategy when you have a very large number of potential controls relative to cases: you can find a control group that looks like your treatment group. But if there are equal numbers of each, you will typically end up trading off the number of cases against match closeness, as you've learned above. Meanwhile, simulation studies have shown that the two methods are equally effective in reducing bias (see e.g. https://www.ncbi.nlm.nih.gov/pubmed/19684288).
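
                              A minimal sketch of that weighting, assuming ATE-style weights (1/p for treated units, 1/(1-p) for controls) and hypothetical propensity scores; real applications usually trim or stabilize extreme weights:

```python
# Inverse propensity score weighting: instead of discarding
# unmatched units, every observation is reweighted.

def ipw_weight(treated, pscore):
    """ATE weight: 1/p for treated, 1/(1-p) for controls."""
    return 1.0 / pscore if treated else 1.0 / (1.0 - pscore)

sample = [(1, 0.8), (1, 0.5), (0, 0.5), (0, 0.2)]  # (treated, pscore)
weights = [ipw_weight(t, p) for t, p in sample]
print([round(w, 2) for w in weights])  # [1.25, 2.0, 2.0, 1.25]
```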

                              hth,
                              Jeph

