  • Propensity score matching (psmatch2) for clustered data

    Hello everyone,

    I'm trying to implement propensity score matching (PSM) with clustered data. In brief, an intervention was implemented in socio-economically different study sites. For each of these study sites, I have a (fairly large) pool of individuals from which I would like to select appropriate control individuals using PSM.

    The issues I'm facing are touched on in an older entry on here but weren't resolved (https://www.statalist.org/forums/for...-with-psmatch2).

    I'm implementing PSM with psmatch2 in a way referred to as the 'within approach' by Bruno Arpino in this presentation and related work: https://www.stata.com/meeting/spain1...n18_Arpino.pdf. Essentially, it loops through the study sites and matches within each one, which gives perfect balance on study site-level characteristics. The (simplified) code looks like this:

    Code:
    * Obtain propensity scores
    logistic treatment x1 x2 x3
    predict pscore, pr
    
    * Set caliper to 0.25 x SD of the propensity score
    * (summarize is needed so that r(sd) is defined)
    quietly summarize pscore
    scalar cal = r(sd)*0.25
    
    * Execute PSM per site
    gen weight = .
    gen att = .
    levelsof site, local(slist)
    foreach s of local slist {
        psmatch2 treatment if site == `s', pscore(pscore) caliper(`=scalar(cal)') outcome(outcome)
        * psmatch2 recreates _weight on every call, so copy it before the next site
        replace weight = _weight if site == `s'
        replace att = r(att) if site == `s'
    }
    After PSM, I would like to run a regression adjusting for the fact that individuals are clustered within sites. Here, for a binary outcome, I'm using logistic with cluster-robust standard errors (other approaches may be possible, like GEE):

    Code:
    logistic outcome i.treatment [fweight=weight] if !mi(weight), cluster(site)
    One can show that this regression and psmatch2 give the same estimate of the absolute difference between treatment and control individuals:

    Code:
    margins i.treatment, pwcompare(effect)
    sum att [fweight=weight] if !mi(weight) & treatment == 1
    However, the SEs of the regression model are inaccurate because they do not take into account that the propensity scores were estimated. I know that the SEs from psmatch2, as I use it in the code above, are also not correctly estimated, which could be improved through the ai() option or by using teffects psmatch instead. However, this doesn't resolve the issue that I need a separate regression to account for the clustering.
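
    A minimal sketch of the teffects psmatch alternative, run on simulated placeholder data (the data-generating lines and all coefficient values are invented for illustration; only the variable names follow the code above). Site enters the propensity-score model as a factor variable rather than through a per-site loop, and the caliper is omitted to keep the example short. The reported standard errors are the Abadie-Imbens ones, which reflect the estimation of the propensity score but not the clustering within sites:

    ```stata
    * Simulated placeholder data (assumption: 4 numeric sites, 500 individuals each)
    clear
    set seed 12345
    set obs 2000
    gen site = ceil(_n/500)
    gen x1 = rnormal()
    gen x2 = rnormal()
    gen x3 = runiform()
    gen byte treatment = runiform() < invlogit(-0.5 + 0.5*x1 - 0.3*x2 + 0.2*site)
    gen byte outcome = runiform() < invlogit(-0.2 + 0.4*treatment + 0.3*x1)

    * ATT from 1:1 propensity-score matching; logit is the default treatment model
    teffects psmatch (outcome) (treatment x1 x2 x3 i.site), atet nneighbor(1)
    ```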

    I am able to obtain good balance between treatment and control groups with this approach, and similarly if I just use study site as a variable when estimating propensity scores instead of doing the 'within approach'. But it seems to me that I can either account for the fact that propensity scores are estimated (say, by using teffects) or account for clustering of the data (in a separate regression with cluster-robust SEs), but not both. Now my questions:

    1) Is there a way to adjust SEs for the estimation of propensity scores AND clustering in the data? What are your recommendations in this situation?
    2) If there is no solution for this, what do you think the consequences are? It appears to me that, in my approach with the regression, SEs tend to be overestimated compared to teffects. Can one argue that this makes the regression results conservative?
    3) Unrelated to this problem, something else I was wondering about for this 'within approach': Do you think one should estimate the propensity scores for each site, i.e. have the estimation within the loop, instead of estimating propensity scores for the sample as a whole before the PSM in each site? It probably doesn't matter much if you include study site as a variable when estimating propensity scores and still do the PSM per site.
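
    Regarding question 3, a sketch of what moving the estimation inside the loop might look like, again on simulated placeholder data (invented values; pscore_site, weight_site and ps_tmp are hypothetical names). It assumes psmatch2 is installed from SSC and that every site contains both treated and untreated individuals. The caliper is also recomputed within each site here, which is one possible choice, not the only one:

    ```stata
    * Simulated placeholder data (assumption: 4 numeric sites, 500 individuals each)
    clear
    set seed 2024
    set obs 2000
    gen site = ceil(_n/500)
    gen x1 = rnormal()
    gen x2 = rnormal()
    gen x3 = runiform()
    gen byte treatment = runiform() < invlogit(-0.5 + 0.5*x1 - 0.3*x2)
    gen byte outcome = runiform() < invlogit(-0.2 + 0.4*treatment + 0.3*x1)

    gen double pscore_site = .
    gen double weight_site = .
    levelsof site, local(slist)
    foreach s of local slist {
        * Propensity-score model fitted on this site only
        quietly logit treatment x1 x2 x3 if site == `s'
        quietly predict double ps_tmp if e(sample), pr
        * Site-specific caliper: 0.25 x SD of this site's propensity scores
        quietly summarize ps_tmp
        local cal = 0.25*r(sd)
        psmatch2 treatment if site == `s', pscore(ps_tmp) caliper(`cal') outcome(outcome)
        * Keep this site's scores and matching weights before the next iteration
        quietly replace pscore_site = ps_tmp if site == `s'
        quietly replace weight_site = _weight if site == `s'
        drop ps_tmp
    }
    ```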

    I would be very grateful for any insight you can offer! Happy to provide more information on the data or my code but my questions are more about the general approach to this problem.
    Thank you!
    Robin




  • #2
    This is an interesting problem and one for which I doubt that there is a simple, clear answer. In any case, maybe I can contribute to the discussion on it.

    I would personally think that you should do everything by site (both propensity score matching and OLS) or do everything in a single equation, but include fixed effects at every step of the process. Out of curiosity, was this a cluster-random sample? Or was it an SRS in multiple sites? I found this article very helpful on the topic.

    My gut instinct would be to calculate the propensity score for the entire sample, including fixed effects by site, rather than calculate it separately for each site. If the number of sites is not too great, you could also interact with site any variables that you think influenced participation in the program differently across sites. Then run your weighted regression with both site fixed effects and cluster-robust standard errors.
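
    A rough sketch of this suggestion on simulated placeholder data (invented values; ps_fe is a hypothetical name; psmatch2 from SSC is assumed to be installed). Site fixed effects enter the propensity-score model as i.site, matching is done once on the pooled sample, and the outcome regression on the matched sample includes site fixed effects with cluster-robust standard errors. The frequency weights follow the original post and assume that psmatch2's _weight takes integer values, which should hold for one-to-one nearest-neighbour matching:

    ```stata
    * Simulated placeholder data (assumption: 20 numeric sites, 200 individuals each)
    clear
    set seed 777
    set obs 4000
    gen site = ceil(_n/200)
    gen x1 = rnormal()
    gen x2 = rnormal()
    gen x3 = runiform()
    gen byte treatment = runiform() < invlogit(-0.5 + 0.5*x1 - 0.3*x2 + site/20)
    gen byte outcome = runiform() < invlogit(-0.2 + 0.4*treatment + 0.3*x1)

    * Propensity score for the whole sample, with site fixed effects
    logit treatment x1 x2 x3 i.site
    predict double ps_fe, pr

    * One-to-one nearest-neighbour matching on the pooled sample
    psmatch2 treatment, pscore(ps_fe) outcome(outcome)

    * Weighted outcome regression with site fixed effects and cluster-robust SEs
    logit outcome i.treatment i.site [fweight = _weight] if !missing(_weight), vce(cluster site)
    ```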

    Hope someone else more knowledgeable can add to this interesting question.



    • #3
      Thanks for your input, Jonathan. That's an excellent article indeed!

      The data come from a cluster-randomised trial that examined the effects of cash transfers on children's school attendance and health outcomes. A separate survey was conducted in the same communities around the same time, involving the same individuals as the trial, so we have considerably more information available and can examine other potential outcomes of the intervention (the cash was given to the household and not conditioned on child outcomes*, so could have a range of effects). However, we're facing limitations in sample size in the original trial, so we're exploring possibilities to increase our sample size by utilising the fact that a much larger number of people participated in the survey but not in the trial, and hence can be considered 'control' (=no intervention). These potential 'additional' individuals from the same communities would be matched to those participating in the trial. In a way, we're dealing with a 'hybrid' of an original and a synthetic control group. But in any case the treatment was assigned at the cluster level, which is why we want to adjust for the clustering (as the article describes).

      I agree with you that I would calculate the PS for the entire sample because it seems to produce well-balanced groups. I'm currently planning to implement both approaches, i.e. estimate treatment effects with 'teffects', not accounting for the clustering, and account for the clustering in separate regressions. And then compare these (in addition to running analyses on the original trial sample).

      * The trial was in fact set up in a 1:1:1 way for control and two treatments: unconditional cash and cash conditional on child outcomes. However, the conditions were 'weak' because there was an initial period during which there was 'lenience' and the trial was terminated early, so the conditional cash intervention ended up being very similar to the unconditional cash. That's why one can consider both together as 'treatment', but that means that there is in the original data a 1:2 control:treatment sample.



      • #4
        I'm not sure it's really the case that "the SEs of the regression model are inaccurate because they do not take into account that the propensity scores were estimated." The fact that the propensity scores are estimated affects the SEs of the matched sample (hence the Abadie-Imbens calculation of SEs), not the SEs of the regression.

        Please see the explanation of the "first match, then regress on matched sample" approach and supporting references in
        https://www.statalist.org/forums/for...ion-covariates .
        David Radwin
        Senior Researcher, California Competes
        californiacompetes.org
        Pronouns: He/Him



        • #5
          Thanks, David. Those are some good references!

