
  • Propensity score matching for a longitudinal datafile with psmatch2

    Dear Stata users,

    I have a question about propensity score matching for a longitudinal datafile with a time-varying treatment variable, time-constant matching variables (for instance gender and background status) and time-varying matching variables (for instance age, but also a neighbourhood deprivation score that varies per year).

    I have access to a long-format datafile (2005-2011) with yearly administrative data (residential, demographic, socioeconomic information) on almost 14,500 individuals. About 500 individuals are relocatees, forced to move out of their dwellings in highly deprived neighbourhoods due to urban restructuring policies. The rest of the individuals in the data are main tenants of addresses in the same city (and sometimes even the same deprived neighbourhood) that were not subject to an urban renewal program. The 500 relocatees received substantial financial compensation for moving costs and a priority position on social housing waiting lists, and got to choose from a wide range of social housing dwellings across the city.

    I want to match these 500 relocatees to 500 comparable residents and investigate whether being subject to an urban renewal policy has been beneficial to relocatees (moving to a more affluent neighbourhood and increased socioeconomic opportunities) compared to their counterparts who were not forced to relocate because their dwellings were not being demolished (they could still move voluntarily, but without assistance or compensation).

    I chose to match each treated individual, based on their characteristics in the year before treatment (so for person ID 1 in the table below, the year 2006), with an individual from the control group with similar characteristics in the same year. I followed this example, http://www.stata.com/statalist/archi.../msg00073.html, to force an exact match:

    probit treatment gender age deprivation_index couple kids, cluster(ID)

    *** For the treated group, the predicted probability is calculated for the year before treatment
    *** (in the example above this is 2006, but it could also be 2005, 2007, 2008 or 2009).
    *** For the control group, it is calculated for all years (so we can search for a match among all years of the control group).

    predict double pscore if e(sample) & (yearbeforetreatment == 1 | treatment == 0)

    set seed 123456
    gen u = runiform()
    sort u

    ** To force the exact match, I add the year to the p-scores:

    gen pscore2=year+pscore

    See an example of my data and estimated pscores in the table below (the values are fictional, as the data is highly confidential):

    Year  Person ID  Address ID  Treatment  Year before treatment  Gender  Age  Deprivation score  Couple  Kids  pscore  pscore2
    2005  1          78103       0          0                      M       32   2.13               1       0     .
    2006  1          78103       0          1                      M       33   2.17               1       0     0.013   2006.013
    2007  1          66405       1          0                      M       34   0.45               1       0     .
    2008  1          66405       0          0                      M       35   0.48               1       0     .
    2009  1          66405       0          0                      M       36   0.42               1       1
    2010  1          53020       0          0                      M       37   1.22               1       1
    2011  1          53020       0          0                      M       38   1.18               1       1
    2005  2          11401       0          0                      F       44   1.83               1       1     0.022   2005.022
    2006  2          11401       0          0                      F       45   1.87               1       1     0.021   2006.021
    2007  2          11401       0          0                      F       46   1.88               1       1     0.025   2007.025
    2008  2          11401       0          0                      F       47   1.84               1       1     0.026   2008.026
    2009  2          11401       0          0                      F       48   1.90               1       1     0.027   2009.027
    2010  2          90622       0          0                      F       49   0.98               1       1     0.023   2010.023
    2011  2          90622       0          0                      F       50   0.96               1       1     0.027   2011.027

    Then I used the psmatch2 command to make exact matches.

    psmatch2 treatment, pscore(pscore2) noreplacement neighbor(1) common caliper(0.01)

    This forces each treated individual to be matched with a control person in the same year. What it also sometimes does, however, is match two treated individuals to the same control individual in different years. For example, treated individual A (treated in 2007, matched on the pre-treatment year 2006) and treated individual B (treated in 2009, matched on 2008) are both matched to control individual C, once in 2006 and once in 2008. This is due to the nature of the data: individual C also changes over time (a change in neighbourhood deprivation index, a change in household status, et cetera), so the 2008 observation of this person is a good match for person B, while the 2006 observation was a good match for person A.

    We have enough control individuals (about 14,000), so we do not want any individual in the control group to be used twice. The ‘noreplacement’ option does not help here: it guarantees that a comparison observation (a person-year) is not used as a match more than once, but because I use person-year data, the same comparison individual can still be used as a match more than once. Does anybody know how to restrict Stata to use each comparison individual only once in longitudinal data?
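
    To see how often this happens, the matches that psmatch2 stores can be translated back to person IDs. This is only a sketch: it assumes psmatch2 has just been run with neighbor(1), so that the variables it creates (_id, _n1 and _treated) exist, and that ID identifies the person. The names control_ID and n_uses are made up for illustration.

    ```stata
    * Build a lookup from psmatch2's observation id (_id) to the person ID
    * of each control observation.
    preserve
        keep if _treated == 0
        keep _id ID
        rename (_id ID) (_n1 control_ID)
        tempfile controls
        save `controls'
    restore

    * Attach the matched control's person ID to each treated row via _n1.
    merge m:1 _n1 using `controls', keep(master match) nogenerate

    * Count how many treated rows share the same control person.
    bysort control_ID: gen n_uses = _N if !missing(control_ID)
    list ID year control_ID n_uses if n_uses > 1 & !missing(n_uses)
    ```

    Any row listed at the end is a treated person whose match is a control individual that was also matched to someone else in another year.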

    Furthermore, any other comments and suggestions regarding my matching procedure are very much welcome!

    Thanks in advance,

    Emily

  • #2
    If you could restructure your query to focus not on what you have, but on a) what format your data is in; and b) what you would like as a result, that would make it easier to help you. As is, it takes more effort than I (and perhaps others) might want to expend to figure out *what you want.*

    My rule of thumb is that if I don't have an idea of what a questioner wants by the 3rd sentence, I probably don't want to go further. A look through the archives will probably show that I've failed at the

    That being said: To achieve a no replacement match (which I think is at the essence of your question), I'd consider putting the data into wide format for the purposes of finding the matches, then going back to long format as necessary for the propensity score analysis. Since you only have 7 years, wide format should not be a huge problem.
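
    A minimal sketch of that round trip, assuming the long file is identified by ID and year and that gender is constant within person. The variable list is illustrative only: every variable that changes over time has to be listed, otherwise reshape stops and reports that the variable is not constant within ID.

    ```stata
    * Long -> wide: one row per person, one column per variable-year
    * (age2005, age2006, ..., kids2011).
    reshape wide treatment yearbeforetreatment age deprivation_index couple kids, i(ID) j(year)

    * ... find the matches here, with exactly one observation per individual ...

    * Wide -> long again for the outcome analysis.
    reshape long treatment yearbeforetreatment age deprivation_index couple kids, i(ID) j(year)
    ```

    With one row per person, a no-replacement match cannot pick the same individual twice, because there is only one observation of them to pick.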

    Regards, Mike


    • #3
      Dear Mike Lacy,

      Thanks very much for your response and feedback! (and my apologies for the long introduction, it is my first post).

      The problem in a nutshell: the treatment is time-varying and this makes it - if I understand the procedure correctly - (1) difficult to match the treated to comparison individuals based on pretreatment characteristics and (2) difficult to calculate pre- and post-treatment outcome differences to estimate a difference in difference model.

      About my data and the methods/models I would like to apply: I have yearly panel data from 2005-2011 and I would like to match each relocatee to a comparison individual who was not forced to move, based on the relocatee's pre-treatment characteristics (the treatment is the forced move, together with financial compensation and assistance in finding a new dwelling, and it takes place in a different year for each relocatee). After the match, I would like to estimate a difference-in-differences model to assess whether the treatment has been beneficial to relocatees (an increased income or better neighbourhood, for instance, compared to their counterparts who were not forced to relocate).


      You are right, wide format is definitely feasible, so that would then be something like the following:
      Code:
      probit treatment gender age05 ... age11 deprivation_index05 ...deprivation_index11 income05...income11...kids05...kids11
      
      predict pscore
      However, this is where my concern arises: the comparison group is then matched not only on pre-treatment characteristics of the treated group but also on post-treatment characteristics. Because treatment is time-varying, I cannot estimate a probit model that includes only pre-treatment years that apply to all treated individuals: for some individuals the year 2007 is pre-treatment, for others it is post-treatment. Including all years of each individual in the matching could be problematic, because the income variables for certain years (I want to use pre-treatment income to predict getting the treatment) are also (part of) my outcome variable.

      Furthermore, when it comes to the difference-in-differences design, it also gets a bit more complicated. For the difference-in-differences model I need to calculate the difference in income levels.

      Code:
      psmatch2 treated, pscore(pscore) outcome(difference_income) noreplacement neighbor(1) common caliper(0.01)

      When there is a fixed treatment year this is easy, but in my case the treatment year is time-varying: for an individual A that was forced to move in 2007, I would like to focus on the difference between income in 2006 and 2011, and for individual B that had the treatment in 2009 this is the difference in income between 2008 and 2011. For the comparison individuals, matched to the relocatees, I would like to use the same income difference (so 2011-2006 for the match of individual A, 2011-2008 for the match of individual B).
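
      As a rough sketch of the income differences described above, assuming a long file with ID, year and income. The variable base_year is hypothetical: it would hold the pre-treatment year of the treated person (2006 for A, 2008 for B) and would have to be copied to the matched comparison individual after matching, which only works if each comparison individual is matched once.

      ```stata
      * Income in the person-specific base year and in the common end year.
      bysort ID: egen double income_base = max(cond(year == base_year, income, .))
      bysort ID: egen double income_2011 = max(cond(year == 2011, income, .))

      * Same window for a treated person and for the matched comparison person.
      gen double difference_income = income_2011 - income_base
      ```

      The cond() call returns missing for all other years, and egen's max() ignores missing values, so each person ends up with the single income value of the requested year on every one of their rows.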

      I am thus not quite sure how to match with a time-varying treatment (as we want to use pre-treatment characteristics to match) and how to calculate the differences in outcome variables for the same years for the treatment and control group. The way I dealt with it so far - calculating predicted probability in the long format dataset and forcing an exact match per year - also runs into problems as I tried to explain in the first post: it matches two treated individuals to different years of one individual in the control group (and I am also not sure whether that is the way to go).

      Hoping you (or others) can help me out with this problem.

      Thanks again,


      Emily


      • #4
        I can appreciate the complexity of your problem, and your new explanation helps, although I'm still not entirely clear about how you want the matches to occur. To simplify matters: Suppose you only wanted to match on the number of children an individual had. Suppose that one treated individual had 1 child from years 1/3, and then had 2 children for years 4/5, and then had the treatment in year 6. What child-having pattern would you want the matched subject to have?


        • #5
          I think you want to do two things: first, compute a measure called "distance from the treatment" and then match people for the five years leading up to treatment. (The downside is that this could be a selected sample still since the treatment is not exogenous, but a result of the actions leading up to it.) This "distance from treatment" is how you deal with staggered treatments. The second approach would just be to cut off a subset of years and consider only those that everyone has in common before the treatment, or break the population off into groups with similar treatments.
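
          A minimal sketch of such a "distance from treatment" variable, assuming the long file from the first post in which treatment equals 1 only in the year of the forced move. The names treat_year and event_time are made up for illustration.

          ```stata
          * Year in which the person is treated (missing for never-treated).
          bysort ID: egen treat_year = max(cond(treatment == 1, year, .))

          * Years relative to treatment: -1 is the year before, 0 the treatment year.
          gen event_time = year - treat_year
          ```

          For comparison individuals event_time is missing at this point; it would have to be defined relative to the treatment year of the person they are matched to.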

          You might also want to look at Jens Hainmueller's and Abadie's synthetic control command (synth). The problem with pscore matching is sorting on unobservables, so selling this might be tricky.


          • #6
            Thanks both very much. The issue I am still dealing with is the fact that I observe everybody from 2005-2011 but that some have only 1 year up to treatment, and others have 2 up to even 6 years. I read the article by Nielsen & Sheffield (2009) and they discuss collapsing the time-element from the data, taking the average of all time-varying variables in the pre-treatment years. This will circumvent the issue of ‘child-having pattern’ as Mike describes, as it would match individuals on the average number of children in the pre-treatment period and the control period.
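
            A sketch of how such pre-treatment averages could be computed for the treated group, assuming treatment equals 1 only in the treatment year. The names first_treat, pre, kids_pre and n_pre are made up for illustration, and kids stands in for any time-varying covariate.

            ```stata
            * Treatment year per person (missing for the control group).
            bysort ID: egen first_treat = max(cond(treatment == 1, year, .))

            * Flag pre-treatment person-years; missing for the control group.
            gen byte pre = year < first_treat if !missing(first_treat)

            * Average of the covariate over the pre-treatment years only,
            * and the number of years that average is based on.
            bysort ID: egen double kids_pre = mean(cond(pre == 1, kids, .))
            bysort ID: egen byte n_pre = total(pre == 1)
            ```

            For the control group kids_pre stays missing here, because it is not yet decided which years their average should cover.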

            One question arises here, however: over how many years do I average the covariates in the control group? (I also observe individuals in the control group from 2005-2011.) I wonder about this because for each treated individual the average of covariates over the pre-treatment years is based on a different number of years (as explained above, some individuals have just 1 pre-treatment year, in which case it is not even an average but just the value of that year; others could have 2 up to 6 years). Might matching a treated individual whose average is based on 3 years to a control individual whose average is based on 7 years lead to biased results?

            I am not familiar with the distance from the treatment approach; any recommendations on literature for this? The approach of cutting off a subset of years that everybody has in common before the treatment has indeed crossed my mind as well; this would mean matching on covariates in 2005. One issue that arises here is that some individuals were not yet living in the dwelling that was demolished, and as I also want to match on the neighbourhood characteristics this might also bias the results. But this might be something to look further into.

            Something else I came across that might be of interest to you as well: Coarsened Exact Matching by Blackwell et al (2010), as described here: http://www.stata-journal.com/article...article=st0176. I was thinking of using this instead of PSM, as it allows for more exact matches with cut-off points.


            • #7
              Dear Emily,

              I have a similar problem to yours. Were you able to resolve it?

              Best regards,
              Alberto


              • #8
                Dear Emily,

                I have a similar problem with a time-varying treatment (in my case, privatization).
                Thank you for your question and your brilliant suggestion on year-forced matching.

                I am not sure whether this helps, but I have some ideas:

                Firstly, please define the outcome very precisely. Browse the data to check whether your outcomes look as you expect. For DID, the outcomes are generally the difference in some outcome measurement before and after the treatment. Please check both the treatment and the non-treatment group.

                Secondly, when you generate the new propensity score, I suggest storing it as a "double", which is more precise. The default storage type is float, which rounds the score and reduces the precision of the match.
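
                A small illustration of the difference, using the fictional value 2006.013 from the table in the first post. The variable names ps_float and ps_double are made up for this example.

                ```stata
                clear
                set obs 1

                * Same value, stored once as float and once as double.
                gen float  ps_float  = 2006 + 0.013
                gen double ps_double = 2006 + 0.013

                * Show enough decimals to make the rounding visible.
                format ps_float ps_double %14.8f
                list

                * Size of the rounding error introduced by float storage.
                display ps_double - ps_float
                ```

                A float carries roughly seven significant digits, so once a four-digit year is added only about three to four decimals of the propensity score survive. Declaring the combined variable as double, as in gen double pscore2 = year + pscore, avoids this.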

                Good luck!











