  • propensity score matching on hospital data with multiple observations per person

    Hello All,

    I am trying to do propensity score matching on hospital data with multiple observations per person. My main goal is to estimate the effect of an intervention program on hospital charges. A binary variable indicates whether or not an individual participated in the program. There is one hospital charge record for each visit a person made to the hospital, and a common patient number links all visits by the same individual. Values of time-invariant predictors are replicated across each individual's records/visits.
    If I simply fit a logistic regression, estimate the propensity scores, and then match the treatment group to a control group (with the psmatch2 command), the analysis will assume that each record is independent, which is definitely not the case. I would appreciate any help on how to go about this.
    Thank you!

  • #2
    Reasonable concerns, but for the wrong reason.

    There is nothing in the theory of propensity score matching that requires the use of a logistic regression or any other specific procedure to generate the propensity scores. All that matters is that the propensity scores are well calibrated estimates of the probability that a unit in the analysis belongs to the treatment group. Any procedure that does this well is admissible for the purpose. In particular, it does not matter whether that procedure generates correct p-values, or even generates any p-values or other inferential statistics at all. So independence of observations is really beside the point.
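    As a rough illustration of what "well calibrated" means here, the sketch below (in Python, standard library only) groups units into score bins and compares the mean predicted propensity in each bin with the observed share of treated units. The function name `calibration_table`, the bin count, and the data are all hypothetical; with good calibration the two columns should be close within each bin.

```python
from collections import defaultdict

def calibration_table(scores, treated, n_bins=5):
    """Group units into equal-width score bins and compare, per bin,
    the mean predicted propensity with the observed share treated."""
    bins = defaultdict(list)
    for p, t in zip(scores, treated):
        # Clamp so that a score of exactly 1.0 falls in the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        bins[b].append((p, t))
    table = []
    for b in sorted(bins):
        pairs = bins[b]
        n = len(pairs)
        mean_pred = sum(p for p, _ in pairs) / n
        share_treated = sum(t for _, t in pairs) / n
        table.append((b, n, mean_pred, share_treated))
    return table

# Hypothetical person-level scores and treatment indicators (made-up data).
scores = [0.10, 0.15, 0.12, 0.18, 0.45, 0.50, 0.55, 0.48, 0.85, 0.90, 0.95, 0.88]
treated = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0]

for b, n, mean_pred, share in calibration_table(scores, treated):
    print(f"bin {b}: n={n}  mean score={mean_pred:.2f}  share treated={share:.2f}")
```

    Nothing in this check depends on how the scores were produced, which is the point: any procedure that yields scores passing such a check is admissible.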

    Now, in your situation, there are some special considerations. You have longitudinal data. But, if I understand your description of the study correctly, each person retains the same group assignment during all observations. Otherwise put, treatment status is defined at the person level, not the visit level. For that reason, your propensity model has to operate at the person level as well. It can incorporate data from multiple visits if you like, but what you do not want to do is generate a different propensity score for the same person on different visits. A simple logistic regression analysis on the longitudinal data set is unlikely to live within that constraint.

    So it is probably preferable to calculate propensity scores from a reduced data set that has just one observation per person: the propensity predictors can be derived from the multiple observations in whatever way seems most reasonable. In some instances summary statistics for those variables might be appropriate; in other instances it might make sense to have the data in wide shape so that each instance of the predictor gets its own say in the matter. Alternatively, you can build a propensity model using longitudinal analysis techniques: melogit would be an example. But then you will need to synthesize from those results a single propensity score for each person.

    The good news is that you have many options available, so that something is likely to work well. The bad news is that there are many options and sorting them out may prove difficult and time-consuming.
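    A minimal sketch of the person-level constraint, written in Python (standard library only) rather than Stata so that it is self-contained: the propensity model is fitted on one row per patient, and the resulting score is then copied onto every visit of that patient. The logistic regression here is just one admissible choice, fitted by plain gradient ascent for illustration; the patient IDs, covariates, charges, and the helper `fit_logistic` are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Fit a logistic regression by plain gradient ascent on the mean
    log-likelihood. Returns [intercept, coef_1, ..., coef_k]."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], xi)))
            grad[0] += err
            for j, x in enumerate(xi):
                grad[j + 1] += err * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Hypothetical person-level data: ONE entry per patient.
# Covariates: (age in decades, centered at 5; number of chronic conditions).
# The second element is the treatment indicator.
persons = {
    "P1": ([-1.0, 0], 0),
    "P2": ([1.5, 2], 1),
    "P3": ([0.0, 1], 0),
    "P4": ([2.0, 1], 1),
    "P5": ([0.5, 2], 0),
    "P6": ([-0.5, 1], 1),
}

ids = list(persons)
X = [persons[p][0] for p in ids]
y = [persons[p][1] for p in ids]
beta = fit_logistic(X, y)

# Exactly one propensity score per patient.
pscore = {
    p: sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], persons[p][0])))
    for p in ids
}

# Hypothetical visit-level records: (patient_id, charge).
visits = [
    ("P1", 1200.0), ("P1", 300.0),
    ("P2", 5400.0), ("P2", 800.0), ("P2", 950.0),
    ("P3", 700.0), ("P3", 650.0),
    ("P4", 2100.0), ("P4", 400.0),
    ("P5", 980.0), ("P5", 1500.0),
    ("P6", 1300.0), ("P6", 250.0),
]

# Every visit inherits its patient's single score, so the score can never
# differ across visits of the same person.
scored_visits = [(pid, charge, pscore[pid]) for pid, charge in visits]
```

    Matching would then be done on the person-level scores, with the visit-level charges carried along for the outcome analysis.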

    At the end of the day, the construction of a good propensity model is as much art as science, I think. It requires patient, persistent effort. The first attempts are seldom satisfactory, and you have to keep modifying the model until you reach good calibration. The more information you have, and incorporate, about factors that actually may influence treatment-control status the better. But there is no simple cut-and-dried formula or approach that works broadly.


    • #3
      Thank you! I initially tried reshaping the data to wide format, but I realized that may complicate things: I have over 5 million total hospital visits, and some individuals had as few as 2 visits while others had over 100. I agree that sorting through the different options will be time-consuming. The first option I'll try is computing summary statistics of the observations for each unique individual, and I'll see where I go from there.
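      A minimal sketch of that first option, in Python (standard library only): a single pass over the visit-level rows that keeps only running totals per patient, so it copes with very unequal visit counts without reshaping to wide. The column names `patient_id` and `severity` and the sample data are hypothetical; `severity` stands in for whatever visit-level predictor is being summarized.

```python
import csv
import io
from collections import defaultdict

def summarize_by_patient(rows, value_col, id_col="patient_id"):
    """One pass over visit-level rows; returns per-patient visit count,
    mean, min, and max of value_col."""
    acc = defaultdict(lambda: {"n": 0, "sum": 0.0,
                               "min": float("inf"), "max": float("-inf")})
    for row in rows:
        v = float(row[value_col])
        a = acc[row[id_col]]
        a["n"] += 1
        a["sum"] += v
        a["min"] = min(a["min"], v)
        a["max"] = max(a["max"], v)
    return {
        pid: {"n_visits": a["n"], "mean": a["sum"] / a["n"],
              "min": a["min"], "max": a["max"]}
        for pid, a in acc.items()
    }

# Made-up visit-level data; in practice this would be an open file handle.
sample = io.StringIO(
    "patient_id,severity\n"
    "P1,2\n"
    "P1,4\n"
    "P2,1\n"
    "P2,3\n"
    "P2,5\n"
    "P3,3\n"
    "P3,3\n"
)

summary = summarize_by_patient(csv.DictReader(sample), "severity")
for pid, stats in summary.items():
    print(pid, stats)
```

      The result has one row per patient, which is the shape a person-level propensity model needs.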