
  • Repeated survey - how best to merge and analyze data

    Hello,

    I have data derived from surveys distributed in two waves. In wave 1, there were 150 individuals, and in wave 2, there were 120 individuals. Approximately 80 individuals participated in both waves, while the rest were unique respondents who answered only in wave 1 or only in wave 2. Treatment was distributed across waves 1 and 2, with some individuals receiving it and others not (those who received the treatment in wave 1 did not receive it in wave 2).

    Now, I'm wondering how best to merge and analyze the data, considering the time lag. I'm unsure whether the xtmixed Stata function might be appropriate for this analysis. My aim is to determine whether exposure to treatment influences participation in support of a program.

    Thank you for your help.

  • #2
    On the assumption that the two waves of the survey asked the same questions (or almost entirely so), you will not want to -merge- these at all, you will want to -append- them, to create a long, longitudinal data set.
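    For concreteness, appending might look like this, assuming each wave is saved as its own dataset (hypothetical filenames wave1.dta and wave2.dta) with a shared respondent identifier id and identically named variables:

    ```stata
    use wave1, clear
    generate wave = 1
    append using wave2
    replace wave = 2 if missing(wave)
    xtset id wave   // declare the panel structure for the -xt- commands
    ```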

    In terms of choice of analysis, much depends on things that you have not described. Was the treatment assignment randomized? For those who were in both waves, was the decision of which wave to offer treatment randomized? What, if anything, do we know about the 70 people in wave 1 who did not continue, and the 40 newcomers in wave 2? How do they differ from the people who participated in both waves?

    If the study design had been simpler: 150 participants enrolled, and treatment assigned to each one randomly to be given either during wave 1 or wave 2, and if the number of dropouts was small, then treatment would be a purely within-person effect. As such, your best analysis would then be -xtreg, fe-. (Or some other suitable -xt...,fe- command depending on the nature of the outcome variable.) If you wanted to also look at the collateral effects of other variables, due to the randomization, you could also use -xtreg, re- (or -mixed-: same model, different estimator.) But even in this best case scenario, you need to include among your explanatory variables not just treated vs untreated but also a variable indicating the order of treatments and its interaction with the treatment variable, because in a situation like this the order itself may be an important predictor of the outcome (especially if the effects of being treated in wave 1 persist into wave 2 and thereby "contaminate" the purportedly-untreated status in wave 2 of those who were given treatment in wave 1.)
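    As a sketch, with placeholder variable names (y for the outcome, treated for treatment status in a given wave, order indicating the wave in which treatment was assigned), the models described above might look like this, assuming the data have already been -xtset- on person and wave:

    ```stata
    * Fixed-effects model. The main effect of order is time-invariant,
    * so it will be absorbed by the fixed effects, but its interaction
    * with treated remains estimable.
    xtreg y i.treated##i.order, fe

    * Random-effects counterpart (same predictors):
    xtreg y i.treated##i.order, re
    ```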

    But you have a large number of participants who were only in one wave. The -xtreg, fe- estimator would disregard them in analysis because they are singleton observations. And there could be substantial bias that results from throwing out so many (the majority!) participants. That's why we need to know more about them: who are they, and why did they not participate in both waves?

    If the treatment assignment was not randomized, things are even worse. You can do a treated vs untreated analysis of the entire corpus using -xtreg, re-, but it would be an enormous leap of faith to think of the results as an unbiased estimate of the causal effect of treatment.

    In short, this data design sounds like it may not be well suited to the task of estimating the treatment effect. Depending on the details I've asked about, it might be better than it appears at first glance. With additional information it might be possible to provide a somewhat better analysis and a somewhat more optimistic outlook.
    Last edited by Clyde Schechter; 31 Jul 2023, 09:07.



    • #3

      Thank you so much, Clyde, for your time and constructive comments.

      Regarding your questions,
      The assignment of treatment was randomized. The study/questionnaire was initially designed to be conducted in one wave. However, a decision was made to run a second wave. This time, the only condition was that those who were part of the treatment group in the first wave would be excluded from the treatment group again. Aside from this change, everything else remained the same as in the first wave.

      Thanks again!



      • #4
        Well, the randomization makes things a lot better. But there remains the problem of the 70 people who did not continue to the second wave and the 40 new recruits in that wave. If, ideally, the 70 people who did not continue were a random sample of the original 150, and the 40 new recruits in wave 2 were a random sample of the same population the original 150 were recruited from, then we could with great confidence just analyze the two appended data sets using -xtreg, re- (or some other -xt..., re- command* suitable to the nature of the outcome variable), with treatment status, order of treatment assignment, and their interaction as the predictors. Ideally the interaction coefficient would turn out negligible, and the coefficient of the treatment variable would give you a reasonable estimate of the treatment effect.

        If you have information to support those assumptions, then that would be the only analysis you need to do. But it would surprise me if those assumptions are really true. So I think you need to try some alternative analysis as well. Probably the most important such analysis would be to analyze only the 80 people who participated in both waves using -xtreg, fe- (or -xt..., fe- as appropriate to the nature of the outcome variable). If the treatment coefficient comes out reasonably similar to the one in the -xt..., re- analysis of the full data set, then that suggests that the change in composition of the sample between waves didn't really make much difference and the findings might be robust to this issue. If the coefficients of those analyses differ appreciably, then at least you have an estimate for the magnitude and direction of the bias resulting from attrition and replacement.
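        A sketch of restricting the fixed-effects analysis to the two-wave participants, using the same placeholder variable names as before and assuming the appended data are already -xtset- on person and wave:

        ```stata
        * Count waves per person; those with both waves have _N == 2.
        bysort id: generate nwaves = _N
        xtreg y i.treated##i.order if nwaves == 2, fe
        ```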

        If you have other information about the participants in your data set, you might also see if you can fit a reasonable model of drop-out and a reasonable model of being a replacement. You then might be able to use those models to develop inverse-probability weights for being a participant in the second wave and apply those as pweights to the random-effects model in the hope that this will reduce the bias from attrition and replacement. (To do this you will have to go to -mixed- or one of the -me- suite because the -xt- commands do not support weights. This is not a problem: as long as no random slopes are involved and we have only two levels, the -me- models are the same as the -xt- models--they are just estimated differently.) This is fairly complicated. If you end up going down this path and need help with coding, be sure to show example data, using the -dataex- command, and explain the situation carefully when you post.
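        A minimal sketch of that approach, assuming hypothetical baseline covariates x1 and x2 and an indicator inwave2 for appearing in the second wave (all names are placeholders):

        ```stata
        * Model the probability of being observed in wave 2
        logit inwave2 x1 x2
        predict p2, pr
        generate ipw = 1/p2

        * Weighted random-effects (mixed) model; -mixed- accepts pweights,
        * unlike the -xtreg- commands.
        mixed y i.treated##i.order [pweight = ipw] || id:
        ```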

        *In #1 you refer to -xtmixed-. That command was renamed -mixed- several versions back. While the name -xtmixed- is still accepted, and invokes the -mixed- command, some day that name might no longer be recognized, so best to start calling it by its current name. Also, to be clear, when I refer to -xt- commands, -xtmixed- is not among them. I'm referring to commands like -xtreg-, -xtlogit-, -xtpoisson-, etc. that are applicable only to two-level models, do not allow random slopes, and usually support both fixed- and random-effects modeling of the top level. The more flexible -me- commands, like -mixed-, -melogit-, -mepoisson-, etc. allow multilevel models of all depths and also support random slopes as well as intercepts. But there are no fixed-effects models with more than 2 levels.



        • #5
          Super informative and helpful! Thank you so much
