This is a statistics question for the group. I am looking for recommendations or literature to guide what I should do.
The data I have are panel data for people who come to a clinic, start a treatment, and are followed up at regular intervals (every 3 months for up to 1 year). The data are observational and have already been collected, so there’s not much I can do about the design. In terms of dimensions, I have 5 timepoints (one prior to starting treatment and up to 4 follow-up visits), and N >> T.
Some people have missing follow-up visits, and I don’t know the reasons. Given the time frame, it is plausible that some missed visits are due to the pandemic, while others may be informative, depending on how people perceive the treatment to be working (or not). Some miss certain follow-up visits but show up to others, so every pattern of missing data is possible.
Though the data arise from repeated follow-up visits, the goal is to infer time to initial response to therapy. For this purpose, at some point I will need to analyze the duration in a time-to-event framework (probably using discrete-time methods, since visits are at nominal intervals).
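To make the discrete-time setup concrete, here is a minimal sketch of the usual person-period expansion, in which each person contributes one row per follow-up visit at which they are still at risk. The function name `to_person_period` and its arguments are hypothetical, chosen only for illustration, and the sketch assumes the event visit and last observed visit are already known for each person (that is, it ignores the missing-visit problem entirely).

```python
def to_person_period(subject_id, event_visit, last_visit_seen):
    """Expand one subject into one row per follow-up visit at risk.

    event_visit: 1-based visit of first response, or None if no response
                 was observed (censored). Hypothetical input.
    last_visit_seen: last follow-up visit at which the subject was observed.
    """
    # A responder stops contributing rows at the event visit;
    # a censored subject stops at the last visit actually observed.
    end = event_visit if event_visit is not None else last_visit_seen
    return [
        {"id": subject_id, "visit": t, "event": int(event_visit == t)}
        for t in range(1, end + 1)
    ]


# Responder at visit 3: rows for visits 1-3, event flag set on the last row.
rows = to_person_period("p1", event_visit=3, last_visit_seen=4)

# Censored after visit 2: rows for visits 1-2, event flag never set.
censored_rows = to_person_period("p2", event_visit=None, last_visit_seen=2)
```

A logistic or complementary log-log regression of `event` on visit indicators, fit to rows like these, is one common way to estimate discrete-time hazards.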
This got me thinking about what strategies are recommended for imputation in this kind of scenario. I am certain that some of the missingness is missing at random (MAR), driven by things like symptom burden, while some may be missing not at random (MNAR) if it depends on whether the person perceives the therapy to be effective or ineffective.
Right now I am considering sensitivity analyses that assume a best case or a worst case for each person’s first missing value. These aren’t perfect because they are certain to over- or underestimate the hazards, but they might be useful to “bracket” the plausible range of estimates.
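As a rough illustration of the bracketing idea, the sketch below computes discrete-time hazards twice: once treating a missing visit as a response (best case) and once as a non-response (worst case). Everything here is an assumption for illustration: the toy `panel` data are made up, the helper names are hypothetical, each person is assumed to have exactly 4 follow-up entries, and `None` marks a missed visit. Under the worst case, people who never return are kept in the risk set through the last visit.

```python
def first_response(visits, fill):
    """1-based visit of first response after replacing missing (None)
    values with `fill`; None if the person never responds."""
    for t, v in enumerate(visits, start=1):
        status = fill if v is None else v
        if status:
            return t
    return None


def discrete_hazards(panel, fill, n_visits=4):
    """Hazard at visit t = first responses at t / people still at risk at t."""
    event_times = [first_response(v, fill) for v in panel.values()]
    hazards = []
    for t in range(1, n_visits + 1):
        at_risk = sum(1 for e in event_times if e is None or e >= t)
        events = sum(1 for e in event_times if e == t)
        hazards.append(events / at_risk if at_risk else float("nan"))
    return hazards


# Made-up toy data: True = responded, False = not responded, None = missed visit.
panel = {
    "p1": [False, True, True, True],
    "p2": [False, None, True, True],
    "p3": [None, None, None, None],
    "p4": [False, False, False, False],
    "p5": [True, None, True, True],
}

best = discrete_hazards(panel, fill=True)    # missing counted as response
worst = discrete_hazards(panel, fill=False)  # missing counted as non-response
```

Comparing `best` and `worst` visit by visit gives the bracket. It is a bound under extreme assumptions, not an estimate under any plausible missingness model.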
I would appreciate any thoughts or pointers to literature that considers this type of scenario.