  • Testing robustness using bootstrapping and excluding random sample of population

    Dear all,

    I am interested in testing the robustness of my logistic regression results given some uncertainty about which observations truly belong in my analysis.

    My outcome is, for example, yes/no regarding whether someone attended a university hospital for a heart attack. I have a dataset including the variables year (2011, 2012), age, sex, economic status, education status, etc. I can link data between 2011 and 2012, but not before 2011 or after 2012. I am only interested in the first hospitalization, and have de-duplicated my dataset. In 2012, 10% of "primary cases" weren't actually primary, as those people had a primary event in 2011. I therefore worry that roughly 10% of individuals hospitalized in 2011 likewise didn't have their first hospitalization in 2011. To check that my final results are robust to these roughly 10% extra cases, I'd like to exclude 10% of people in 2011, but I want to exclude them at random, since I do not know exactly who should be left out of the analysis.

    Initially, I thought I might use the bootstrap command. But my question is: how do I do this in a loop so that I randomly exclude a different 10% of the 2011 population each time, and then still end up with a final overall odds ratio, some kind of mean OR? I thought I could write a program for this, roughly along the lines of the sketch below, but it doesn't seem to work, given that e(b) from a logistic regression is a matrix rather than a scalar. It might also be possible to do a sort of meta-analysis of the odds ratios obtained from the samples that exclude different people, but that doesn't seem to be the "cleanest" method. This isn't a main analysis, only a sensitivity analysis to make sure that my results wouldn't be affected if the scenario above (i.e., an extra 10% of people in 2011 without a primary episode) were indeed true.
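
    Roughly, what I had in mind is something like the sketch below (univ_hosp, econ_status, and educ_status are placeholders for my actual outcome and covariates; I'm not sure this is the right way to go about it, hence the question):

    Code:
    capture program drop drop10
    program define drop10, rclass
        // work on a copy of the data; it is restored automatically when the program ends
        preserve
        // randomly drop roughly 10% of the 2011 observations
        generate double u = runiform()
        drop if year == 2011 & u < 0.10
        // placeholder outcome and covariates
        logistic univ_hosp age i.sex i.econ_status i.educ_status
        // return the odds ratio for age as a scalar so it can be collected
        return scalar or_age = exp(_b[age])
    end

    // repeat 500 times; simulate replaces the data in memory with the collected results
    simulate or_age = r(or_age), reps(500) seed(12345): drop10
    summarize or_age

    The summarize at the end would show how the OR varies across the random exclusions, but I'm not sure that simply averaging these is the right way to report an overall result.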

    Does anyone know of another command, or has anyone come across a similar issue?

    Thank you for your help!

  • #2
    This approach doesn't make sense to me. Selecting a random sample of 2011 observations to delete does not, in any way I can see, emulate the exclusion of people whose events were secondary rather than primary. All it does is decrease your sample: given that it does so at random, you can be confident that it neither increases nor reduces whatever biases there may be in your original data set.

    Bootstrap sampling is not useful for this kind of situation. Bootstrap sampling's purpose is to calculate more accurate standard errors (or confidence intervals) in situations where the regression (or other statistical test) relies on distributional assumptions that are materially breached by the data being analyzed. The bootstrap draws resamples, with replacement, from the observed sample to emulate repeated random sampling from the population; under assumptions weak enough to make it pretty broadly applicable, that works and produces reasonable estimates of the sampling distribution of the coefficient or test statistic in question. It is not intended for, nor capable of, removing or attenuating biases in the data itself.
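
    (For reference, the typical use of bootstrap in Stata is simply as a prefix to obtain resampling-based standard errors and confidence intervals, e.g. something like the following, with placeholder variable names. It does nothing about the misclassification you are worried about.)

    Code:
    // bootstrap standard errors for an ordinary logistic model (placeholder variables)
    bootstrap, reps(1000) seed(54321): logistic univ_hosp age i.sex i.econ_status i.educ_status
    // percentile-based confidence intervals from the bootstrap replications
    estat bootstrap, percentile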

    Without any information about how to distinguish primary from secondary events in 2011, I don't see a way out for you. Here's what I would probably do in your situation. I would go through the 2012 data set (where you know which events are primary and which are secondary) and try to develop a model that distinguishes primary from secondary events based on the other variables you have, perhaps a logistic model (or some more flexible non-parametric approach) that assigns each observation a probability of being secondary and that is reasonably well calibrated in the 2012 data. Then I would apply that model to each observation in the 2011 data to calculate its probability of being a secondary event. My robustness analysis would then consist of excluding from the 2011 data all observations that exceed a certain threshold probability of being secondary. (Actually, I'd probably try several different thresholds and hope to see that the conclusions don't vary very much.)
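
    In Stata, that might look roughly like this (a sketch only; secondary, univ_hosp, and the covariate names are placeholders for whatever variables you actually have):

    Code:
    // model the probability of being a secondary event using the 2012 data,
    // where primary vs. secondary status is known (secondary is a 0/1 indicator)
    logit secondary age i.sex i.econ_status i.educ_status if year == 2012

    // predicted probability of being secondary for each 2011 observation
    predict double p_secondary if year == 2011

    // rerun the main analysis excluding 2011 observations whose predicted
    // probability of being secondary exceeds a threshold; try several thresholds
    foreach t in 0.3 0.5 0.7 {
        display as text _n "Excluding 2011 observations with p_secondary > `t'"
        logistic univ_hosp age i.sex i.econ_status i.educ_status ///
            if !(year == 2011 & p_secondary > `t')
    }

    One caveat: Stata treats missing values as larger than any number, so 2011 observations with a missing predicted probability would also be excluded by that condition; you may want to handle those explicitly.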
