  • ANOVA: validity of dropping independent variables with little variability?

    Hi guys,

    I have a question about the validity of an approach rather than its execution in Stata. Please forgive me if those questions generally aren't addressed here, as I'm still fairly new to Statalist.

    I am analyzing a dataset of about 500 emergency room visits looking at emergency department (ED) utilization among patients with diagnosed heart failure. One outcome of primary interest is whether these patients are admitted to the hospital from the ED or whether they're treated at the ED then released. I'm using an ANOVA model to predict this, using a total of 13 categorical independent variables.

    Prior to the study, as we'd discussed predictive variables we wanted to include, we'd planned to include patients' race (black/white/etc.) and ethnicity (Hispanic/non-Hispanic) to make sure we didn't observe any disparities in that regard. Now that I have the dataset, only 1.4% of the sample is Hispanic and 94% of the sample is of the same race. Unsurprisingly, the ANOVA model finds these variables to be insignificant predictors of the outcome, and they slightly hurt the model's adjusted R-squared by increasing the model's degrees of freedom.
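    The drop in adjusted R-squared follows from its formula, which penalizes every added parameter. A minimal sketch with hypothetical numbers (the R-squared values and parameter counts below are assumptions for illustration, not results from this study):

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for n observations and k estimated slope parameters."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical values: the two extra parameters raise R-squared only trivially.
N_VISITS = 500
without_extra = adjusted_r2(0.200, N_VISITS, k=20)
with_extra = adjusted_r2(0.201, N_VISITS, k=22)

print(f"adjusted R2 without the extra terms: {without_extra:.4f}")
print(f"adjusted R2 with the extra terms:    {with_extra:.4f}")
```

    When the added terms explain almost nothing, the penalty outweighs the tiny gain in raw R-squared, so the adjusted value falls.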

    I know that stepwise techniques that determine predictive variables post hoc based on p-values in multivariable models are frowned upon. I'm wondering if it would be considered legitimate to remove these independent variables from the model, not because of the lack of predictive value, but because of the lack of variability of these variables in my dataset. Can anyone with a stronger statistical foundation than mine help me here? Am I honor-bound to include these variables in my final model because I'd planned to include them, or is it acceptable to drop them because of how little they vary in my data?

    Thanks!
    Last edited by Blake Dawson; 16 Aug 2016, 12:40.

  • #2
    Many people are concerned about irreproducibility of research results, and, in particular, about people fitting many models and then cherry-picking the results they like best. That seems not to be your intent here. Still, picking a model based on some kind of fit statistic can quickly lead down the cherry-picking path and culminate in a model that is overfit to the noise in the data.

    A dichotomy that is split 493/7 is going to be a worthless variable unless its effects are extremely large (and even then, they will be very poorly estimated). The case for dropping your race variable is a bit weaker than that, as it seems to be split 470/30, which might lead to tolerably precise effect estimates.
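    To put rough numbers on that contrast, here is a minimal sketch of the standard error of a simple difference in admission proportions between the two groups. It assumes an admission probability of 0.5 in both groups (an assumption chosen for illustration, not a figure from the study) and ignores the other covariates, so it only indicates relative precision:

```python
import math


def se_diff(p: float, n_major: int, n_minor: int) -> float:
    """Standard error of a difference in proportions when both groups share probability p."""
    return math.sqrt(p * (1.0 - p) * (1.0 / n_major + 1.0 / n_minor))


P_ADMIT = 0.5  # assumed admission probability, for illustration only

splits = {
    "ethnicity 493/7": (493, 7),
    "race 470/30": (470, 30),
    "balanced 250/250": (250, 250),
}
for label, (n_major, n_minor) in splits.items():
    print(f"{label}: SE = {se_diff(P_ADMIT, n_major, n_minor):.3f}")
```

    Under these assumptions the 493/7 split gives a standard error about twice that of the 470/30 split and roughly four times that of a balanced split, which is why only a very large effect could be detected.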

    I think the honest thing to do is to first compare the results you obtain with models including and excluding the ethnicity variable. If the results, with respect to other variables, are largely similar, it would be fair to either:

    A) Report both results side by side, acknowledging that the model including the variable was the planned analysis, but pointing out the lopsided distribution of the ethnicity variable as a rationale for omitting it.

    or

    B) Report only the analysis with the variable omitted, but disclose that this is a departure from the original plan. Then state that the impact of excluding that variable on the other estimates was negligible, and offer to make the results of the planned analysis available to readers on request (or include them as an appendix).

    If the results for other variables in the two analyses differ materially, then I think you have no choice but to present both sets of results and do some hand-waving about what happened.

    All of that said, I'm wondering how you ended up in this predicament, and thinking about how you might avoid this dilemma in the future. You apparently had an analysis plan set out before you began. When you wrote that plan, were you unaware that the data source you would be using would produce so few Hispanic participants? If I had known that ahead of time, I would have planned to not use an ethnicity variable in the first place. If the low prevalence of Hispanics comes as a surprise, it raises a question about whether the data have been sampled correctly, and I would pursue that question with the people who collected the data before proceeding to analyze it. Perhaps they did not follow the recruitment protocol properly; or perhaps some of the data are missing? In planning future studies, if you are unfamiliar with the demographics of the population you will be sampling, it would be wise to get some preliminary data on that before you draw up your study protocol. Life offers lots of data surprises that can't be foreseen or avoided, but this typically wouldn't be among them.


    • #3
      I like Clyde's reasonable approach to the matter, but if it were me, I would follow the agreed-upon analysis plan. Here's how I would view it. If omitting the two predictors leaves the others' coefficients and standard errors substantially unaffected, then there is little or nothing to be gained by omitting the two (improvement in R-squared just isn't a motivation to me), and much to be lost by taking that first step onto the slippery slope. On the other hand, if omitting the two does affect the others' estimates, then the two do indeed matter and, by that very fact, I would feel obliged to acknowledge their importance in the model by including them.

      You're using what appears to be a Linear Probability Model for the analysis. Is that because you're concerned about the potential for separation or quasi-separation with thirteen categorical predictors if you were to use, for example, logistic regression to predict hospital-admission versus treatment-and-release?
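      As a rough illustration of the separation concern: in a logistic model, a category whose members all share one outcome produces quasi-complete separation for that category's coefficient. The sketch below computes the chance of that happening for a group of a given size, assuming independent patients and a common admission probability (both are simplifying assumptions, and the probabilities used are hypothetical):

```python
def prob_single_outcome(n_group: int, p_admit: float) -> float:
    """Probability that all n_group patients are admitted, or all are released."""
    return p_admit ** n_group + (1.0 - p_admit) ** n_group


# Group sizes come from the thread (7 and 30); admission probabilities are assumed.
for n_group in (7, 30):
    for p_admit in (0.5, 0.8):
        risk = prob_single_outcome(n_group, p_admit)
        print(f"n = {n_group:2d}, p = {p_admit}: P(all same outcome) = {risk:.6f}")
```

      With only 7 patients in a category the risk is not negligible when admission is common, whereas with 30 it is very small under the same assumptions.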


      • #4
        Thanks for the thoughts. As to why we included Hispanic ethnicity, it was just poor planning: we included race and ethnicity in our list of potential predictive variables because it seems like every similar study includes them, despite the fact that our medical center is visited by very few Hispanics. If an impact of race could theoretically show up with a 470/30 split, I'm glad to leave that in the model.

        With regard to Joseph's question, including or omitting either race or ethnicity does not appreciably change the effect sizes of the other independent variables, only the model's adjusted R-squared. At this point, I plan to exclude the ethnicity variable from my analysis with a note about why, on the idea that if the only reason not to is the "slippery slope" argument, it's a pretty weak argument.

        Thanks again for your help.
