
  • #31
    As for A), B), and C), this is a very complicated question without a straightforward answer. I agree with many other members of this Forum that there is little or no legitimate role for stepwise regression approaches. But there is no simple way to summarize how one goes about selecting variables for inclusion in a model. Certainly nothing simple enough to cover in a post here. I think the most important issues are:
    It is best to have a theory behind what you are investigating and have the model reflect the relationships predicted or assumed by that theory as best you can.
    It is important to include as many confounders as you can. Inferences based on models that omit confounders produce misleading results.
    It is important to exclude colliders. Inferences based on models that include colliders also produce misleading results.
    If you are using a linear model, but the real relationships are not truly linear, it may be important to transform variables or include higher power terms or interactions to achieve good model fit.

    Those are the general principles. But applying them in real situations is quite complicated. A lot of thought and research into prior studies of related questions is required to do that.
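
    To make the confounder and collider points concrete, here is a minimal simulated sketch. Everything in it is an assumption made up for illustration: the variable names x, y, z, and c, the sample size, and the coefficients are hypothetical and do not come from any real study.

    ```stata
    * Simulated illustration; all names and coefficients are hypothetical.
    clear
    set seed 12345
    set obs 10000

    generate z = rnormal()            // confounder: affects both x and y
    generate x = z + rnormal()        // main predictor
    generate y = z + rnormal()        // outcome; x has no effect on y by construction
    generate c = x + y + rnormal()    // collider: affected by both x and y

    regress y x        // omits the confounder z
    regress y x z      // adjusts for the confounder
    regress y x z c    // additionally adjusts for the collider
    ```

    By construction the true effect of x on y is zero. With this setup, the coefficient on x should come out at roughly 0.5 when z is omitted, roughly 0 when z is included, and roughly -0.5 once the collider c is added.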

    As for D), this one is simple: interaction and effect modification are the same thing.



    • #32
      Clyde Schechter
      Thanks for the clarification. The confusion around model building makes sense now.



      • #33
        Hi Prof.,

        I have a query about backward elimination. Backward elimination begins with the largest model and eliminates variables one by one until we are satisfied that all remaining variables are important to the model. How should p-values be chosen for a variable with more than two levels? For example, a binary or continuous variable gives only one p-value, but a variable with more than two levels gives two or more p-values.

        1) Which p-value should be used to decide whether to remove a categorical variable with several levels: the highest or the lowest?
        2) Should one remove the whole categorical variable based on one p-value, or just remove the level with the highest/lowest p-value? But then a problem arises with the number of observations: when a level is removed, does the regression drop all the observations belonging to that level?

        Thanks



        • #34
          I have a query about backward elimination. Backward elimination begins with the largest model and eliminates variables one by one until we are satisfied that all remaining variables are important to the model. How should p-values be chosen for a variable with more than two levels?...
          Many of the experienced responders in this Forum regard stepwise variable selection as an egregiously bad statistical practice and decline to provide assistance in doing it. I am among them. Perhaps somebody with a more accepting attitude toward it will respond, though, in a sense, I hope not. This practice truly should be exterminated from the literature.



          • #35
            Clyde Schechter Thanks for your response.

            a) Which other methods do you suggest for variable selection in the health sciences?
            b) Does Stata have an option to calculate the median time between two dates? For example:

            obs 1: First date 1-1-2020 Second date 1-1-2021
            obs 2: First date 2-3-2021 Second date 2-3-2022
            obs 3: First date 1-2-2019 Second date 1-6-2020
            obs ... n



            • #36
              a) First, I assume that you have already set out your research goals sufficiently clearly that you know exactly what variables participate as explanatory/predictor variables and outcomes in those relationships, and which are of concern only as potential covariates in the analysis. My preferred approach is to map out a directed acyclic graph of all the variables that our best theoretical understanding says can impact, or be impacted by, the main predictor variable(s) or the outcome variable. Any variable that can impact both is a potential confounder and should automatically be included in the model. Any variable that can be impacted by both must be omitted from the model as a collider. Any variable that lies on a causal path from the main predictor(s) to the outcome (a mediator) must be omitted from the analysis, unless you are doing a path analysis and specifically want to quantify the direct and indirect effects.

              The determination of what impacts what should take study design into account. So, for example, if the study uses matched pairs, the exactly matching variables are no longer confounders and should not be included in other aspects of the analysis. Where there is some question whether a variable actually impacts or is impacted by one of these, bivariate analyses such as correlations or cross-tabulations or mean differences can tell you whether the associations are large enough to appreciably bias the estimates of the effects you are trying to study. By all means bear in mind that the purpose of variable selection is to improve the accuracy of study estimates, and everything hangs on the relationships among variables in the data sample. Consequently inferential test statistics and p-values have nothing to do with it--they don't answer sample-level questions--and should play no role at all in variable selection.

              The inclusion of other variables is optional. Including variables that impact the outcome, though not the explanatory variable(s), can be useful in reducing residual variation. On the other hand, if too many variables are included in the model, you can end up with overfitting or, in the extreme, an unidentifiable model. So these last decisions become matters of judgment. If the data set is too small to support an analysis with even just the mandatory confounders from the first paragraph, then you are in trouble, and probably should start fresh with a new study design.

              b) I'm not sure what you mean by the "median time between two dates." Do you mean the date that lies midway between them? If so:
              Code:
              egen midpoint = rowmean(first_date second_date)
              replace midpoint = floor(midpoint)
              format midpoint %td
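
              If instead you mean the median, across observations, of the elapsed time between the two dates, a minimal sketch might look like the following. It assumes, as the code above does, that first_date and second_date are numeric Stata daily-date variables; elapsed_days is a hypothetical variable name chosen only for illustration.

              ```stata
              * Assumes first_date and second_date are numeric Stata daily dates.
              * elapsed_days is a hypothetical name chosen for illustration.
              generate elapsed_days = second_date - first_date

              * The median is listed as the 50% percentile in the detailed output
              * and is also left behind in r(p50).
              summarize elapsed_days, detail
              display "Median elapsed days: " r(p50)
              ```

              If the dates are stored as strings, they would first need to be converted to numeric daily dates before the subtraction makes sense.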
