
  • Specifying the linear predictor for random intercept model on panel data.

    Dear Stata Community,

    I am running a random intercept model on a longitudinal dataset where I am interested in examining the association between certain protective/risk factors and suicide ideation among a group of psychiatric inpatients. Initially, 500 patients were assessed in 2018 and about 50% of those consented to be followed up at a later time (between 2019 and 2020). My outcome of interest is suicide ideation and my (primary) predictor variable is score on a questionnaire. Below is a list of variables in my dataset:
    1. suicide_ideation (binary; present/absent)
    2. score: score on questionnaire quantifying risk (continuous)
    3. age: (continuous)
    4. sex (binary)
    5. occasion: 1,2
    6. participant id
    My research question is: Is the score on the questionnaire quantifying risk associated with suicide ideation across the two occasions? I am having a hard time determining what the linear predictor for the random intercept model should be, given this research question. I've seen in some places that a time×predictor (i.e., occasion×score) interaction term could be included. When is it appropriate to include such an interaction term (i.e., what does this interaction term represent)?

    I've run the following models:
    Code:
    melogit suicide_ideation score || participant_id:, or
    => obtained non-significant association
    Code:
    melogit suicide_ideation score i.occasion || participant_id:, or
    => obtained non-significant associations for both predictors
    Code:
    melogit suicide_ideation c.score##i.occasion || participant_id:, or
    => interaction term not significant, score–suicide ideation association significant, occasion not significant
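    One way to see what the interaction model is estimating (a sketch on my part, assuming the variable names listed above, not something you must run) is to follow the interaction model with margins, which reports the occasion-specific average marginal effect of score on the probability scale:
    Code:
    melogit suicide_ideation c.score##i.occasion || participant_id:, or
    margins occasion, dydx(score)
    If the two occasion-specific marginal effects of score are similar, the interaction term is adding little, which is consistent with a non-significant interaction coefficient.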

    How can I interpret these varying results and significance outcomes?


    Thank you,
    Last edited by Sam Honer; 23 Jan 2021, 05:25. Reason: added tags

  • #2
    Statistical models should not be chosen based on previous research. Rather, they should be chosen using objective measures such as the statistical significance of predictors, BIC, AIC, etc.



    • #3
      Contrary to Sergey's advice in #2, models should not be chosen based on statistical significance of predictors. That's not science, that's p-hacking. The use of AIC or BIC for model selection is sometimes appropriate for selecting among nested models, but only for certain limited purposes. Nothing you have described about this project suggests that those purposes apply to it.

      Statistical models should be chosen to reflect as closely as possible the real-world data generating process, and parameterized in such a way as to make it simple to estimate the parameter of interest.

      Your study design involves baseline and follow-up evaluations. Ordinarily, to make this a cohort study, people with suicidal ideation at baseline would be excluded from the study. Was that done in your case?

      Or perhaps your goal is to determine whether the risk questionnaire score is associated with a decreasing probability of suicide ideation over time, so that inclusion of people with suicidal ideation at baseline was appropriate?

      Which is it? The analysis would differ, depending on which design was used.
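      To make the dependence on design concrete, the two cases might be sketched roughly as follows (variable names assumed from #1; this is an illustration of how the analyses differ, not a prescription):
      Code:
      * Cohort design: baseline ideators excluded, so ideation can only arise at follow-up;
      * model the follow-up outcome with an ordinary logit
      logit suicide_ideation score age i.sex if occasion == 2, or

      * Change-over-time design: both occasions, random intercept,
      * with the time-by-score interaction capturing differential change
      melogit suicide_ideation c.score##i.occasion age i.sex || participant_id:, or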



      • #4
        Clyde Schechter, with all due respect, I refer you to

        Hastie, T., Tibshirani, R., & Friedman, J. H. (2008). The elements of statistical learning: Data mining, inference, and prediction.

        and similar references. You suggest choosing a model because you think it reflects "the real-world data generating process", not because the data suggest it. That is contrary to the purpose of Statistics = Data Science. We should not get attached to models because we wish they were true, driven by our previous life experiences. We should approach models in a cold-blooded, scientific, mathematical fashion. If a suitable formal test (t-test, Wald test, LR test, etc.) says that the true coefficient is 0, how can you be certain that your opinion is more important than that of the test?

        Statistics is separate from domain knowledge. It lets some theories coming from social and medical sciences live. And it lets some of them die.



        • #5
          I think we are talking about different things. If you have several possible models for the data in mind, all of which are compatible with the known aspects of the data generating process, then you would discard models whose predictions are incompatible with observed data. But statistical significance of coefficients in the model does not speak to that. At most it speaks to whether certain values of that parameter of the model are consistent with the data. (I will spare you my long rant and references about why statistical significance shouldn't actually be used at all.) The same goes for AIC and BIC in terms of identifying parsimonious models (though a parsimonious model is not necessarily more correct) or avoiding overfitting the data. But to reject a model you need to show that its predictions are incompatible with observation: that means looking at model predictions and comparing them to observations.

          In the context of the post that started this thread, Sam Honer is not in the business of comparing a few models to test which one is correct or parsimonious. He has some data and a hypothesis he wants to test but he is unsure which model provides a test of his hypothesis. Or, outside the framework of hypothesis testing (which I prefer to be), he has a specific parameter he wants to estimate and he is unsure which model provides him with an estimate of that parameter. The models he provides differ in what estimands are actually being estimated. And as he has not clearly stated what parameter he actually wants to estimate, and has not given the design of his data collection, it isn't possible to say which model would be best for that purpose. But significance testing of coefficients has nothing to say about that. It is irrelevant whether a coefficient in one of those models is significant or not if that coefficient does not in fact estimate the parameter he is interested in estimating.

          I raised the spectre of p-hacking here because I have often seen on these pages that people are desperate to get a p < 0.05 result: they will use models that inappropriately represent their data or estimate the wrong parameter, and will prefer whichever model gives them p < 0.05. That is, as I said, not science.
