  • Appropriateness of comparing logistic regression models

    Hi all,

I've got a question regarding some statistical work I'm attempting to conduct. I have a cohort of patients separated into two groups: one receiving Test A and one receiving Test B. The groups are not randomised, and the Test B cohort is at higher risk of the disease of interest than the Test A cohort (more risk factors are present in the Test B group in general).

I've fitted logistic models (with predictors chosen based on theory and statistical significance) for both cohorts and created a marginsplot for each of Test A being positive and Test B being positive, with age as the continuous variable of interest and all other variables held at their mean values (all other variables are binary predictors).
I've also created a marginsplot for both logistic models with all predictors/risk factors set to '1', i.e. present.

My intention is to compare the probability of a positive test both at the mean values of all covariates and with all risk factors present.

The graphs show Test B having a higher likelihood of being positive at all ages, regardless of whether risk factors are present or held at mean values. I understand that the results have to be interpreted in the context of the markedly higher pre-test probability in the Test B cohort compared to the Test A cohort, but I'm wondering whether the all-risk-factors-present model helps limit the effects of this?
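The two covariate settings being compared can be sketched in Python with a fitted logistic model. The coefficients and risk-factor prevalences below are purely illustrative placeholders, not values from the poster's data; the point is only the mechanics of evaluating predicted probabilities "at means" versus "all risk factors present":

```python
import math

# Hypothetical coefficients for one cohort's logistic model:
# logit(p) = b0 + b_age*age + b_rf1*rf1 + b_rf2*rf2
# (values are invented for illustration only)
coef = {"const": -4.0, "age": 0.05, "rf1": 0.8, "rf2": 1.2}

def predicted_prob(age, rf1, rf2):
    """Predicted probability of a positive test at the given covariate values."""
    xb = coef["const"] + coef["age"] * age + coef["rf1"] * rf1 + coef["rf2"] * rf2
    return 1.0 / (1.0 + math.exp(-xb))

# At the (assumed) mean values of the binary risk factors...
p_mean = predicted_prob(60, 0.3, 0.2)
# ...versus all risk factors set to 1, as in the second marginsplot
p_all = predicted_prob(60, 1, 1)
print(round(p_mean, 3), round(p_all, 3))
```

Repeating this over a grid of ages for each cohort's model reproduces, in spirit, what `marginsplot` draws after `margins` with an `at()` specification.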



  • #2
    No, adjusting for "all" of the risk factors for the disease probably will not eliminate the bias resulting from the higher prevalence in group B to start with. It probably will reduce the bias, but complete elimination would be, at best, a happy coincidence. There are almost always unmeasured risk factors that leave residual confounding in the mix.

    I don't know what you plan to use the results of your analysis for. But I think that the problem is that you have chosen an inherently biased outcome variable, the positivity of the test. If this were my project, I would instead use this outcome restricted to those people who actually have the disease being tested for. In other words, I would in effect be modeling the sensitivity of the test. This would not be biased by the prevalence difference. Of course, this assumes that you do actually have a "true" diagnosis available to guide the sample selection.

Another, better, analysis would come from applying test A to population B and test B to population A and then pooling those results with the results you currently have. That way both tests would be applied to both populations and the bias due to prevalence would be completely eliminated.

    If the true diagnosis is not available, and it is not possible to retest the sample with the complementary test, then you cannot do what I have just suggested, and what you have done is probably the best that can be done with the existing data. In presenting results you should be modest in your claims and point out that there is likely to be residual bias due to the different prevalences in the populations sampled.

Added: Although it is tangential to your question, I can't help commenting on "with choice of predictors based on theory and statistical significance".
This is a common misunderstanding that arises from the opaqueness of the concept of statistical significance and the dreadful way in which it is routinely taught. Statistical significance has nothing at all to do with choosing covariates for the purpose of reducing confounding bias. What you have done is commonly done, but it is simply flat-out inappropriate.

Statistical significance purports to answer the question of whether the association between variables observed in the sample is attributable to such an association at the population level. But confounding is a strictly sample-level issue. If your sample has an imbalance on a predictor variable that is also associated with the outcome variable, then you have confounding, regardless of whether the same phenomenon prevails at the population level. In fact, one way of reducing confounding is to choose unusual samples that do not reflect the population but that break the confounding. Matched-pairs designs are an example of that: the matched variables no longer exhibit differences in the sample that would exist in the population.

The key point, though, is that what matters is whether the magnitude of the imbalance on the predictor is, given the magnitude of its association with the outcome, large enough to materially affect the outcome variable. It's all about these magnitudes, and significance has nothing to say about it.
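The point that confounding is about in-sample imbalance, not significance, can be illustrated with a small simulation. All numbers below are invented for illustration: group membership `g` has no true effect on the outcome `y`, but a covariate `z` is both imbalanced across groups and associated with `y`, so the naive group comparison is biased while a stratified comparison is not:

```python
import random

random.seed(1)

# g has NO true effect on y; z is imbalanced across groups AND drives y.
n = 20000
rows = []
for _ in range(n):
    g = random.random() < 0.5                      # exposure group
    z = random.random() < (0.7 if g else 0.3)      # z imbalanced across groups
    y = (2.0 if z else 0.0) + random.gauss(0, 1)   # y depends on z only
    rows.append((g, z, y))

def mean(vals):
    return sum(vals) / len(vals)

# Naive group comparison: biased by the imbalance on z
naive = mean([y for g, z, y in rows if g]) - mean([y for g, z, y in rows if not g])

# Stratify on z, then average the within-stratum differences: bias removed
diffs = []
for zval in (True, False):
    d = mean([y for g, z, y in rows if g and z == zval]) - \
        mean([y for g, z, y in rows if not g and z == zval])
    diffs.append(d)
adjusted = mean(diffs)

print(round(naive, 2), round(adjusted, 2))
```

The naive difference is far from zero purely because of the sample imbalance on `z`; no significance test on `z` was needed to know it had to be adjusted for.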
    Last edited by Clyde Schechter; 28 Jun 2023, 19:51.

    • #3
Thanks for the reply Clyde! Unfortunately not everyone who received the initial Test A or B received the gold standard test. Most of the people receiving the gold standard test have a positive initial test, so I believe sensitivity and specificity cannot be reliably calculated (I don't have the false negatives: if a negative screening test occurred, a gold standard test usually did not follow).

So I believe the prevalence bias unfortunately affects all my statistics.

I have instead looked at PPV and intend to discuss the limitations/bias involved with these statistics given the likely difference in pre-test probability of having the disease.
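The dependence of PPV on pre-test probability follows directly from Bayes' theorem, which is worth keeping in mind when comparing the two cohorts. A minimal sketch, with hypothetical sensitivity and specificity held fixed while only the prevalence changes:

```python
def ppv(sens, spec, prev):
    """Positive predictive value via Bayes' theorem."""
    true_pos = sens * prev
    false_pos = (1 - spec) * (1 - prev)
    return true_pos / (true_pos + false_pos)

# Identical (hypothetical) test characteristics, two pre-test probabilities:
low_prev_ppv = ppv(0.90, 0.95, 0.05)
high_prev_ppv = ppv(0.90, 0.95, 0.30)
print(round(low_prev_ppv, 2), round(high_prev_ppv, 2))
```

Even with the test itself unchanged, PPV roughly doubles here as prevalence rises, which is why a raw PPV comparison between the Test A and Test B cohorts is confounded by their differing pre-test probabilities.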

The graph of both logistic regressions of both tests being positive is simply to highlight the vast difference in probabilities regardless of age or all risk factors being present. I intend to acknowledge the significant bias involved given the different pre-test probabilities.

In terms of the latter half of your reply, I believe I understand what you've said. One of my main initial motivations was an explanatory logistic model, so the focus was on theory/confounding variables (i.e. whether removal affected the other coefficients beyond a pre-decided degree) and then statistically significant covariates of interest (for the purpose of relating them to the outcome of interest). If there is something wrong in my thought process please let me know; it is a mix of what was taught in my biostats course and my online reading about the steps of building an appropriate model.

      • #4
So, here's how I think about covariate selection. If a covariate is not a confounder, why might one want to include it? The usual reason would be that it is strongly associated with the outcome but balanced across the groups defined by the predictor. In that case, you can improve the precision of your coefficient estimates by including the covariate, thereby reducing residual variance. That's fine. But again, statistical significance does not tell you about that. The relevant statistic, in the linear regression context, would be the correlation between the covariate and the outcome. Statistical significance calculations include that information (or something equivalent to it) but are then contaminated by sample-size considerations that are not relevant to the residual-variance reduction achievable by including the covariate.
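The precision argument can be sketched with a simulation (all numbers invented for illustration). Here `z` is strongly associated with `y` but balanced across groups, so it is not a confounder; removing its contribution does not change the group-effect estimate on average, but it shrinks the estimate's sampling variability. For simplicity the covariate effect is treated as known rather than estimated by regression:

```python
import random
import statistics

random.seed(2)

def one_estimate(adjust):
    """One simulated group-difference estimate, with or without covariate adjustment."""
    n = 200
    data = []
    for i in range(n):
        g = i % 2 == 0                           # balanced group assignment
        z = random.gauss(0, 1)                   # prognostic, non-confounding covariate
        y = 1.0 * g + 2.0 * z + random.gauss(0, 1)  # true group effect = 1.0
        data.append((g, z, y))
    if adjust:
        # Remove the (known, for simplicity) covariate effect before comparing groups
        data = [(g, z, y - 2.0 * z) for g, z, y in data]
    m1 = statistics.mean(y for g, z, y in data if g)
    m0 = statistics.mean(y for g, z, y in data if not g)
    return m1 - m0

unadj = [one_estimate(False) for _ in range(500)]
adj = [one_estimate(True) for _ in range(500)]
print(round(statistics.stdev(unadj), 2), round(statistics.stdev(adj), 2))
```

Both estimators are centred on the true effect of 1.0, but the adjusted one has a much smaller standard deviation, which is the residual-variance reduction described above; note that nothing about this gain depends on a significance test for `z`.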
