
  • Avoiding confounders when sampling from a larger cohort

    I am trying to look at variation in glucose level as a function of another variable (var2) to see how var2 influences my dependent variable. I have a cohort of 1000 people but only need 300. I was thinking of treating glucose level as normally distributed and taking 100 people with glucose above 180 (high glucose group), 100 with glucose below 50 (low group), and 100 from the "normal" range. Now I really don't want my 300 to carry another strong confounder or to be biased in any way. My survey includes many other variables I could check as possible confounders, but I don't know where to start. I was thinking of running a linear regression to see which variables influence glucose levels, and then trying to ensure that the 300 I choose are matched on those variables so they don't influence the dependent variable too much. For instance, so that sex is not a confounder, I'm hoping to have 50 males and 50 females in each group of 100; the age distributions aren't even either.

    Any advice on how I can go about selecting a sample that varies by glucose level and var2 but is balanced on other possible confounders? I really want a sample in which I can properly measure the influence of var2 without confounders having a big influence.

  • #2
    Glucose level being a continuous variable, what you are proposing will be difficult to do, and probably not very effective at eliminating confounding.

    If your variable var2 is a categorical variable, then it will be easier to reduce confounding by sampling matched pairs. For example, if var2 is dichotomous, then you can assure that for every var2 = 1 male you also sample a var2 = 0 male as its match, and for every var2 = 1 female you also sample a var2 = 0 female as its match. If age is a potential confounder, you can impose a constraint that for every var2 = 1 person of a given age, you match that person to a var2 = 0 person who is of the same age to within some relatively narrow window of acceptability.

    Now, you can only take this so far. If you try to impose a large number of matching constraints, you will find that there are too many people for whom no suitable match exists in the cohort. And since exact matching on continuous variables such as age also tends to leave people unmatchable, you have to match continuous variables within some narrow range. (This is known as caliper matching.) But when you do that, the match becomes an imperfect control for confounding, and some confounding slips back in.

    These difficulties are the reason that matching is used sparingly in real-world research. In theory it is the ideal way to deal with confounding; in practice it is limited by serious feasibility issues. Consequently, one typically matches only on a small number of variables that are expected to be very strong confounders, and deals with the remaining confounding by including the other variables as covariates in the model. Finally, never forget that with observational data, no matter how many variables you match on or adjust for in the modeling, there always remains the possibility of residual confounding by unmeasured variables.

    If you decide to go this route, a search of this Forum will reveal many threads about how to perform matching in Stata. There are a number of variations on this theme, and probably one of them will suit your needs.
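    If it helps to see the mechanics, here is a minimal sketch of one such scheme in Stata: exact matching on sex combined with a +/- 2 year age caliper. All variable names here (id, var2, sex, age) are hypothetical, and this version matches with replacement, so one control can end up serving more than one case:

    Code:
    * Sketch only: exact match on sex, +/- 2 year age caliper (hypothetical names)
    preserve
    keep if var2 == 1
    keep id sex age
    rename (id age) (case_id case_age)
    tempfile cases
    save `cases'
    restore
    keep if var2 == 0
    joinby sex using `cases'                 // all case-control pairings within sex
    keep if abs(age - case_age) <= 2         // enforce the age caliper
    gen diff = abs(age - case_age)
    bysort case_id (diff): keep if _n == 1   // closest-age control for each case
    The result is one matched control per case (cases with no control inside the caliper simply drop out, which illustrates the feasibility problem described above).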



    • #3
      Thank you so much.

      I also have another question regarding my regression. I performed a multivariable regression to see which variables significantly affect blood glucose. When I run scatter glucose bmi, the plot suggests a positive correlation between the two, and in a univariable regression the coefficient of bmi is positive. But when I add more variables, the coefficient of bmi, while still significant, becomes negative. I realized that it is when I add one particular variable to the regression that bmi's coefficient turns negative. Does this mean that, adjusted for this variable, bmi actually has a negative relationship with glucose? I'm quite confused about how that could be and would appreciate any guidance.



      • #4
        This is known as Lord's paradox, which is a variant of Simpson's paradox. The Wikipedia page on Simpson's paradox is quite good and I recommend you read it. The gist of it is that when we add new variables to an analysis, thereby adjusting for their effects on the outcome, the effects of the original variables can change, even radically. Simpson's paradox is posed not in terms of regression analysis but in terms of contingency tables of categorical variables, but the underlying principles and conclusions are the same. It isn't really a paradox at all--it's just that the joint behavior of three variables can be counter-intuitive.

        In your situation, there are substantial correlations among glucose, bmi, and the new variable (which, for brevity, I will call newvar from here on). In addition to the Lord's/Simpson's paradox effects mentioned above, there is another phenomenon likely to be in play. While we commonly run linear regressions, most real-world relationships are not strictly linear over a wide range. So it may be that the best linear fit between newvar and glucose tends to overpredict glucose levels, and that adding bmi with a negative coefficient "corrects" that overprediction.

        So there are multiple possibilities. And the main thing is not to be surprised by this sort of thing. Anything can happen when you add a new variable to a regression.
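        For intuition about how the sign can flip, here is a toy simulation with entirely made-up numbers, in which bmi is positively associated with glucose on its own, yet carries a negative coefficient once the variable it is correlated with enters the model:

        Code:
        * Toy simulation (made-up parameters): sign flip on bmi
        clear
        set seed 12345
        set obs 1000
        gen newvar = rnormal(0, 1)
        gen bmi = 25 + 3*newvar + rnormal(0, 1)
        gen glucose = 100 + 10*newvar - 2*bmi + rnormal(0, 5)
        regress glucose bmi          // coefficient of bmi comes out positive
        regress glucose bmi newvar   // coefficient of bmi is negative
        In the first regression, bmi is partly standing in for the omitted newvar; adjusting for newvar in the second regression reveals the negative partial relationship.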

        To get a better feel for what is going on specifically in your data, I suggest you rerun your regression including both bmi and newvar, and then do the following:

        Code:
        margins, at(bmi = (20(5)50) newvar = (appropriate_range_of_values_for_newvar))
        marginsplot, xdimension(bmi) name(by_bmi, replace)
        marginsplot, xdimension(newvar) name(by_newvar, replace)
        In the above code replace appropriate_range_of_values_for_newvar by a numlist that covers the interesting range of newvar and includes half a dozen or more points, just like (20(5)50) does for bmi.
