Mixed vs fixed effects linear regression models

Matteo Bargagli

Join Date: Feb 2021

Posts: 17
#1

03 Feb 2021, 07:01

Dear all,
Thank you in advance.
I am a new user of Stata and I am having some trouble analysing a dataset with multiple time-points for each variable. The dataset is composed of these variables (in columns): "patient_ID" (the panel variable), "visit_number" (the time variable), "patient_treated" (y/n), "outcome_variable".
When "visit_number"=0 (baseline), "patient_treated" is always =n.
In addition, the decision to treat patients is not randomized but based on a medical decision (e.g., patients with a lower outcome variable at "visit_number"=0 have a higher probability of starting the drug).
My purpose is to analyse whether the explanatory variable "patient_treated" is significantly associated with changes in the outcome variable over time ("visit_number").
I used a mixed effects linear regression as follows:

xtset patient_ID visit_number

xtreg outcome_variable i.patient_treated

Is this correct, or is it better in my case to use a fixed effects model?
Clyde Schechter

Join Date: Apr 2014

Posts: 30091
#2

03 Feb 2021, 11:13

So there are two separate treatment effects operating here. One of them is the effect that takes place within individual patients. The other is the average outcome difference between treated patients and untreated patients. If this were a randomized assignment, we might expect these two effects to be the same, and in that case the random effects model (which is what you show in #1) would be the more efficient estimator.

With haphazard assignment to treatment (meaning not really randomized, but not based on a medical decision either) one might hope that these effects would be the same and you might gamble on the random effects model being OK. (Actually, in this case, I would do both models, and go with the random effects results if the models gave almost the same results, but go with the fixed effects model if the results are appreciably different.)

But you are in the worst situation for causal inference: a medical decision was made. If there is even a tiny amount of validity in the medical decision making (there may not be, but usually there is something), then these two effects will differ. So the between patients effect cannot be estimated without confounding bias, whereas the within-patient effect can. So you must use -xtreg, fe- here, as it is a pure within-patient estimator and is not contaminated by the bias in treatment assignment.
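
As a minimal sketch of the two estimators discussed above, using the variable names from #1. It assumes patient_ID is numeric and patient_treated is coded 0/1 (a y/n string would first need to be converted); the stored-estimate names m_fe and m_re are purely illustrative.

```stata
* Assumes patient_ID is numeric and patient_treated is a 0/1 numeric variable
xtset patient_ID visit_number

* Fixed-effects (pure within-patient) estimator
xtreg outcome_variable i.patient_treated, fe
estimates store m_fe

* Random-effects estimator (the -xtreg- default, as used in #1)
xtreg outcome_variable i.patient_treated, re
estimates store m_re

* Coefficients and standard errors side by side, to see how far the two differ
estimates table m_fe m_re, se
```

The side-by-side table is only a convenience for the "do both models and compare" approach mentioned for haphazard assignment; in the medical-decision setting described here, the -xtreg, fe- results are the ones to rely on.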
Matteo Bargagli

Join Date: Feb 2021

Posts: 17
#3

03 Feb 2021, 12:14

Thank you! A much appreciated suggestion.
So, is there no way to estimate the interaction of treatment with time between groups (either treated or not treated)? Could adjusting for the baseline outcome variable be a way?

Last edited by Matteo Bargagli; 03 Feb 2021, 12:20.
Clyde Schechter

Join Date: Apr 2014

Posts: 30091
#4

03 Feb 2021, 14:05

Well, it seems that your treatment variable starts at 0 (no) for everybody and then converts to 1 (yes) once treatment begins, and is then sustained there. In fact your treatment variable is, in a sense, a treatment#time interaction already.

You also have a fine-grained time variable referring to visit number. If visit_number is just baseline vs follow-up, then it is too similar to the treatment variable to be used for a treatment#time interaction. If, however, you have multiple visits in the pre- and post- treatment eras for each patient, you can interact treatment with visit_number to see if there is a variation over time in the treatment effect. Doing this effectively requires that you have a prior sense of how long a delay (if any) is expected before treatment effects are noticeable, and whether they rise and plateau, or rise and then taper, or just come on abruptly and are sustained.

If you have enough prior information to do that sensibly, there is no reason it can't be included in the fixed-effects model.

A separate question is what to do with the baseline value. One approach is the one you mention, including it as a covariate in the model. Another approach is to simply have the baseline observation as one of the observations for each patient in the regular fixed effects model. Algebraically these models are equivalent, although statistically they differ. The difference largely hinges on whether the baseline outcome variable is measured in exactly the same way and under the same conditions as the follow-up outcomes are measured. In formal, structured RCTs, usually baseline measurements are obtained in the study and are done using the same technique as all subsequent measures. In this setting, either approach is fine, although my personal preference, because it makes the results easier to interpret, is to have a separate baseline observation for each person, not use it as a covariate. In informal treatment trials such as yours, however, it is not uncommon to extract a baseline value from medical records obtained in the course of routine care. In that case, there can be methodological differences in these measurements leading, if nothing else, to heteroskedasticity, and often to different validity. So in this setting where the baseline measurement is obtained differently from the subsequent measurements, it should be used as a covariate, rather than as an additional observation in the study.
Matteo Bargagli

Join Date: Feb 2021

Posts: 17
#5

03 Feb 2021, 14:39

Thank you again.
My dataset has multiple time-points for the outcome variable. Furthermore, although no patient was treated at visit 0, not all patients started treatment at visit 1 (some started at visit 2 or 3). In addition, some patients started treatment at visit 1, then stopped at visit 2, and so on. (That's why I was so confused about the best analysis for this kind of dataset.)

Then there is the problem of the treatment effect (for simplicity I said I have 1 outcome variable, but of course I have many): for some outcome variables (e.g. serum osmolarity) the effect comes immediately (so if a patient started treatment at visit 1, I am sure the effect will be fully visible at the same visit, with a plateau on the following visits). By contrast, for other outcome variables (e.g. bone mass), I suppose the effect may be seen over a longer term and will probably also increase progressively.

As regards the baseline outcome variable, although there was no randomization in treatment administration, the cohort that I am studying is prospective, so the outcome variable is measured in the same way at baseline and all follow-up visits.

So following your explanation, the interaction patient_treated#visit_number should be possible (?) and the definitive code to perform my analysis will be the following:

xtset patient_ID visit_number

xtreg outcome_variable i.patient_treated#visit_number, fe

(or should it be i.patient_treated##visit_number?)

Is that correct in every scenario I described?
Clyde Schechter

Join Date: Apr 2014

Posts: 30091
#6

03 Feb 2021, 17:04

So the first big reaction I have is, you should not expect to have a single model for every outcome. The time courses of the outcomes are, by your description, quite different and require different models.

The differences in the models will be how you specify time. Now, if you just specify time as a discrete variable, i.patient_treated##i.visit_number (note the ##, not #, and the i.) you have something that might be called a universal model that can accommodate all the different outcomes' time courses. The problem is you will not be able to interpret the results easily because the same value of visit_number will mean different things for different outcomes. Actually, it's even worse than that. Given that different people start treatment at different times, even for the same outcome, the same visit number means different things for different patients.
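
For concreteness, the discrete-time specification just described might be sketched as follows, again assuming the variable names from #1 and a 0/1 numeric patient_treated. The -testparm- line is an added illustration, not something required by the model.

```stata
* Assumes patient_ID is numeric and patient_treated is a 0/1 numeric variable
xtset patient_ID visit_number

* Treatment, visit, and their interaction, with visit_number treated as discrete
xtreg outcome_variable i.patient_treated##i.visit_number, fe

* Joint test of the treatment-by-visit interaction terms
testparm i.patient_treated#i.visit_number
```

Because nobody is treated at visit 0, some treatment-by-visit cells are empty, and Stata will report those terms as omitted.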

So you will want to use a time variable that is not directly visit_number, but is derived from it. Probably you would want your time variable to be something like number of visits (or maybe amount of time) since the start of treatment, and 0 for all visits before that. That at least overcomes the differences between patients problem. But the patterns of coefficients you will get for these variables will still differ by outcome. For serum osmolarity you will see some baseline coefficient for the time = 0 variable, and then time = 1 and all larger values will show some other value (but it will be more or less the same value for all of them since a plateau is immediately reached.) You would be able to recognize patterns of increasing effect over time with later plateaus, or an immediate jolt followed by decay by corresponding patterns in the coefficients of the time variables. But this is a lot of work to interpret each outcome's pattern, and there may be subtleties or generalities you will not see staring at coefficient tables. So it is better, if possible, to have some sense of what the shape of the response vs time since onset of treatment curve looks like, and try to represent some transform of the time since onset of treatment variable that will represent it in a simpler way. Things like linear or cubic splines, for example, provide flexible models. The possibilities are numerous and can't possibly be covered fully in a Forum post.
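
One possible way to build such a derived time variable is sketched below. The names first_tx_visit, visits_since_tx, early and late are hypothetical, patient_treated is assumed to be coded 0/1, and the spline knot at 2 visits is an arbitrary placeholder rather than a recommendation. The sketch counts from the first treated visit only, so it does not handle patients who stop and restart treatment.

```stata
* Assumes patient_ID is numeric and patient_treated is a 0/1 numeric variable
xtset patient_ID visit_number

* First visit at which each patient is recorded as treated (missing if never treated)
bysort patient_ID: egen first_tx_visit = min(cond(patient_treated == 1, visit_number, .))

* 0 before treatment (and for never-treated patients); 1 at the first treated visit, 2 at the next, ...
gen visits_since_tx = cond(missing(first_tx_visit) | visit_number < first_tx_visit, ///
    0, visit_number - first_tx_visit + 1)

* Discrete version: one coefficient per number of visits since treatment start
xtreg outcome_variable i.visits_since_tx, fe

* Linear spline version with a single, arbitrarily placed knot at 2 visits
mkspline early 2 late = visits_since_tx
xtreg outcome_variable early late, fe
```

In the discrete version, an immediate plateau (as expected for serum osmolarity) would show up as roughly equal coefficients from visits_since_tx = 1 onward, while a gradually increasing effect would show up as coefficients that grow with visits_since_tx.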