Multilevel Longitudinal Model with missing data

Andrea Arancio

Join Date: Jan 2015

Posts: 56
#1

27 Jan 2015, 10:27

Hi Everybody,

I normally use xtmixed in Stata to test hierarchical linear models (e.g. performance of students nested in schools).
Now, for the first time, I need to test a longitudinal model with 3 waves of repeated measures. My measured variables refer to students' characteristics (i.e. sex, performance in each wave, hours of study in each wave, etc.). I'd like to build a model where the first level is the repeated measures over time and the second level is the student. I have two problems:
• Not every student participated in all three waves;

• The three waves are not equally spaced in time (i.e. the second one is 1 year after the first one and the third one is 6 months after the second one).

I would need your help to know if:
• having not equally spaced waves is a problem? Do I need to specify it in some way in Stata?

• how do I handle missing data? Can I just put 2 observations for one student (who missed the second wave) and 3 for another (who did not miss any wave)?

• Can I build a model like this?
xtmixed performance sex studyhours || StudentID:, var ml
Or do I need to include a variable taking into account the time?

Actually, I don't really care about the influence of time. My research question is about sex and study hours influencing performance. But I saw many cases where researchers include interactions with a variable specifying time.

Thanks a lot for your help!
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

27 Jan 2015, 11:19

Hello "aesempio" (please re-register with name and Family name by clicking on the "contact us" button below to the right)

In Stata 13, "xtmixed" becomes "mixed". Regarding unequal spacing, I think that's what unbalanced panel data is all about. Missing data, unless MCAR, can always be a problem. But mixed models do not rely on listwise deletion of whole subjects, so you still make use of the available information. Maybe you could handle the influence of time by employing "age" as a time-varying covariate (also, maybe, test adding squared terms). Evaluating students' performance suggests that we should take both between and within effects into consideration. In my opinion, if I understood your model, you could employ "xtreg" with practically the same results. Finally, working with multilevel models or panel data is, well, an intense task of modeling. In particular, I wouldn't think about building the "perfect" model immediately, but rather do, well, the modeling, and that means fitting several models, performing the postestimation checks and testing whether the whole scenario is appropriate or not. Sometimes you can start with a full-blown model, sometimes you play with some structures (like with "Lego"), sometimes something in between.
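To make the "xtreg" remark concrete, here is a minimal sketch, assuming long-format data (one row per student per wave) and the variable names from the original post (performance, sex, studyhours, StudentID), with StudentID stored as a numeric variable. A random-intercept model fit by maximum likelihood with -xtreg- should give essentially the same estimates as a two-level random-intercept -mixed- model:

```stata
// Assumption: long format, StudentID is numeric
xtset StudentID

// Random-intercept model via xtreg, fit by maximum likelihood
xtreg performance sex studyhours, mle

// The corresponding two-level random-intercept model (Stata 13 or later)
mixed performance sex studyhours || StudentID:
```

Comparing the two sets of coefficients side by side is a quick sanity check; -mixed- becomes necessary once random slopes or more levels are added.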

Hopefully that helps.

Best,

Marcos

Last edited by Marcos Almeida; 27 Jan 2015, 11:23.

Best regards,

Marcos
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#3

27 Jan 2015, 11:20

•having not equally spaced waves is a problem? Do I need to specify it in some way in Stata?

Depends on exactly what you want to do. If you ultimately want to estimate models with, say, autoregressive residual correlations, then you're out of luck. But for what you describe, it doesn't seem like this should be a problem at all. You may need to include a time variable in your model (or not, see below), but nothing else would be required to specifically "warn" Stata that the waves are not equally spaced.

•how do I handle missing data? Can I just put 2 observations for one student (who missed the second wave) and 3 for another (who did not miss any wave)?

One of the big advantages of mixed effects models over earlier approaches such as repeated measures ANOVA is that the former are tolerant of missing data: you can make use of all the data that is there, and there is no requirement for equal numbers of observations per student. Of course, you still have to think about the possibility of bias resulting from the missingness of some data and how you want to handle that. But the model will run regardless.
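As a rough illustration of what "the model will run regardless" looks like in practice, the sketch below assumes the data are in long format, with one row per student per wave actually attended and no rows for missed waves. The variable names are those of the original post; n_waves and first are hypothetical helper variables introduced here only for the check:

```stata
// Count how many waves each student contributes
bysort StudentID: generate n_waves = _N

// Tag one row per student so each student is counted once
egen byte first = tag(StudentID)
tabulate n_waves if first

// Report item-level missingness in the model variables
misstable summarize performance sex studyhours
```

The tabulation shows how many students have 1, 2, or 3 waves; it says nothing about why waves are missing, which is the question that matters for bias.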

•Can I build a model like this? xtmixed performance sex studyhours || StudentID:, var ml Or do I need to include a variable taking into account the time?

Yes. A few comments. If you are using the current version of Stata, the command is now called -mixed-, and the -var- and -ml- options are now the default, hence do not need to be specified. (If you are using an earlier version of Stata you are supposed to tell us that.)
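For reference, the two spellings of the model could look as follows. This is only a sketch using the variable names from the question; the older call spells out the options with their documented names:

```stata
// Stata 13 or later: variance components and ML are the defaults
mixed performance sex studyhours || StudentID:

// Stata 12 or earlier: the same random-intercept model
xtmixed performance sex studyhours || StudentID:, variance mle
```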

Whether you need to include a variable representing the effects of time, and possibly interactions between your other effects and time depends on the science underlying your work. Even though you are not interested in the effects of time, if there is reason to think that, even after accounting for sex and studyhours there may be changes in performance over time (e.g. if there has been a new curriculum introduced, or if the schools in question have changed their admissions practices, etc.) then, yes, you may need to include time in the model. This is crucial if the sex distribution or studyhours distribution is changing over time, but might be advisable even if they are not. If you are confident, however, that everything else is unchanged (which strikes me as a rather bold assumption, but I don't really know much about this area) then you can leave time out. As for interactions with time: a studyhours#time interaction, say, would be used to capture the possibility that the effect of studyhours on performance differs at different points in time (which might, for example, occur with a change in curriculum). Anyway, whether and how to represent time in your model is not a statistical question: it is a scientific substantive question that I advise you to discuss with your disciplinary colleagues.
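In factor-variable notation, the two options discussed above could be sketched as follows. This assumes a numeric variable named time holding the wave timing and sex coded as a non-negative integer; neither assumption is confirmed at this point in the thread:

```stata
// Time as a main effect only: adjusts for an overall trend in performance
mixed performance i.sex c.studyhours c.time || StudentID:

// studyhours-by-time interaction: lets the effect of study hours
// differ at different points in time
mixed performance i.sex c.studyhours##c.time || StudentID:
```

The ## operator includes both main effects and their product, so studyhours and time do not need to be listed separately in the second model.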
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#4

27 Jan 2015, 11:23

aesempio (please take a look at FAQ #6 on how to provide the list with a real full name identifier. Thanks).
Two remarks:
- missing values can bias your analysis if they are informative (please see -help mi- and related entries for more details and references on this topic in the Stata 13.1 .pdf manual). By the way, I'm not clear about what you mean by

Can I just put 2 observations...?

, but my gut feeling would not point me to that approach. On top of that, please note that by default Stata uses listwise deletion for observations with any missing value. Hence, you will probably end up with a different number of observations per wave;
- ideally, in panel data repeated measures should be equally spaced (one relevant exception to this rule is babies' growth curve analysis, which is an example of repeated measures usually taken at unequally spaced intervals, as highlighted by Cameron AC, Trivedi PK. Microeconometrics Using Stata. Revised Edition. College Station, TX: Stata Press, 2010: 236).
As a concluding aside, -xtmixed- has been superseded by -mixed-: as per the FAQ again, you should inform the list about the Stata version you're using for your analysis if it is older than the latest one (i.e., Stata 13.1).

Last edited by Carlo Lazzaro; 27 Jan 2015, 11:29.

Kind regards,
Carlo
(Stata 19.0)
Dick Campbell

Join Date: Apr 2014

Posts: 279
#5

27 Jan 2015, 11:33

One of the many advantages of mixed models is that they do not require balanced data; that is, each subject does not have to be observed at each measurement point, and the measurement points do not have to be at the same time for all subjects. So, your first subject might be observed at baseline, two months and six months, the second at baseline, three months and seven months, and a third might only be observed at baseline and 2.5 months. Stata's mixed model routines handle this issue "automatically." If you want a demonstration of this, look at the examples using the nlswork data set in the help file for mixed. There are, of course, assumptions underlying all of this. These issues are covered in the Stata manuals, and you can find a nice discussion in Rabe-Hesketh and Skrondal's book on multilevel modeling in Stata. See section 5.8 of volume 1.
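A short sketch along the lines of that help-file example is given below. The model specification is written from memory of the nlswork examples and may not match the help file line for line:

```stata
// Load the example panel shipped with Stata
webuse nlswork, clear

// Describe the participation patterns: the panel is unbalanced
xtset idcode year
xtdescribe

// Random-intercept model on the unbalanced panel; no special option is needed
mixed ln_w grade age c.age#c.age ttl_exp tenure c.tenure#c.tenure || idcode:
```

The -xtdescribe- output lists the most common patterns of observed years, which makes the imbalance visible before the model is fit.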

Richard T. Campbell
Emeritus Professor of Biostatistics and Sociology
University of Illinois at Chicago
Andrea Arancio

Join Date: Jan 2015

Posts: 56
#6

27 Jan 2015, 14:01

Thanks everybody. I sent a request to change my name (I didn't know when I registered). Moreover, I use Stata 13.0, so I'll change xtmixed into mixed.

@Clyde: thanks for your answer, it was very helpful. If I want to include time (with no interactions), may I just put it in as one variable of the model? (mixed performance time sex age studyhours || StudentID: ). I tried to include time in that way and it's not significant. Time in my models may take these values: 1, 2 or 2.5.
Moreover, I also included age of participants (do I still need to include time?)

@Carlo: what I was trying to say about missing data is that I have complete data for each student (all variables are measured), but for some students I have just one or two measures of each variable (at different times), while for other students I have all three measures. If I did not misunderstand what Clyde said, that should be fine. Am I right?

@Marcos: After running the models I usually look at the significance of the parameters and of the Wald chi2. I also look at the intraclass correlation coefficient and at the variance reductions at each level. In addition, I usually calculate Cohen's f2 to estimate effect sizes.
I looked for other postestimation and testing, but it was very hard to find a good explanation. Is there anything else I should do? If so, could you advise me in that sense?

Thanks again. Having your answers was very important.
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#7

27 Jan 2015, 15:10

Well, including time simply as a single variable in the model is one possibility: it will pick up linear trends in performance over time. (Or nearly linear: since there are only three time points, this gives you a lot of latitude.) What it will miss is if there is a large spike or dip at time 2 that then recovers to baseline at time 2.5. If there is no reason to worry about that kind of non-monotone behavior, then probably I would stick with just inserting time as is. If, however, it is plausible that performance might go up (or down) temporarily at year 2 and then return to baseline, then to capture that you would need to treat time as a discrete variable. Stata's factor variables are perfect for that, the only problem here being that they only allow integer values. So you could make a new variable time2 equal to time, and then replace the 2.5 value by 3, and then use i.time2 instead of time in the model.
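The recoding described above could be sketched like this, assuming time takes exactly the values 1, 2 and 2.5 as stated earlier in the thread (time2 is the new variable suggested above):

```stata
// Copy time and map the non-integer value 2.5 to 3 so that
// factor-variable notation can be used
generate time2 = time
replace time2 = 3 if time == 2.5

// Time as a discrete (categorical) variable
mixed performance i.time2 sex age studyhours || StudentID:

// Joint test of the time indicators
testparm i.time2
```

An alternative that avoids hard-coding the values is -egen time2 = group(time)-, which numbers the distinct values of time 1, 2, 3 in sorted order.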

As for the co-occurrence of time and age, there is some loss of precision in estimation entailed by this because within students, these two variables will be very strongly correlated. Nevertheless, you will be able to separately capture the effects of age and the effects of time unless the ages at time 1 exhibit very little variation.
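A quick way to look at both points, sketched under the assumption that time == 1 marks the first wave and that StudentID is numeric:

```stata
// Spread of ages at the first wave: little variation here makes
// age and time hard to separate
summarize age if time == 1

// Between-student and within-student variation of age and time
xtset StudentID
xtsum age time
```

If the between-student standard deviation of age reported by -xtsum- is small relative to the within-student one, the age and time effects will be estimated imprecisely.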
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#8

27 Jan 2015, 22:47

Yes, it's fine since, as others pointed out, Stata handles this issue automatically. Hence, there is nothing to worry about from a mechanical point of view.
My take was intended to be similar to Clyde's about the potential bias induced by missingness, especially if it turns out to be non-ignorable.

Kind regards,
Carlo
(Stata 19.0)
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#9

28 Jan 2015, 03:52

Dear Andrea,

Regarding your comment on postestimation and mixed models,

@Marcos: After running the models I usually look at the significance of the parameters and of the Wald chi2. I also look at the intraclass correlation coefficient and at the variance reductions at each level. In addition, I usually calculate Cohen's f2 to estimate effect sizes.
I looked for other postestimation and testing, but it was very hard to find a good explanation. Is there anything else I should do? If so, could you advise me in that sense?

I suggest you think about some resources like:

• marginal predicted values, conditional margins plots,
• all kinds of predictions (xb, fitted, residuals, rstandard),
• scatter plots with predictions,
• qqplots with predictions,
• pairwise comparisons (crude and adjusted),
to name a few.

Also, you can type "help mixed postestimation" to get a broad explanation on the matter.
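As a starting point, the items listed above could be sketched as follows. The model and variable names come from earlier in the thread, sex is assumed to be coded as a non-negative integer, the new variable names (xbhat, fit, res, rstd) are hypothetical, and the studyhours values 5, 10 and 15 are arbitrary illustration values:

```stata
mixed performance i.sex c.studyhours || StudentID:

// Intraclass correlation
estat icc

// Predictions: fixed part, fixed plus random part, raw and standardized residuals
predict xbhat, xb
predict fit, fitted
predict res, residuals
predict rstd, rstandard

// Residuals against fitted values, and a normal quantile plot
scatter res fit, yline(0)
qnorm rstd

// Marginal predictions by sex and at chosen values of studyhours
margins sex
margins, at(studyhours = (5 10 15))
marginsplot

// Pairwise comparisons of the adjusted means by sex
margins sex, pwcompare
```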

Best,

Marcos

Best regards,

Marcos