Time series models with binary outcome

Max Coleman

Join Date: Mar 2017

Posts: 24
#1

12 Mar 2017, 12:39

I'm wondering about the best way to do a time series model where the outcome is binary (depressed/not depressed), and there are three waves.
The simplest approach seems to be a random effects model: "xtlogit depressed X X2 X3, re". But some of my regressors only have data for 2 out of 3 waves, so if I include them, Stata won't use all three waves of data (I can tell because the output says the "max" number of obs per group is 2). Is there any way to fix this, i.e., to use the maximum number of available waves per regressor?

One potential option is to type: "logit L0.depressed L(0/2).X L(0/2).X2 L(0/1).X3", where X3 is the variable that doesn't have any data in wave one. But would that work? Would it produce unbiased estimates?

Another option is to observe how a change in X affects the change in depression: "logit D(0/1).depressed L(0/2).X L(0/2).X2 L(0/1).X3." But can this model account both for people who become depressed (0 --> 1) and people who become undepressed (1 --> 0)? It seems like the "logit" command wouldn't work here since the values can be either -1, 0, or 1 depending on whether the person became depressed, undepressed, or remained the same, whereas logit assumes a binary outcome.

Do any of these models help reject the possibility of reverse causality? For example, if the random effects model shows that X–X3 really are associated with depression, can I know that they lead to depression rather than depression leading to them?

Any thoughts on a better model would be greatly appreciated!
Max

Last edited by Max Coleman; 12 Mar 2017, 12:43.
Tags: categorical, panel data, regression, Time Series
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#2

12 Mar 2017, 12:58

The simplest approach seems to be a random effects model: "xtlogit depressed X X2 X3, re". But some of my regressors only have data for 2 out of 3 waves, so if I include them, Stata won't use all three waves of data (I can tell because the output says the "max" number of obs per group is 2). Is there any way to fix this, i.e., to use the maximum number of available waves per regressor?

No. If the data aren't there, they aren't there, and the observations can't be included. If you have reason to believe the missing values are missing at random, you might try multiple imputation. (Note: Stata's multiple imputation does not support -xtlogit-, but you can fit the same model using -meqrlogit-, which is compatible with multiple imputation.) Another possibility, though less likely to be appropriate, is linear interpolation/extrapolation of the missing x's, if that is reasonable for your variables. See -help ipolate-.
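As a rough sketch of the interpolation route (not an endorsement of it for these data), something like the following would extrapolate a wave 1 value of X3 from waves 2 and 3. The toy data, the variables id and wave, and the name X3_ip are made up for illustration; whether a straight-line extrapolation is defensible is a substantive question the code cannot answer.

```stata
* Hypothetical toy panel: 2 people, 3 waves, X3 not measured in wave 1
clear
input id wave X3
1 1 .
1 2 3.0
1 3 3.4
2 1 .
2 2 1.5
2 3 1.2
end

* Linear interpolation of X3 on wave, separately within each person.
* The epolate option also extrapolates outside the observed range,
* which is what fills in wave 1 here.
by id, sort: ipolate X3 wave, generate(X3_ip) epolate

list id wave X3 X3_ip, sepby(id)
```

With only two observed waves per person, the extrapolated wave 1 value is determined entirely by the wave 2 to wave 3 trend, so it carries no new information and should be treated with caution.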

One potential option is to type: "logit L0.depressed L(0/2).X L(0/2).X2 L(0/1).X3", where X3 is the variable that doesn't have any data in wave one. But would that work? Would it produce unbiased estimates?

The lag operators don't create information ex nihilo. You will just be distributing the missing values somewhat differently in the estimation. But if X is missing in wave 2, L1.X is missing in wave 3, etc. You will have the same missing data problem as you started with: it will just look somewhat different.
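To see this concretely, here is a minimal sketch with one hypothetical person whose X3 was not measured in wave 1. The data and the variable name L1_X3 are invented for illustration.

```stata
* One hypothetical person, 3 waves, X3 missing in wave 1
clear
input id wave X3
1 1 .
1 2 3.0
1 3 3.4
end

* Declare the panel structure so the lag operator knows what "previous wave" means
xtset id wave

* The first lag of X3: missing in wave 1 (no earlier wave) and
* also missing in wave 2 (because X3 itself is missing in wave 1)
generate L1_X3 = L1.X3

list id wave X3 L1_X3
```

The lag simply moves the hole from wave 1 to wave 2; it does not fill it.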

Another option is to observe how a change in X affects the change in depression: "logit D(0/1).depressed L(0/2).X L(0/2).X2 L(0/1).X3." But can this model account both for people who become depressed (0 --> 1) and people who become undepressed (1 --> 0)? It seems like the "logit" command wouldn't work here since the values can be either -1, 0, or 1 depending on whether the person became depressed, undepressed, or remained the same, whereas logit assumes a binary outcome.

As you have yourself remarked, the outcome variable would have 3 values and would not be suitable as a dependent variable for logit. It might be useful in some other model, such as -ologit- or -mlogit-, if you can find a theoretical justification for thinking about it that way.
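If you did want to go down that road, a minimal sketch on simulated data might look like the following. Everything here (the simulated X, the variable names change and lagX, the choice of -mlogit- rather than -ologit-, and the clustered standard errors) is an assumption for illustration, not a recommendation for your data.

```stata
* Simulated panel: 500 hypothetical people, 3 waves each
clear
set seed 12345
set obs 500
generate id = _n
expand 3
bysort id: generate wave = _n
generate X = rnormal()
generate byte depressed = runiform() < invlogit(-1 + 0.5*X)
xtset id wave

* Change in depression status since the previous wave:
* -1 = no longer depressed, 0 = no change, 1 = became depressed
* (missing in wave 1, where there is no previous wave)
generate change = D.depressed
tabulate change

* Lagged predictor stored as an ordinary variable
generate lagX = L.X

* Unordered three-category outcome, "no change" as the base category;
* standard errors clustered on person because each person contributes two changes
mlogit change lagX, baseoutcome(0) vce(cluster id)
```

Note that only waves 2 and 3 contribute to this model, since the difference is undefined in wave 1.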

Do any of these models help reject the possibility of reverse causality? For example, if the random effects model shows that X–X3 really are associated with depression, can I know that they lead to depression rather than depression leading to them?

No. There is no analysis that can do this. Causality is always inferred based on the study design and theoretical considerations.
Max Coleman

Join Date: Mar 2017

Posts: 24
#3

13 Mar 2017, 15:37

Just to clarify, the issue isn't "missing" data in the usual sense, but data that were systematically not collected in one wave (in this case, wave one) because the construct hadn't been invented yet. So I would indeed be able to write "logit L0.depressed L(0/2).X L(0/2).X2 L(0/1).X3" without creating information ex nihilo. My question is whether or not I can ask Stata to use the maximum number of available waves per variable when running the regression. I assume from your answer that this is not possible in a random effects model.

As for my last question: While it's true that causality cannot be definitively established, some models are more convincing than others, and that's really what I'm asking here. For example, a first differences model would help reduce the worry about reverse causality, since it could show how a change in X from wave one to wave two leads to a change in Y (depression) from wave two to wave three.
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#4

13 Mar 2017, 15:49

some models are more convincing than others, and that's really what I'm asking here.

I suppose convincing is in the eye of the beholder. I have never seen a statistical model that I found convincing with respect to causality, and to speak directly to your example, I don't see how a first-difference regression would, in the absence of other information, tell us that the change in X caused the change in Y and not the other way around, nor that both of those changes were, in fact, caused by something else, nor just spuriously associated.

Perhaps somebody else with a more open mind about causality will have something to say about this.

My question is whether or not I can ask Stata to use the maximum number of available waves per variable when running the regression. I assume from your answer that this is not possible in a random effects model.

Well, I'm not sure what you mean by this. An observation in Stata is either included or it is not. It is included if all of the variables in the model have non-missing values; otherwise it is excluded. If X3 was systematically excluded from wave 1, then any mention of X3 in the model would automatically lead to the exclusion of any observation from wave 1. Moreover, if you coded -logit L0.depressed L(0/2).X L(0/2).X2 L(0/1).X3- you would actually end up with only observations from wave 3, because for the wave 2 observations, L1.X3 would be missing. There is no way for different observations to be included with a different set of variables.
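You can check which observations would survive without fitting anything. The sketch below uses a made-up two-person panel and two hypothetical flag variables, in_re and in_lag, that mimic the inclusion rule just described for each specification.

```stata
* Hypothetical toy panel: X3 not measured in wave 1
clear
input id wave depressed X X2 X3
1 1 0 2.1 0 .
1 2 1 2.5 1 3.0
1 3 1 2.7 1 3.4
2 1 0 1.0 0 .
2 2 0 1.2 0 1.5
2 3 1 1.1 1 1.2
end
xtset id wave

* Observations usable by: xtlogit depressed X X2 X3, re
generate byte in_re = !missing(depressed, X, X2, X3)

* Observations usable by: logit depressed L(0/2).X L(0/2).X2 L(0/1).X3
* (every lag term written out, since all of them must be non-missing)
generate byte in_lag = !missing(depressed, X, L1.X, L2.X, X2, L1.X2, L2.X2, X3, L1.X3)

tabulate wave in_re
tabulate wave in_lag
```

In this toy panel the first flag keeps waves 2 and 3, and the second keeps wave 3 only, because the second lags of X and X2 do not exist before wave 3.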
Sebastian Kripfganz

Join Date: May 2014

Posts: 2593
#5

13 Mar 2017, 16:18

Maybe it is pedantic but now that I am here - misdirected from the title of the topic - let me add to all the good comments by Clyde:

You have anything but "time series" data! With many individuals each observed over three waves, this is panel (longitudinal) data.

If you get these things wrong already in the title of the topic, there is a large chance that people who might be able to advise on your topic will never get here.

https://www.kripfganz.de/stata/
