
  • How to properly read a longitudinal dataset in Stata

Dear Statalist
I am interested in analyzing a longitudinal dataset with several observations per individual. However, each individual may start on a different date and have a different number of observations (similar to an unbalanced panel). My problem is that I do not know how to set the data up in Stata for this analysis.

Let's say that I would like to run a regression of y on x1 x2 x3 x4 with fixed-effects estimation (in a panel this would be: xtreg y x1 x2 x3 x4, fe).
If the time dimension were one observation per individual per year, I would do:
Code:
xtset id1 timein
However, I am not sure whether this can also be done when time is daily. In that case, would the "fe" option still control for time-invariant individual characteristics (as in a panel dataset)? If I also wanted to control for time dummy variables, as is usually done with panel data, should I include a dummy for each day? Would that over-parameterize the regression and cost a lot of degrees of freedom?

A different complication arises if two or more observations start on the same date. In that case I would have to drop the duplicated observations in order to use xtset id1 timein. However, I might lose possibly relevant observations. My second question is whether, instead of using the daily time variable, I could use an occasion variable: I would first sort the data by id and time, and then build an occasion variable equal to 1 for the first observation within each individual, 2 for the second, and so on.
My concern is that an individual whose first observation starts on 2/2/2005 would be compared with an individual whose first observation is on 5/5/2015 (ten years later). So if I put occasion dummy variables in the regression, they would not capture the same effect as yearly dummy variables in a panel dataset (please correct me if I am wrong).

Is this (the occasion variable) a possible approach? Or should I drop the repeated observations and use the daily time variable (xtset id1 timein)?
Thanks a lot in advance for your help. (Here is an example of the dataset.)

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte y double x1 byte(x2 x3) float(timein occ id1)
    0                  . 23 46 15036  1  4
    0                  . 23 93 15051  2  4
    0                  . 23 78 15901  3  4
    0                  . 23 78 15932  4  4
    0                 21 23 93 16302  5  4
    0                 27 23 73 16406  6  4
    0                  . 23 55 16580  7  4
    0                  3 33 85 16712  8  4
    0                  . 23 93 16953  9  4
    0  4.041666666666667 23 93 17080 10  4
    0                  . 23 84 17120 11  4
    0                  . 55 84 17532 12  4
    0                 20 22 81 15168  1  6
    0                 14 22 87 15427  2  6
    0           19.03125 22 47 15538  3  6
    0                  . 22 47 15545  4  6
    0 19.038194444444446 22 47 15553  5  6
    0                 14 22 81 15812  6  6
    0                 30 23 78 15866  7  6
    0                 17 22 87 15968  8  6
    0                  . 22 87 16071  9  6
    0               25.5 22 87 16619 10  6
    0                  . 22 87 16628 11  6
    0                  . 22 87 16983 12  6
    0                  . 22 87 17018 13  6
    0                  . 22 87 17029 14  6
    0                  . 54 86 17594 15  6
    0                  . 54 86 17720 16  6
    0                  . 54 86 17994 17  6
    0                  . 54 86 18051 18  6
    0                  . 54 86 18234 19  6
    0                 20 51 47 16250  1  7
    0                  . 51 56 16628  2  7
    0                 20 51 47 16740  3  7
    0                  . 51 86 17001  4  7
    0                  6 54 86 17438  5  7
    0                 25 54 96 17440  6  7
    0                  2 54 86 17475  7  7
    1                  6 54  . 17622  8  7
    0                  . 54 86 17658  9  7
    0                  . 54 86 17843 10  7
    0                  . 54 86 17947 11  7
    0                  . 54 86 18057 12  7
    0                  . 54 86 18176 13  7
    0 23.333333333333336 54 87 18480 14  7
    0 13.583333333333334 54 87 18513 15  7
    0                  . 55 86 18546 16  7
    0                  . 54 86 18597 17  7
    0                  . 23 96 20636 18  7
    0                  . 54 84 15536  1 18
    0                  . 54 84 15567  2 18
    0                  . 54 86 15585  3 18
    1                  . 54  . 16315  4 18
    0                  6 23 85 16530  5 18
    0                 10 23 85 16559  6 18
    0                 10 23 85 16561  7 18
    0                 10 23 85 16580  8 18
    0                 10 23 85 16589  9 18
    0                  6 23 85 16600 10 18
    0                 10 23 85 16601 11 18
    0                 10 23 85 16617 12 18
    0                 10 23 85 16699 13 18
    0                 10 23 85 16713 14 18
    0                 10 23 85 16727 15 18
    0                  2 23 85 16748 16 18
    0                 10 23 85 16783 17 18
    0                  . 55 47 17841 18 18
    0                  . 55 41 18163 19 18
    0                  . 55 41 19178 20 18
    0                 30 55 85 19267 21 18
    0                 28 55 85 19617 22 18
    0                  . 23 85 20509 23 18
    0                  . 23 85 20515 24 18
    0                 20 54 78 20698 25 18
    0  32.63993055555556 55 87 20752 26 18
    0                  . 55 87 20755 27 18
    0 37.333333333333336 55 96 21118 28 18
    0                  3 33 85 21284 29 18
    0               4.25 33 85 21339 30 18
    0                 20 80 47 18079  1 21
    0                 30 51 47 18198  2 21
    1                  . 51  . 18320  3 21
    0                 16 55 47 19650  4 21
    0                  . 55 47 19932  5 21
    0                 16 51 78 21067  6 21
    0                 28 51 78 21117  7 21
    0                 20 51 46 21148  8 21
    0 18.889930555555555 59 47 21430  9 21
    0                  . 32 78 21458 10 21
    0                 10 51 78 21535 11 21
    0                 20 51 78 21598 12 21
    0                 10 51 78 21609 13 21
    0                 10 51 78 21626 14 21
    0                  . 12 41 21668 15 21
    0                 12 51 78 21703 16 21
    0                  . 32 78 21705 17 21
    0                 18 51 78 21724 18 21
    0                 20 51 78 21826 19 21
    0                 20 51 78 21878 20 21
    0                 30 23 56 19909  1 24
    end
    format %td timein

  • #2
However, I am not sure whether this can also be done when time is daily. In that case, would the "fe" option still control for time-invariant individual characteristics (as in a panel dataset)?
Yes: Stata does not care what the unit of time is; it's just numbers. It takes hints from the format of the time variable as to whether you intend it to be days or years, but that does not affect any calculation it does.

    In fact, as I look at this data, I think you should not include timein in your -xtset- command. Just -xtset id1-. The time variable in -xtset- is only relevant if you plan to use time-series operators like lag and lead, or estimate models with autoregressive structure. But because your time intervals are irregular, you can't use those things anyway. So just forget about timein and run your regression.
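In other words, something as simple as this sketch (x4 appears in your question but not in your example data, so I am assuming it exists in the full data set):
Code:
xtset id1
xtreg y x1 x2 x3 x4, fe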

A different complication arises if two or more observations start on the same date. In that case I would have to drop the duplicated observations in order to use xtset id1 timein. However, I might lose possibly relevant observations.
Well, since I think you should not include the time variable in your -xtset-, this is no longer an issue: -xtset- won't care about duplicate time values for the same id1 if no time variable is specified. But this raises another question: why would you even have multiple observations on the same date for the same id1 here? Does that make sense in the context of the real-world process that generated these data? Or does it mean that your data set has errors? That's a substantive question you need to think about.
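If you want to look at those same-date records before deciding, a quick sketch along these lines would show them (dup is a new variable name):
Code:
* flag and list records that share id1 and timein
duplicates report id1 timein
duplicates tag id1 timein, gen(dup)
list id1 timein y if dup > 0, sepby(id1)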

If I also wanted to control for time dummy variables, as is usually done with panel data, should I include a dummy for each day? Would that over-parameterize the regression and cost a lot of degrees of freedom?
    It would seriously over-parameterize your model. In fact, you could not get anything remotely useful or meaningful out of that. In just the 100 observations shown in your example, there are 98 distinct values of timein, which would mean 97 indicator ("dummy") variables! If you think there are time trends in your outcome that need to be adjusted for, you could do that with a linear timein term, or use a cubic spline for something less rigid.
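For instance, a sketch of the spline alternative (the number of knots here is an arbitrary choice, not a recommendation):
Code:
* restricted cubic spline in timein; creates t_sp1-t_sp3
mkspline t_sp = timein, cubic nknots(4)
xtreg y x1 x2 x3 t_sp*, fe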

My second question is whether, instead of using the daily time variable, I could use an occasion variable: I would first sort the data by id and time, and then build an occasion variable equal to 1 for the first observation within each individual, 2 for the second, and so on.
My concern is that an individual whose first observation starts on 2/2/2005 would be compared with an individual whose first observation is on 5/5/2015 (ten years later). So if I put occasion dummy variables in the regression, they would not capture the same effect as yearly dummy variables in a panel dataset (please correct me if I am wrong).
This is not a statistical question. It's a substantive question. Not only do you not say what the variables y, x1, x2, and x3 are, you don't even say what the general topic of the research is. This question has no generic statistical answer. The suitability of replacing times by occasions depends precisely on whether doing so is meaningful in the real-world context you are working in. Does time affect your outcome in a time-like manner, where there are particular shocks or effects at particular moments in clock time, or is it more a matter of a sequential "aging" process that each id1 will undergo in the same way, though at different starting points? From the way you pose your question, it sounds like you think the former. If so, the use of an occasion variable would make no sense and would treat unlike things as if they were like.

    Finally, all of this is interesting, but if this example is representative of your data set as a whole, you have a much bigger problem than these on your hands. Your y variable doesn't look like a typical outcome variable for -xtreg-. The only values it takes on are 0 and 1. Are you sure you're not working with a dichotomous outcome here? Something that might be better dealt with by a conditional logistic regression model perhaps? Also, you have a substantial amount of missing data on the x variables. And, as it turns out in your example data, every observation where y = 1 has a missing value for at least one of the x variables. So your actual estimation sample in the example data has y as a constant value of 0. If your real data are like that, there is nothing to regress: all the coefficients are 0, and the variance components are 0 as well. If in your full data, the value of y is always 0 or 1, and it is 0 except for a very small number of observations, then you are also going to have a lot of difficulty fitting any model to it: estimation of an outcome with a rare value is not going to fly unless your data set is truly enormous.
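If you want to check this concretely in your full data, a quick sketch (using the variable names from your example):
Code:
* how many events there are, and how many survive listwise deletion
count if y == 1
count if y == 1 & !missing(x1, x2, x3)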



    • #3
Dear Clyde Schechter, thanks a lot for your detailed answer. First, the research is about labor economics and human capital, in particular whether a university degree pays off for students in the labor market. This will be studied through several analyses, starting with whether it has a positive effect on the likelihood of obtaining a fixed (permanent) job. Why use xtreg …, fe? Well, I read that "fe" is the way to go if you want to get a causal effect when the data structure is a panel (similar to a DID analysis). Also, even if a logit model would be better for such an outcome, the linear model gives the marginal effects directly, and most of the time it is very similar to xtlogit …, fe (please correct me if I am wrong).
      And, as it turns out in your example data, every observation where y = 1 has a missing value for at least one of the x variables. So your actual estimation sample in the example data has y as a constant value of 0
I totally agree with the quote. However, the missingness in that independent variable can be fixed using another variable. What also worries me is that the number of persons whose outcome is always 1 or always 0 is huge, so I think the "fe" model will discard those individuals. I am wondering whether I could instead use an "re" model with the Mundlak approach (similar to some multilevel models) to try to get a more efficient model; see the sketch after the xtsum output below. Here is the xtsum of "y":
      Code:
      Variable         |      Mean   Std. Dev.       Min        Max |    Observations
      -----------------+--------------------------------------------+----------------
      y        overall |   .035556   .1851828          0          1 |     N =   39262
               between |             .1010234          0         .5 |     n =    2672
               within  |             .1737388   -.464444   1.030031 | T-bar = 14.6939
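Something like this sketch is what I have in mind for the Mundlak device (the mean_* variables are new):
Code:
* Mundlak device: add within-individual means of the time-varying covariates
foreach v of varlist x1 x2 x3 {
    bysort id1: egen mean_`v' = mean(`v')
}
xtreg y x1 x2 x3 mean_x1 mean_x2 mean_x3, re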
The idea is to go further and afterwards estimate a mixed model including the different studies as a third level. So, I think that using only "xtset id1", as you suggest, is not a problem for this, right?

Finally, the second part of the analysis will be the study of time to event (survival analysis). But here I am more lost. Let's say I want to study the time to the first fixed job (when y = 1), which I assume is a continuous-time survival problem. I have read that I need a failure variable, which I imagine would be "y"; that would be occasion 8 for id1 = 7. But does this mean that all observations before it are discarded? What would happen if the individual later has other fixed jobs? Also, I have read that it is more difficult when some characteristics are time-varying… As you can see, I have several questions hehehe. But I think I should start by understanding how to set the data up properly for this survival analysis.

Anyway, I really appreciate your help.



      • #4
Well, I read that "fe" is the way to go if you want to get a causal effect when the data structure is a panel (similar to a DID analysis).
        No, using fixed effects does not get you a causal effect. What it does do is enable you to remove the confounding effects of any time-invariant variables, even those that are not measured in your study. But that's only one step on the road to identifying a causal effect. The DID estimator is a bit better in that, conditional on some strong assumptions, some of which are typically not verifiable, it may give you an estimate of the actual causal effect. In any case, whatever case for claiming causality you might build for using either -xtreg, fe- or a DID model, it would apply equally well to -xtlogit, fe- or a DID model with a logistic link. The logic of causality is indifferent to whether a logistic or linear probability model is being used.

the linear model gives the marginal effects directly, and most of the time it is very similar to xtlogit …, fe
Well, that is true in a general way. But with a logistic model, you can get marginal effects by applying the -margins- command after your logistic regression, so that's no reason to choose one over the other. As for the results typically being similar, this is true when the outcome probability is, overall, in the mid-range. But when the outcome probability is close to zero (as in your case) or one, the results can be quite different, and the linear probability model may cause problems by, for example, predicting negative probabilities!
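For example, a sketch (predict(pu0) evaluates the probability with the fixed effect set to zero; variable names are from your example):
Code:
xtlogit y x1 x2 x3, fe
margins, dydx(x1) predict(pu0)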

As for the rare outcome, actually your situation is better than I had imagined. Although 3.6% with y = 1 is pretty low, your sample size is large enough to deal with that, because you will still have roughly 1,400 positive outcomes in total. And that is sufficient to support reasonable estimation, provided you don't start diving down into small subgroups.

        Why don't we cross the survival analysis bridge when we come to it. There are a few different possible approaches.



        • #5
Dear Clyde Schechter, you are totally right: "fe" does not by itself give causal effects (there might still be correlation between the explanatory variables and the time-varying part of the error term). However, I was looking for a way to set up a DID estimation for this dataset, and some scholars said the data should be equally spaced, that is, have a panel structure. Also, I read the paper you suggested in this post https://www.statalist.org/forums/for...search-setting and, if I understood it correctly, with panel data, or this type of longitudinal dataset, DID can be approached with a fixed-effects estimation. Let's say I would like to test the effect that a binary explanatory variable (a treatment; for instance, whether some students get a scholarship) has on my outcome. What DID does is compare those having the scholarship, before and after, with those not having it, before and after. The idea from the paper (correct me if I am wrong) is to obtain that effect using (xtlogit, fe), since people may receive the treatment at different times; see the sketch below.
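So, if I understood correctly, the sketch would be something like this (treat is a hypothetical indicator equal to 1 from the date a student receives the scholarship onward):
Code:
* fixed-effects logit DID: treat switches on at an individual-specific date
xtlogit y i.treat x1 x2 x3, fe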
I assume I could take a similar approach using a mixed model including careers as a third level (even though I am aware that it might then not be possible to control for every unobserved time-invariant characteristic)?

          Why don't we cross the survival analysis bridge when we come to it. There are a few different possible approaches.
Sure. Do you have any suggestions for a document that would help me get started?



          • #6
Yes, you can do a DID analysis in a mixed-effects multi-level model. But you will not get automatic adjustment for unobserved time-invariant characteristics: that's the price you have to pay for having a three-level model. If you have enough observed covariates in the model, then you may be OK, though you can never prove that that's the case.
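As a sketch of what that might look like (career is a hypothetical identifier for the degree program, and treat is your hypothetical treatment indicator; nesting runs from the highest level down):
Code:
* three-level mixed-effects logit: records within students (id1) within careers
melogit y i.treat x1 x2 x3 || career: || id1: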

            I strongly recommend you read the PDF manual section on the -stset- command to get an introduction to how survival analysis data is set up in Stata. There are different ways of doing it depending on whether you are using time-varying predictor variables or not. The PDF manual is installed with your Stata (unless you are using an ancient version). The easiest way to get to the -stset- chapter is to run -help stset- from the command window. Then when the viewer opens with the help file, click on the blue link to the PDF documentation near the top of the page.
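Just to give you the flavor of it, a minimal multiple-record sketch (first_date is a new variable; note that -stset- with -id()- requires strictly increasing times within each subject, so any duplicate dates within id1 would have to be resolved first):
Code:
* analysis time starts at each individual's first observed date
bysort id1 (timein): gen first_date = timein[1]
stset timein, id(id1) failure(y == 1) origin(time first_date)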

