
  • Looking for some advice on state-year panel data analysis

    Hi All,

    Long-time reader, first-time poster. I'm hoping folks who have worked on similar models might chime in on best practices for regression models that use fixed effects and exploit within-state changes over time to identify downstream changes in key state-level outcome variables. My dataset consists of state-level data on union density from 1983 to 2007 for all 50 states. I am trying to estimate the causal effect of a significant change in one state's labor laws (Pennsylvania's) in 1988. In other words, my treatment variable of interest takes the value 1 if the observation is for Pennsylvania in 1989-2007; all other observations take the value 0. I then include state and year fixed effects and cluster my standard errors by state.

    Questions:

    (1) What are the advantages of including a lagged dependent variable? My results are robust to the inclusion of a lagged DV, but are there downsides to this more saturated model? And, if I include a lagged DV, should I also be including state FEs?

    (2) I've heard some feedback that I should include state-specific linear time trends. Longtime poster Clyde has written a series of posts on Statalist trying to help folks with this. I confess I remain confused about how I would set this up in Stata. Right now my model is simply: regress union_density treatment_variable i.statefips i.year, cluster(statefips). Can someone explain how a state-specific linear time trend would be incorporated? And what is the advantage?

    (3) Third, and finally, what is the sentiment toward abandoning a treatment variable like the one I described above (where the dummy equals 1 for the "treated" state during the years it was "treated" and everyone else gets a 0) in favor of a nonparametric approach, where a series of year dummies pre- and post-law-change are used for the treatment state (all other states get a zero)? In other words, including dummy variables for each year relative to PA's law change would impose no structure on the pattern of time trends either pre- or post-treatment. This flexibility could help identify non-linear impacts on state union density over time, no?

    Appreciate any thoughts folks might be willing to offer.

  • #2
    #1 is a substantive question, not a statistical one, and the content here is way out of my league, so I won't comment on it. Hopefully somebody knowledgeable in that domain will.

    #2. Whether to include a linear time trend, or even state-specific linear time trends, is a modeling decision that, again, is ultimately a substantive, not a statistical, question. The time frame of the study, 25 years, is fairly long. You don't say what your outcome variables are. If they are generally subject to a linear time trend over intervals of that length (as many things are), then, yes, you should include a time trend in your model to remove that as a source of confounding. And if the real world is such that those time trends can differ by state, then state-specific linear time trends would be appropriate to include. So, being neither an expert in this domain nor a telepath, I can give you no more specific modeling advice than just enunciating those principles. I have no idea how they apply in your context.

    That said, if you do want to include a linear time trend, the syntax is simple:

    Code:
    xtset state year
    * a single linear time trend, shared by all states
    xtreg outcome i.PA_89_07 year other_covariates_as_appropriate, fe
    And if you want to include state-specific time trends, it's
    Code:
    * c.year##i.state adds a separate linear trend for each state; the i.state
    * main effects are collinear with the fixed effects and will be omitted
    xtreg outcome i.PA_89_07 c.year##i.state other_covariates_as_appropriate, fe
    #3. Again a substantive rather than a statistical question, but some general principles can be set out. The simple model relying on just the i.PA_89_07 variable stipulates that the 1988 PA law has an immediate effect on the level of the outcome that first appears in 1989 and remains constant through 2007. Otherwise put, the effect of the law is a one-time bump that persists indefinitely (well, at least until 2007). If that is a reasonable model of how the law works, then I would stick with it, because it has the virtue of simplicity. But if that does violence to reality, then you need a more flexible model that can accommodate realistic expectations. Among the possible alternatives are that the effect will be a change in the linear time trend (if there is a linear time trend in the first place), or that the law's effects will have a gradually increasing onset, reach a plateau at some point, and then perhaps decay to some extent. Effects of that kind would be poorly estimated by a model relying just on i.PA_89_07.

    The advantage of using separate indicators for PA in each year 1989 through 2007 is that this can capture any kind of effect trajectory whatsoever, although characterizing that trajectory once you have the result might prove somewhat difficult, and the temptation to see some completely arbitrary pattern of effects and spin a yarn as to why it's real rather than noise might prove overwhelming. Anyway, if you want to go this route, the syntax would be something like this:

    Code:
    * replace numeric_code_assigned_to_PA with PA's actual code in your state variable
    gen byte PA = (state == numeric_code_assigned_to_PA)
    * one PA indicator per post-law year, plus the usual year effects
    xtreg outcome i.PA##i(1989/2007).year i.year, fe
    margins PA#year
    marginsplot, xdimension(year)
    The -margins- and -marginsplot- commands should help you understand the output of the regression better.

    There are also models that are intermediate in complexity between simple i.PA_89_07 and the full semi-parametric approach just discussed. The code for implementing them depends on the specific model chosen.
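    For example, one intermediate model (just a sketch, reusing the placeholder names from the code above) would allow both a one-time level shift and a change in PA's linear trend after the law:

    Code:
    * hypothetical intermediate model: a one-time level shift (i.PA_89_07) plus
    * a post-1988 change in PA's linear trend; post_years counts years since the law
    gen byte PA = (state == numeric_code_assigned_to_PA)
    gen post_years = PA * max(year - 1988, 0)
    xtreg outcome i.PA_89_07 c.post_years i.year other_covariates_as_appropriate, fe
    The coefficient on post_years then estimates how PA's trend changed after 1988, over and above the common year effects.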

    So, ideally you should formulate a sensible hypothesis about how you expect the PA law to change the outcomes, based on your understanding of the underlying science. Then design your model to reflect that hypothesis, and then fit the model. If, however, this is just an exploratory study, then staring a bit at the -marginsplot- result and coming up with a description of it is a reasonable thing to do, provided you do not later refer to your results as having answered the research question, but merely as having suggested an answer that needs independent confirmation.

    I hope this helps. I also hope that others will offer some guidance on the substantive issues here.


    • #3
      Michael:
      welcome to this forum.
      Just an aside to Clyde's (as always) comprehensive reply: if you want to add a lagged dependent variable in a panel data regression, you should consider -xtabond-.
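      For example, a minimal sketch (variable names follow Michael's setup and are assumptions; the lag depth and instrument choices would need thought):

      Code:
      * Arellano-Bond estimator with one lag of the dependent variable
      xtset state year
      xtabond union_density i.PA_89_07 i.year, lags(1) vce(robust)
      * with T = 25 the instrument count grows quickly; see maxldep() and maxlags()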
      Kind regards,
      Carlo
      (Stata 18.0 SE)

      • #4
        Clyde,

        Thank you for your very thorough response. A few follow-up questions, if I may:

        First, let me elaborate a bit on the data itself and my “theoretical” expectations. I am trying to model public-sector labor union density; the unit of analysis is the state-year, and the outcome is continuous: the % of public employees in a state who are members of a labor union in a given year. I have reasons to believe that an exogenous “shock” (a law adopted in 1988) should have caused a spike and a permanent increase in public-sector union density in PA beginning in 1989. Obviously, I’m trying to build a model showing that the law was the causal force behind the higher density observed from 1989 through the end of the time series in Pennsylvania, relative to the other (control) states.

        So my baseline model was:

        xtset state year
        xtreg union_density i.PA_89_07 i.year other_covariates_as_appropriate, fe


        Let me start with a silly question which I’m actually quite ashamed to say I don’t know the answer to. Usually, I employ the “areg” command when including fixed unit and fixed-year effects:

        areg union_density i.PA_89_07 i.year, absorb(state) cluster(state)

        However, if I xtset my data with “xtset state year” does the xtreg suite of commands then automatically include year-dummies such that I don’t need to include i.year?

        I take your point that the choice among a single linear time trend, state-specific linear time trends, and year dummies is a substantive question that I must grapple with in light of my theoretical expectations about whether density is likely to grow linearly at the national level (for all unions in all 50 states) or whether the time trend might vary by state.

        Now, I do have a question on the commands you suggested at the bottom, in response to my 3rd question about using the semi-parametric approach and then the margins commands. I had some trouble with the code you posted (I suspect it may be because my outcome variable is continuous and -margins- is meant for dichotomous outcomes, or something else?).

        When I run:

        gen byte PA = (state == numeric_code_assigned_to_PA)
        xtreg union_density i.PA##i(1989/2007).year i.year, fe
        margins PA#year
        marginsplot, xdimension(year)


        Everything works fine up until I command Stata to do:

        margins PA#year

        At that point I get back:

        margins PA#year

        Adjusted predictions                            Number of obs  =  1,250
        Model VCE    : Conventional

        Expression   : Linear prediction, predict()

        ------------------------------------------------------------------------------
                     |            Delta-method
                     |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
             PA#year |
              0 1983 |          .  (not estimable)
              0 1984 |          .  (not estimable)
              0 1985 |          .  (not estimable)
              0 1986 |          .  (not estimable)
              0 1987 |          .  (not estimable)
              0 1988 |          .  (not estimable)
              0 1989 |          .  (not estimable)
              0 1990 |          .  (not estimable)
              0 1991 |          .  (not estimable)
              0 1992 |          .  (not estimable)
              0 1993 |          .  (not estimable)
              0 1994 |          .  (not estimable)
              0 1995 |          .  (not estimable)
              0 1996 |          .  (not estimable)
              0 1997 |          .  (not estimable)
              0 1998 |          .  (not estimable)


        Did I do something wrong here?

        Thank you,
        Michael




        • #5
          Originally posted by Carlo Lazzaro
          Michael:
          welcome to this forum.
          Just an aside to Clyde's (as always) comprehensive reply: if you want to add a lagged dependent variable in a panel data regression, you should consider -xtabond-.
          Thank you, Carlo. I will look into this suite of commands for lagged DVs.

          • #6
            Let me start with a silly question which I’m actually quite ashamed to say I don’t know the answer to. Usually, I employ the “areg” command when including fixed unit and fixed-year effects:

            areg union_density i.PA_89_07 i.year, absorb(state) cluster(state)
            -areg- and -xtreg, fe- will give you the same coefficients, but the standard errors are calculated somewhat differently. I don't really recall the details, nor do I have an in-depth understanding of it, but I gather that when the cluster variable specified in -vce()- is the same as the grouping (panel, i.e. state in your case) variable, the ones calculated by -xtreg, fe- are the correct ones.
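            As an illustration (a sketch using the variable names from this thread), the two commands below should return identical coefficients but somewhat different cluster-robust standard errors:

            Code:
            * same point estimates; the SEs differ because the two commands treat the
            * degrees of freedom used by the absorbed/fixed effects differently
            areg union_density i.PA_89_07 i.year, absorb(state) vce(cluster state)
            xtreg union_density i.PA_89_07 i.year, fe vce(cluster state)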

            However, if I xtset my data with “xtset state year” does the xtreg suite of commands then automatically include year-dummies such that I don’t need to include i.year?
            No. State fixed effects are automatically taken care of by all the -xt- commands, but time effects must be specified directly in the varlist of the command.

            Everything works fine up until I command Stata to do:

            margins PA#year

            At that point I get back:

            margins PA#year

            Adjusted predictions                            Number of obs  =  1,250
            Model VCE    : Conventional

            Expression   : Linear prediction, predict()

            ------------------------------------------------------------------------------
                         |            Delta-method
                         |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                 PA#year |
                  0 1983 |          .  (not estimable)
                  0 1984 |          .  (not estimable)
                  0 1985 |          .  (not estimable)
                  0 1986 |          .  (not estimable)
                  0 1987 |          .  (not estimable)
                  0 1988 |          .  (not estimable)
                  0 1989 |          .  (not estimable)
                  0 1990 |          .  (not estimable)
                  ...
            Did I do something wrong here?
            Sorry, my fault. When doing this with panel data, you have to specify the -noestimcheck- option in the -margins- command to avoid this. The -noestimcheck- option should not be used indiscriminately: when you unexpectedly get (not estimable) results you need, in general, to explore why, and fix the underlying problem rather than sweep it under the rug. But in this situation with a -fe- model, this result is expected and the use of -noestimcheck- is appropriate. This happens all the time, and I don't know why I always forget to mention it in the first place.
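            Concretely, the earlier sequence would become (a sketch, following the code above):

            Code:
            margins PA#year, noestimcheck
            marginsplot, xdimension(year)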



            • #7
              Looks good (I think). Thanks, Clyde.


              [Attached image: Graph2.png (the -marginsplot- output)]

              • #8
                Wow, those results look impressive! But, why is there no variation in the outcome variable prior to 1989?

                • #9
                  Originally posted by Clyde Schechter
                  Wow, those results look impressive! But, why is there no variation in the outcome variable prior to 1989?
                  Ah, yes. I made a mistake in my code. My code was:

                  xtreg density i.PA##i(1989/2007).year i.year, fe vce(cluster s)
                  margins PA#year, noestimcheck
                  marginsplot, xdimension(year)

                  But it should have been:

                  xtreg density i.PA##i(1983/2007).year i.year, fe vce(cluster s)
                  margins PA#year, noestimcheck
                  marginsplot, xdimension(year)

                  Revised graph (I think) is now correct:

                  [Attached image: Graph3.png (the revised -marginsplot- output)]


                  • #10
                    Very nice, impressive results!

                    • #11
                      Originally posted by Clyde Schechter
                      Very nice, impressive results!
                      I agree, they look great. I'm sorry to pester you with another question, but I'm curious whether you could give me a pithy explanation of what exactly these results mean. Obviously one can see that post-1988 there's a statistically significant difference between union density in PA and in the other 49 states in the control group. However, -margins- is graphing predicted probabilities, I presume. I have worked more regularly with a package called Clarify and the -prchange- command, so I'm less familiar with what -margins- is actually leaving me with. Are these simply the coefficients on the dummy variable for a given year after the treatment, or an actual prediction of the value of union density in a given year, conditional on all other predictors (other control variables) held at their mean values? Etc. Thanks for all of your help, Clyde.

                      • #12
                        Why do you think -margins- is showing predicted probabilities? Your regression command was -xtreg-, not -xtlogit-, nor -xtprobit-, so there are no probabilities to predict. Also most of the results lie outside the 0-1 range so they can't be probabilities of any kind.

                        Anyway, what -margins- is giving you here is the expected value of union density in each year in PA (red curve) and the other states (blue curve).

                        -margins- is a complicated command. I think the clearest explanation of it is in Richard Williams' excellent https://www3.nd.edu/~rwilliam/stats/Margins01.pdf. It is lucidly written and contains a number of worked examples. None of those examples is quite as complex as your model, but it does cover some simpler interaction models, and the principles are exactly the same. -margins- is also one of Stata's most useful commands for people doing this kind of modeling, so the time invested in reading this (which won't be much, because it's fairly brief) will be amply repaid.

                        • #13
                          Originally posted by Clyde Schechter
                          Why do you think -margins- is showing predicted probabilities? Your regression command was -xtreg-, not -xtlogit-, nor -xtprobit-, so there are no probabilities to predict. Also most of the results lie outside the 0-1 range so they can't be probabilities of any kind.

                          Anyway, what -margins- is giving you here is the expected value of union density in each year in PA (red curve) and the other states (blue curve).

                          -margins- is a complicated command. I think the clearest explanation of it is in Richard Williams' excellent https://www3.nd.edu/~rwilliam/stats/Margins01.pdf. It is lucidly written and contains a number of worked examples. None of those examples is quite as complex as your model, but it does cover some simpler interaction models, and the principles are exactly the same. -margins- is also one of Stata's most useful commands for people doing this kind of modeling, so the time invested in reading this (which won't be much, because it's fairly brief) will be amply repaid.
                          Thank you, Clyde. I was actually a student of Prof. Williams many years ago at Notre Dame. I wish I had taken more of his courses. Thank you for passing along this useful write-up on the -margins- command. You're certainly correct that it's a suite of commands I should invest the time in understanding. Of course, I should have been aware that xtreg wouldn't be producing predicted probabilities, given the continuous nature of the outcome variable. I confused myself, in part, because I'm working with another data set for this very same project. However, the other data set is not true panel data, in that I have individuals (whose union status I observe: 1=Yes; 0=No) nested in all 50 states, 1983-2007. The problem I've been encountering in this supplementary analysis is that I have very small N-sizes in each state-year cell: approx. 150-200 individual observations in PA for '83, 150-200 for '84, and so on.

                          Given the previous guidance you have given me, do you have any additional suggestions on how to estimate models where the outcome is dichotomous and the "treatment" I am trying to exploit remains a policy change within a state during a specific year? I worry that a probit or logit estimator with fixed year and state effects will be sensitive to the small cell sizes I mentioned above, such that "noise" from year to year may give inaccurate estimates.

                          I should add that the reason why I prefer to include this supplementary analysis (in addition to the well-balanced state-year panel data on union density you've been helping me with so far) is that I have specific occupation indicators for these individuals in this CPS dataset such that I can conduct further placebo/falsification tests to show that the change in PA labor law only impacted the probability of union membership status for public sector workers in occupations that the law specified (whereas it has no impact on workers at similar levels of pay and prestige whose occupations didn't happen to be included under the labor law change in PA).

                          Among other things, one statistical hang-up I'm having: we discussed earlier the potential attractiveness of estimating models using separate indicators for PA in each year 1989 through 2007, since that approach can capture the effect trajectory. Would you be concerned about trying it if the DV is switched--as it is in this case--to a small sample (roughly 200 per year) of individuals nested in state-years, as opposed to the nice clean measure of union density by state-year, which is aggregated and measured with less error?

                          • #14
                            I think the viability of your supplementary analysis really depends on how rare or frequent union membership is in your data.

                            Sample sizes of 150-200 per treatment group#year combination may well be sufficient if the effect is large enough (which the union density analysis seems to suggest) and if the proportion of union members in the control group is not too small. I think it would be worth your while to just try it. The way you will know if your sample is inadequate is that the standard errors of the interaction terms will be large. Even if that is true, the estimates you get may well still point in the same general direction as your union-density analysis. If you don't make a fetish out of "p < 0.05" that's still helpful. There are also a few things you might consider doing to sharpen things up a bit. Since the union membership data is sequential cross sections rather than panel data, you might consider combining years into groups of two or three consecutive years to increase the n's in each. That will give you fewer time points, but the outcome in each will be less noisy. You lose some ability to distinguish all sorts of different patterns of the effect, but at this point I think that's less important (see below).
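                            A sketch of the year-grouping idea (the outcome variable union_member and the 3-year grouping are illustrative assumptions):

                            Code:
                            * bin survey years into 3-year groups to increase cell sizes:
                            * 1983-85 -> 1983, 1986-88 -> 1986, ... (2007 ends up in its own bin)
                            gen int year3 = 1983 + 3*floor((year - 1983)/3)
                            * the PA main effect is collinear with i.state and will be omitted
                            logit union_member i.PA##i.year3 i.state, vce(cluster state)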

                            A noisy outcome variable does decrease your ability to identify and distinguish many different patterns in the trend, because the confidence intervals around a plot with this outcome will be wider and the points will jump around a bit more.

                            Here's another thought. I don't know what the union density measure is or how it relates to the union membership measure in your other data. But assuming that they track each other reasonably well, you might try compensating, in the union membership analysis, by imposing a model with fewer degrees of freedom. Looking at your results with union density, I would describe them as showing a slight upward secular trend over time in the control states. In PA we see a similar slight upward trend until 1989. Between 1989 and 1992 we see a steep upward climb, and from 1992 on we see a secular trend that rises more steeply than that of the control group but is much more gradual than what was seen in 1989-1992. If you think that the union membership response function will look somewhat like that, imposing it through a model would strengthen your analysis in the face of the poorer-quality measure. If you agree with my reading of your results, and if you agree that it is reasonable to think the union membership variable will behave similarly, look at the -mkspline- command to set up an analysis like that. You probably want to make a linear spline from the year variable, with joinpoints (knots) at 1989 and 1992, and then use the spline variables in your model, interacted with an indicator for PA.
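                            A sketch of that spline setup (again assuming the individual-level 0/1 outcome union_member):

                            Code:
                            * linear spline in year with knots at 1989 and 1992; the PA#c.yr*
                            * coefficients estimate how PA's slope differs from the other
                            * states within each segment
                            mkspline yr1 1989 yr2 1992 yr3 = year
                            logit union_member i.PA##(c.yr1 c.yr2 c.yr3) i.state, vce(cluster state)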

                            • #15
                              Clyde,

                              I can't thank you enough for ALL of the advice you've given me in this thread. It's helped me move my project forward considerably. Looking forward to being a more regular consumer (and when possible helper) on the Statalist forums.

                              Regards,
                              Michael
