Merging multiple panel data files

Max Brauer

Join Date: Dec 2020

Posts: 55
#1


05 Dec 2020, 13:10

Hello all,

I have an individual-level panel and two state-level panels that I want to merge. I have questions about (1) the nature of the panel after the merge and (2) fixed effects. The individual panel has state and time variables. State panel A contains a variety of economic statistics, along with state and time variables. State panel B contains dummy variables for regulatory changes in various states, along with state and time variables.

(1) My plan is to merge the individual panel with state panel A using the state and time variables, and then merge state panel B into the result, also using the state and time variables. Once everything is merged, do I have to do anything with xtset? Am I correct that I still have individual panel data? For each individual ID associated with a state, I would also have the (1) econ stats and (2) regulatory change variables that apply to that ID's state and year. So in long form I would effectively have:

ID  year  state  var1  var2  econ_stat1  reg_change
1   2000  AL     5     10    5.6         1
1   2001  AL     4     12    5.4         0

etc.....

Am I correct that I still have an individual panel here? Just making sure I'm not missing something.
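
For illustration only, here is a minimal sketch of the two-step merge described above. The toy values, the tempfile names, and the variable names econ_stat1 and reg_change are hypothetical stand-ins for the real files, and state is coded numerically here. Because many individuals share one state-year row, each merge is m:1. Run the whole block at once from a do-file so the tempfile macros stay defined.

Code:

* hypothetical toy version of state panel A (economic statistics)
clear
input byte state int year float econ_stat1
1 2000 5.6
1 2001 5.4
2 2000 6.1
2 2001 6.0
end
tempfile stateA
save `stateA'
* hypothetical toy version of state panel B (regulatory dummies)
clear
input byte state int year byte reg_change
1 2000 1
1 2001 0
2 2000 0
2 2001 0
end
tempfile stateB
save `stateB'
* hypothetical toy version of the individual panel
clear
input long ID int year byte state float(var1 var2)
1 2000 1 5 10
1 2001 1 4 12
2 2000 2 7 9
2 2001 2 6 11
end
* attach the state-year variables to every person-year row
merge m:1 state year using `stateA', keep(master match) nogenerate
merge m:1 state year using `stateB', keep(master match) nogenerate
* the unit of observation is still the person-year
isid ID year
xtset ID year

If -isid- passes, ID and year still uniquely identify observations, so the merged file can be -xtset- exactly as the individual panel was.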

(2) I'm using fixed effects regression. Given that I have an individual panel, would I just use xtreg with fe and vce(robust)? After the merge, I can't think of a theoretical reason to use state-level fixed effects and use "regress" with "i.ID i.year i.state" or something similar.

Any thoughts or helpful comments would be appreciated.
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#2

05 Dec 2020, 13:42

It is difficult to give a confident answer to your questions because you do not show example data. Based on your descriptions, it does sound like you will end up with an individual-level panel data set. But descriptions of data sets often omit important details. If you really want a confident answer to your question, post back showing representative examples from each data set and use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Concerning your question (2), I don't agree with you. I assume that you have -xtset- the individual level panel data set with -xtset ID year-. What you have here is three-level data: observations indexed by time, nested within IDs, nested within states. If you stay within the fixed-effects estimator framework, then you cannot get estimates for state level effects (unless individual IDs can appear in different states at different times, which I'm here assuming doesn't happen in your data.) But it is an unusual outcome that does not exhibit some degree of intra-state correlation. When you use -vce(robust)- with -xtreg, fe-, you are in fact using -vce(cluster ID)-, as the non-cluster robust variance estimator is not valid in this context, and Stata makes this substitution automatically without notifying you of it (since version 13, anyway). But this is the wrong level of clustering for this data because observations above the ID level within states are not independent. So you should use -vce(cluster state)- instead.
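
As a rough sketch of the clustering point, the following simulates a panel in which no individual changes state and then fits the same fixed-effects model with the two variance estimators. Everything here (variable names y, x, and shock, the number of states, the seed) is a made-up assumption for illustration, not your data.

Code:

clear
set seed 2020
* 20 hypothetical states, 10 individuals each, 5 biennial waves
set obs 20
gen byte state = _n
expand 10
gen long ID = _n
expand 5
bysort ID: gen int year = 1997 + 2*_n
* a shock shared by everyone in the same state and year
bysort state year: gen double shock = rnormal() if _n == 1
by state year: replace shock = shock[1]
gen double x = rnormal()
gen double y = 1 + 0.3*x + shock + rnormal()
xtset ID year
* -vce(robust)- is treated as clustering on ID here
xtreg y x i.year, fe vce(robust)
* clustering at the state level instead
xtreg y x i.year, fe vce(cluster state)

The coefficient estimates are identical in the two fits; only the standard errors change.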

That said, no matter how you slice it (no pun intended) you are fitting three-level data into the procrustean bed of a two-level model in order to hang on to fixed-effects estimation. I realize that in finance and economics random effects estimation is viewed very skeptically; but in some other disciplines, modeling three level data with a two-level model is viewed skeptically as well. Anyway, if you can bring yourself to consider random-effects estimation, using a multi-level model command (-mixed- or one of the others in the -me- suite) would accommodate this data nicely. (If the number of states instantiated in your data is tiny, then a two-level random-effects model like -xtreg, re- with i.state included would be another way to get a model that accommodates the data structure properly.)
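
A minimal sketch of what the three-level random-intercept model could look like, again on simulated data in which individuals never change state. The variable names (y, x, u_state, u_id) and the data-generating values are assumptions for illustration only.

Code:

clear
set seed 2020
* 20 hypothetical states, 10 individuals each, 5 biennial waves
set obs 20
gen byte state = _n
gen double u_state = rnormal()
expand 10
gen long ID = _n
gen double u_id = rnormal()
expand 5
bysort ID: gen int year = 1997 + 2*_n
gen double x = rnormal()
gen double y = 1 + 0.3*x + u_state + u_id + rnormal()
* random intercepts for states and for individuals nested within states
mixed y x i.year || state: || ID:

The levels are listed from highest to lowest, so state comes before ID.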
Max Brauer

Join Date: Dec 2020

Posts: 55
#3

09 Dec 2020, 13:12

Thank you very much for your thoughtful response! I'm using the PSID, but I haven't actually downloaded the data yet; I'm just trying to plan my research strategy. I will post the data shortly.

Max Brauer

Join Date: Dec 2020
Posts: 55

12 Dec 2020, 19:57

Here are some data examples.

Merged dataset:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long x11101ll int wave byte state float(rfaminc wealtheq rdti1 Gini APL)
 4003 1999 41  88278.98   26000          0 .57258236 0
 4003 2001 41  68116.65  113050          0  .5625636 0
 4003 2003 41  41903.67  124000          0 .58462703 0
 4003 2005 41  28458.33  127200          0  .6225922 0
 4003 2007 41 102377.92  150800          0  .6467445 0
 4004 1999 41  51066.96  413500   1.476323 .57258236 0
 4004 2001 41  66938.53  493000        .64  .5625636 0
 4004 2003 41  65079.81  467000          0 .58462703 0
 4004 2005 41  80615.75  424000   1.325301  .6225922 0
 4004 2007 41  315272.8 2920000  .21754894  .6467445 0
 4031 1999 41 11379.823   12700          0 .57258236 0
 4031 2001 41  50551.98   11500          0  .5625636 0
 4031 2003 41  52288.09    4000          0 .58462703 0
 4031 2005 41  49214.46    4000          0  .6225922 0
 4031 2007 41  46403.86    1509          0  .6467445 0

panel ID year (from PSID):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long x11101ll int wave byte state float(faminc wealtheq otherdebt)
 4003 1999 41  62060   26000      0
 4003 2001 41  50880  113050      0
 4003 2003 41  32516  124000      0
 4003 2005 41  23440  127200      0
 4003 2007 41  89560  150800      0
 4004 1999 41  35900  413500      0
 4004 2001 41  50000  493000      0
 4004 2003 41  50500  467000      0
 4004 2005 41  66400  424000      0
 4004 2007 41 275800 2920000      0
 4031 1999 41   8000   12700      0
 4031 2001 41  37760   11500      0
 4031 2003 41  40574    4000      0
 4031 2005 41  40536    4000      0
 4031 2007 41  40594    1509      0

state year panel:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte state int wave float(APL Gini)
 1 1999 0 .56355053
 1 2001 0 .55696815
 1 2003 0 .58149034
 1 2005 0  .6290928
 1 2007 0  .6494615
 2 1999 0 .57444656
 2 2001 0 .55873424
 2 2003 0  .5645777
 2 2005 0  .5907452
 2 2007 0  .6024898
 3 1999 0 .56179285
 3 2001 0 .55758613
 3 2003 1 .59268767
 3 2005 1  .6430813
 3 2007 1  .6684059

To follow up, some IDs appear in different states in different years, so xtreg, fe vce(cluster state) generates a "panels are not nested within clusters" response. I have not used -mixed- before, but just scanned over -help mixed-. What might that code look like in my situation? What about using -xtreg, nonest- or reghdfe? For example:

Code:

reghdfe var1 var2, absorb(id year) vce(cluster state)

To throw another log on the fire, I also want to use PSID sample weights, but have yet to figure out how they would "mix" with my modeling choice.


Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#5

12 Dec 2020, 21:29

Since the same id can appear in different states in different years you have an even more complicated situation: you have a multiple membership model, which requires the use of crossed random effects. The code would look something like this:

Code:

mixed outcome_var predictor_vars || _all: R.state || x11101ll:

I think that -reghdfe- will have the same complaint about panels not being nested in clusters that -xtreg- gave you. And if it doesn't, that's a bug, not a feature. Your data structure is complex, and it is not going to easily work with any simple model.

I don't know what PSID stands for; even if I did, it's probably not a survey I'm familiar with. On the assumption that the weights you are referring to are inverse probability of sampling weights, you can just use them by adding [pweight = name_of_weight_variable] to the fixed effects part of the -mixed- command. That said, be cautious. Large-scale surveys usually have a complicated design involving stratification, and one or more levels of random sampling units. While using just the pweights will allow you to obtain unbiased estimates of the regression coefficients, if you do not also account for the other aspects of the design, your standard errors (and, consequently, also your z-statistics, confidence intervals, and p-values) will be wrong. To handle all of that, you need to use the -svyset- command to tell Stata about all those things, and then use the -svy:- prefix with your regression command. This makes things a bit more complicated because -mixed- does not support the -svy:- prefix. So instead of -mixed- you will have to use the equivalent -meglm- with the -link(id) family(gaussian)- options. (That's exactly the same model as -mixed-; I have no idea why one supports -svy:- and the other does not.) However, the proper use of -svyset- with a multilevel model is fairly complicated, with restrictions about how the weights vary within and across levels. As I seldom deal with complex survey designs in my workflow, I am not well-positioned to advise on the details of that.

I hope that somebody who is more comfortable with that will pick up the thread.

Max Brauer

Join Date: Dec 2020
Posts: 55

13 Dec 2020, 14:44

reghdfe works. Example:

Code:

reghdfe ihsmdti1 rfaminc rwealtheq deregpost2004 Gini nfamunit agehead sexhead MShead Emplyd health edu
> , absorb( x11101ll wave) vce(cluster state)
(MWFE estimator converged in 2 iterations)

HDFE Linear regression                            Number of obs   =     15,805
Absorbing 2 HDFE groups                           F(  11,     50) =      13.75
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.5867
                                                  Adj R-squared   =     0.4828
                                                  Within R-sq.    =     0.0284
Number of clusters (state)   =         51         Root MSE        =     0.5379

                                 (Std. Err. adjusted for 51 clusters in state)
-------------------------------------------------------------------------------
              |               Robust
     ihsmdti1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
      rfaminc |  -7.48e-07   2.73e-07    -2.74   0.009    -1.30e-06   -1.99e-07
    rwealtheq |  -3.31e-09   7.64e-09    -0.43   0.667    -1.87e-08    1.20e-08
deregpost2004 |  -.0016603   .0276877    -0.06   0.952    -.0572727     .053952
         Gini |  -1.243044   .3460432    -3.59   0.001    -1.938092    -.547996
     nfamunit |   .0447431   .0078767     5.68   0.000     .0289222    .0605639
      agehead |   .0035338   .0051983     0.68   0.500    -.0069074    .0139749
      sexhead |   .0786515   .0556093     1.41   0.163    -.0330431    .1903461
       MShead |   .2083072   .0282865     7.36   0.000     .1514921    .2651224
       Emplyd |   .0052707   .0222711     0.24   0.814    -.0394621    .0500034
       health |  -.0035387   .0066171    -0.53   0.595    -.0168295    .0097521
          edu |   .0199724   .0209914     0.95   0.346      -.02219    .0621348
        _cons |   .6810665   .3406203     2.00   0.051    -.0030894    1.365222
-------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
    x11101ll |      3161           0        3161     |
        wave |         5           1           4     |
-----------------------------------------------------+

Why do you say this is a bug? My question is loaded more with statistical theory than Stata code.

Second question: if few individuals move across states, and I just drop these subjects from the panel, my understanding is that reghdfe is designed for multilevel fixed effects, so I could choose to run the above regression without bugs? At that point, the decision between -reghdfe- and -mixed- is informed by the classic debate of fixed vs. random effects?

Last edited by Max Brauer; 13 Dec 2020, 14:54.


Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#7

13 Dec 2020, 15:54

Well, to the extent that I remember and understand the calculation of the clustered variance estimator with panel data, it requires that the clusters consist of groups of panels, which, in your data, is not the case. That said, it's been a very long time since I learned that and it is not a subject I have revisited. Sergio Correia, however, is quite knowledgeable, and if his program does not enforce that requirement I'd be inclined to think that it's OK to do that. Perhaps there is an alternative calculation that does not rely on that assumption. But I'm really not certain either way.

If there are very few individuals who move across states during the study observation period, then dropping them would resolve the problem with vce(cluster state). But don't do this unless we're really talking about only a handful of observations: it may be that people who move are different with respect to things that are relevant to your analysis. (One of your variables looks like it's an employment indicator: that could be strongly related to interstate mobility. If it's also related to your outcome variable, then tampering with the data in this way could be a problem. I also worry about health being related to interstate mobility and perhaps the outcome. And I'm not exonerating the other variables either; it's just that I can't always figure out from their names what they might represent.)
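
One way to look into that before dropping anyone is to compare movers with stayers on the variables of concern. The sketch below uses made-up toy values, and treating Emplyd as a 0/1 employment indicator is an assumption based only on its name.

Code:

* hypothetical toy data: persons 2 and 4 change state, persons 1 and 3 do not
clear
input long x11101ll int wave byte(state Emplyd)
1 1999 41 1
1 2001 41 1
2 1999 41 1
2 2001 12 0
3 1999  5 0
3 2001  5 1
4 1999  5 1
4 2001  7 1
end
by x11101ll (state), sort: gen byte mover = (state[1] != state[_N])
* share of person-waves employed, movers versus stayers
tabstat Emplyd, by(mover) statistics(mean n)
* only if the groups look similar, consider restricting the sample:
* drop if mover

The same comparison can be repeated for health and any other variable plausibly related to both mobility and the outcome.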

At that point, the decision between -reghdfe- and -mixed- is informed by the classic debate of fixed vs. random effects?

Actually, no. The issue here is the hierarchical structure of the data. -reghdfe-, like -xtreg, fe-, is a two-level model. Your data, however, has three levels, with observations nested in individuals, who, in turn, are nested in states (after you remove people who move). There are no three-level fixed-effects models. Now, in some disciplines, there is a deep aversion to random effects models, so much so that a mis-specified model using fixed effects is preferred to a correctly specified model using random effects. In other disciplines, the reverse is true. It is a debate that touches on the usual -fe- vs -re- debate, but it is different.
Max Brauer

Join Date: Dec 2020

Posts: 55
#8

13 Dec 2020, 19:13

I appreciate all your thinking on this matter. It is always insightful.

As I explore my options, what would be some easy code to detect IDs who crossed state lines? I was thinking of something along the lines of bysort ID: egen modevar=mode(state), but that is imprecise, and perhaps there is a better way to identify IDs associated with multiple states over time (I have 5 waves that cover a 10 year period).
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#9

13 Dec 2020, 20:05

Code:

by ID (state), sort: gen byte mover = (state[1] != state[_N])
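
A small usage sketch on made-up toy data shows what the flag picks up. Within each ID the rows are sorted by state, so the first and last values differ whenever more than one state occurs, including for someone who leaves and later returns (ID 2 below). The variable name first is hypothetical.

Code:

* hypothetical toy data: ID 1 never moves, ID 2 leaves state 41 and comes back
clear
input long ID int year byte state
1 1999 41
1 2001 41
1 2003 41
2 1999 41
2 2001 12
2 2003 41
end
by ID (state), sort: gen byte mover = (state[1] != state[_N])
* mark one row per person, to count people rather than person-years
egen byte first = tag(ID)
tabulate mover if first
* restore chronological order within person
sort ID year

Note that the command leaves the data sorted by state within ID, hence the final -sort-.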