First difference with panel data with multiple observations per year

Fredrik Svanberg

Join Date: Dec 2019

Posts: 5
#1

14 Dec 2019, 07:33

Hi!

I am wondering if it is at all possible to do a first difference regression with panel data that has multiple observations per year.

The data come from a survey that about 20 000 people answer every year, and I have observations from 2011 to 2017. I would like to see how people's social trust is affected by other variables in the data set (such as the Gini coefficient of the county each respondent is from).

If I run the command "tsset IDnr year, yearly" I get this result:

panel variable: IDnr (unbalanced)
time variable: year, 2011 to 2017, but with gaps
delta: 1 year

And when I run the regression "reg d.y d.x1 x2" I get a very low number of observations (about 2 000).

What am I doing wrong? I am a total beginner at both statistics and Stata, so please excuse me if it is super obvious.

Thanks in advance!

Last edited by Fredrik Svanberg; 14 Dec 2019, 07:37. Reason: Adding tags
Tags: first difference, observations, panel data, regression, time variable
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2169
#2

14 Dec 2019, 09:53

The drawback to using FD in this context is that it only uses observations for which adjacent years have complete cases. You should use fixed effects instead, as it maximizes the number of complete-case observations used. In addition, you should include a full set of year dummies, as this will help with the missing data problem if some survey years have systematically fewer responses. And, perhaps this is a typo, but why was x2 not also differenced?

I think your best bet is

Code:

xtset IDnr year
xtreg y x1 x2 i.year, fe vce(cluster IDnr)

If you want, you can define a missing data indicator for each unit/year combo, and then put in lagged or lead values in the above estimation to see if the missing data is systematic and not explained by the unobserved effects or time effects.
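
One way the missing-data check described above might be sketched is shown here. It assumes IDnr really does identify the same person across years, and the names s and s_lead are made up for illustration. It is a rough outline under those assumptions, not a definitive test.

Code:

* Sketch only: s and s_lead are hypothetical variable names
xtset IDnr year
* add empty rows for years in which a person has no record
tsfill
* s = 1 if this unit-year row is a complete case, 0 otherwise
gen byte s = !missing(y, x1, x2)
* next year's indicator (missing in each unit's last year, so that year drops out)
gen byte s_lead = F.s
xtreg y x1 x2 s_lead i.year, fe vce(cluster IDnr)
* a clearly non-zero coefficient on s_lead would suggest systematic missingness
test s_lead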
Fredrik Svanberg

Join Date: Dec 2019

Posts: 5
#3

14 Dec 2019, 13:34

Thank you for the response!

I'm not sure I understand what you are saying regarding "complete cases". I thought Stata never used observations with missing data in regressions. How does this change when running a FD regression as I did? The respondents of the survey are randomly selected every year so there should never be an observation which has data for multiple years.

The reason for not including a "d." with x2 was because I was told to do this only for the variables of interest and not control variables. Is this wrong?

Basically, what I am trying to do with my data is to look at how social trust is affected by, for example, inequality (Gini coefficient) and the proportion of foreign-born citizens in Swedish counties. I have these values (Gini and proportion of foreign born) for every year and every county. The survey respondents also have a county variable indicating which county they are from.

I was told the next step after doing a basic multivariate regression was to do a FD regression to find out how social trust changes with changes in the values of the independent variables.

Knowing this, would you still say that fixed effects is my best option? I understand that I will not really be looking at changes with a fixed effects model, but if that is the best next step from a multivariate analysis I think I'd rather go with that anyway.

Thank you again!
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2169
#4

14 Dec 2019, 18:47

So you don't have a panel data set, which means you shouldn't be using xtset or differencing or any panel data method. This raises a puzzle: Why did you have ANY observations when you differenced? My guess is that some identifiers are reused in subsequent years (or sometimes an individual does appear more than once).

You should use straight OLS regression but probably include dummy variables for the different counties (also called county fixed effects). Your variables of interest change at the county level, and so you are probably just as well off creating a pseudo-panel by creating county averages for all of the individual-level variables. Then you'd have a county-level panel data set, and you can use fixed effects then (or differencing if you prefer).
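
As a rough sketch of the two routes described above, assuming lan is the numeric county code and using gini and foreign as hypothetical names for the county-year variables (n_resp is also a made-up name):

Code:

* Sketch only: gini, foreign and n_resp are hypothetical variable names
* Option 1: pooled OLS on individuals with county and year dummies
regress y gini foreign i.lan i.year, vce(cluster lan)

* Option 2: county-level pseudo-panel built from yearly county averages
preserve
collapse (mean) y gini foreign (count) n_resp = y, by(lan year)
xtset lan year
xtreg y gini foreign i.year, fe vce(cluster lan)
* first-difference version of the same model
regress D.y D.gini D.foreign i.year, vce(cluster lan)
restore

With only a small number of counties, cluster-robust standard errors may be unreliable, so the inference here should be read with caution.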

And you've been given bad advice about differencing. It's an estimation method. You have to start with a model before you can transform it. A situation where some variables get differenced and others don't is strange. May I recommend my introductory econometrics book so you can read up on the panel data material (if you decide to create a county-level panel data set).

JW
Nick Cox

Join Date: Mar 2014

Posts: 35711
#5

15 Dec 2019, 03:26

To expand a little on the valuable comments of Jeff Wooldridge:

Code:

reg d.y d.x1 x2

will select observations for which those variables are non-missing, so that for each individual

y is defined in the present year and in the previous

x1 ditto

and x2 is defined in the present year. You are right:

"I thought Stata never used observations with missing data in regressions."

-- and this is what is biting.

In Stata an observation is (in the jargon of other software) a row, case or record in the dataset.

Your sense of "observation" appears different: possibly a panel, meaning one or more observations on the same individual.

In Stata terms, an observation can have only one value of a year variable.
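
To see how many rows meet those conditions before running the regression, something like the following could be used. It is only a sketch, and n_years is a hypothetical variable name.

Code:

xtset IDnr year
* rows that survive: y and x1 observed this year and last year, x2 observed this year
count if !missing(D.y, D.x1, x2)
* number of years recorded for each IDnr (the table counts rows, not individuals)
bysort IDnr: gen n_years = _N
tabulate n_years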

You cross-posted at https://www.reddit.com/r/stata/comme...with_multiple/ Please note our policy on cross-posting, which is that you are asked to tell us about it. See https://www.statalist.org/forums/help#crossposting The folks at Reddit must speak for themselves, but telling them about this thread would be a good idea, both to stop people saying things already said and because people there may want to see what Jeff says.
Fredrik Svanberg

Join Date: Dec 2019

Posts: 5
#6

15 Dec 2019, 03:32

I think I actually have your book! I'll have to check! Think we might have read it as part of a course I took. Definitely feel like I have a bit to learn when it comes to these things.

Sounds like county fixed effects is what I should use then. Do I simply create a dummy for each county and then include them all in the regression, or is there more to it?

Basically this would be the code for creating a dummy for one of the counties?
gen Stockholmdummy= lan==1

Thank you so much for your help!
Fredrik Svanberg

Join Date: Dec 2019

Posts: 5
#7

15 Dec 2019, 03:42

Thanks for replying!

Sorry about not reading the cross-posting policy, I'll post a link in the reddit thread leading here at once.

It seems really strange, as Jeff noted, that I am getting any observations at all when running the FD regression. I don't think there are any matching ID numbers. Is there a quick way to check this, though?

Thanks again!

Last edited by Fredrik Svanberg; 15 Dec 2019, 04:04.
Wouter Wakker

Join Date: Nov 2018

Posts: 621
#8

15 Dec 2019, 12:13

"Basically this would be the code for creating a dummy for one of the counties?
gen Stockholmdummy= lan==1"

You can do it like that, but factor variable notation makes it easier to include dummies. Since it seems your lan variable is numeric, you can just include i.lan in the regression and dummies for each county will be included automatically. An added advantage is that this allows for postestimation with margins.
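
A minimal sketch of what that could look like, reusing the variable names y, x1 and x2 from earlier in the thread (adding i.year is an assumption here, following the earlier advice on year dummies):

Code:

* i.lan adds one dummy per county, omitting a base category
regress y x1 x2 i.lan i.year
* average predicted y for each county, with the other covariates as observed
margins lan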

"I don't think there are any matching ID numbers, is there a quick way to check this though?"

Code:

bysort id: assert _N == 1

No message means only one observation per id. An error message means more than one. If you get an error message, you can list the duplicate ids like this:

Code:

bysort id: gen n = _N
list if n > 1
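
As an alternative sketch, the built-in duplicates and isid commands give much the same information. This assumes the identifier is called IDnr, as in the first post, and dup is a hypothetical variable name.

Code:

* table of how often each IDnr value occurs
duplicates report IDnr
* dup = number of other rows sharing the same IDnr (0 if unique)
duplicates tag IDnr, generate(dup)
list IDnr year if dup > 0
* isid exits with an error unless IDnr and year together uniquely identify rows
isid IDnr year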
Fredrik Svanberg

Join Date: Dec 2019

Posts: 5
#9

16 Dec 2019, 06:44

Thank you! That worked!