  • Creating Lag Variable for Independent-Cross Sections data

    Hello everyone,

    I have a dataset spanning from the year 2004 to 2021, comprising approximately 500,000 observations on average annually. The number of observations varies from year to year, and the number of observations for each cross-section (nuts2) also changes within the year.

    I have average temperature and precipitation data within this dataset, and I would like to create their lags.

    First, I attempted to use the following code, but it only subtracted the 1st observation of each year and did not generate the lags; it retained the same value:

    sort nuts2 year
    by nuts2: gen lag1temperature = avg_temp_nuts2[_n-1]

    Then, I tried to create lag variables with the following code and received an error:

    gen lag_avg_temp = L.avg_temp_nuts2
    time variable not set r(111);

    tsset year nuts2
    repeated time values within panel r(451);

    Could you please help me resolve this? I'm new to Statalist, so I apologize for doing anything incorrectly.

  • #2
    It should go something like this:
    Code:
    xtset nuts2 year
    gen wanted = L.avg_temp_nuts2
    Your -tsset- command was incorrect: the time variable comes after the cross-section variable, not before it. But even if you had written it correctly, you would still have received the same "repeated time values within panel r(451);" message.

    So your data are not suitable for calculating lagged variables. Somewhere in the data you have multiple observations with the same combination of nuts2 and year. That makes it mathematically impossible to define a lagged variable: for an observation of a given nuts2 in a given year, which of the several observations from the preceding year should Stata use as the lag?
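    A minimal illustration of the problem, using made-up numbers (the data values below are hypothetical, not taken from the poster's data set):
    Code:
    * two observations share nuts2 = 1 and year = 2004
    clear
    input nuts2 year avg_temp_nuts2
    1 2004 10.1
    1 2004 10.3
    1 2005 11.0
    end
    xtset nuts2 year   // fails: repeated time values within panel r(451)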

    While occasionally this situation arises in circumstances where the user must abandon the idea of using lagged variables, it is overwhelmingly more probable that the problem is that your data set is wrong. To troubleshoot this, the first step is to identify the offending observations:
    Code:
    duplicates tag nuts2 year, gen(flag)
    browse if flag
    Then you have to figure out what to do with these offending observations. There are several possibilities:
    1. The observations are correct and actually need to be there. Then there must be some other variable(s) which, together with nuts2, distinguish unique observations in the data set. If that is the case, you need to create a new cross-section variable that combines all of them. See -help egen- and look for the -group()- function to do that. (If there is no such variable or set of variables, then you truly cannot use lags at all and need to make a new plan.)
    2. The surplus observations are exact duplicates of each other in all variables. While the simplest and fastest next step would be to -drop- all but one observation from each matching set, I urge you not to do that just yet. With correct data management, exact duplicates usually shouldn't be there. Where a data set has shown you it contains errors, there are often other errors lurking that you have not stumbled upon yet. It is better to search for and fix them now than to have them pop up later (especially if later is after other people have relied on your results). So review the entire chain of events from the original source files down to the current data set to see where in the data management the erroneous observations crept in. You may well find other errors in the process of doing that. Fix whatever you find, and then re-create the data set by running the fixed code.
    3. One of the observations is correct and the rest are errors. Here again, resist the temptation to simply -drop- the incorrect ones: review how the data set was created and fix the problems that led to the errors.
    4. The observations are all partially correct but need to be combined into a single observation. Depending on what we are talking about, this is commonly handled with the -collapse- command (see -help collapse- if you are not familiar with it), though sometimes something more complicated is required to reduce them.
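    For possibilities 1 and 4, hedged sketches of the commands mentioned above (othervar is a placeholder for whatever additional variable distinguishes the observations in your data):
    Code:
    * possibility 1: build a combined cross-section identifier
    egen long panelid = group(nuts2 othervar)   // othervar is hypothetical
    xtset panelid year
    gen wanted = L.avg_temp_nuts2

    * possibility 4: reduce to one observation per nuts2-year, here by averaging
    collapse (mean) avg_temp_nuts2, by(nuts2 year)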



    • #3
      Originally posted by Clyde Schechter:
      Dear Clyde,

      Thank you very much for your response. My situation falls under the first suggestion you provided. My dataset consists of two parts, which I had merged together. I intend to revert to the original state before merging and define lag variables on the panel data set. Afterward, I will proceed with the merging.
      I'm grateful for your help.
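      That workflow might be sketched as follows (the file names, the precipitation variable name, and the m:1 merge direction are illustrative assumptions, not details given in the thread):
      Code:
      * hypothetical file names; the panel part has unique nuts2-year pairs
      use panel_part, clear
      xtset nuts2 year
      gen lag_avg_temp = L.avg_temp_nuts2
      gen lag_avg_precip = L.avg_precip_nuts2   // precipitation variable name assumed
      tempfile lagged
      save `lagged'

      * merge the lagged panel back onto the other part of the data
      use other_part, clear
      merge m:1 nuts2 year using `lagged'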
