Lagged variable - time series operator vs subscript

Ama Perera

Join Date: Mar 2019

Posts: 43
#1

06 Jul 2022, 02:01

Hello everyone,

I'm trying to estimate the lagged effect of my independent variable on my dependent variable. My dataset is a cross-country panel dataset with different firms and years.

Please note that the independent variable is a country-level variable which varies over time (but doesn't vary with the firm_id).

I generated the lag of my independent variable in two ways.

Method 1: using subscripts

bysort ncountry: gen ind_sup = ind[_n-1]

Method 2: using time series operator

xtset firm_id year

gen ind_time = L1.ind

Following is a summary of the dataset.

[CODE]
* Example generated by -dataex-. To install: ssc install dataex
clear
input str62 country double firm_id float year double ind float(ind_sup ind_time)
"Afghanistan" 1101290 2005 .32105717500000003 . .
"Afghanistan" 1101290 2006 .15308987500000001 .3210572 .3210572
"Afghanistan" 1101290 2007 .025807775 .1530899 .1530899
"Afghanistan" 1101290 2008 .151470875 .025807776 .025807776
"Afghanistan" 1101290 2009 .051024324999999995 .15147087 .15147087
"Afghanistan" 1101290 2010 .02869605 .05102433 .05102433
"Afghanistan" 1101290 2011 .0571796 .02869605 .02869605
"Afghanistan" 1101290 2012 .25967055 .0571796 .0571796
"Afghanistan" 1101290 2013 .5672399 .25967056 .25967056
"Afghanistan" 1101290 2014 .54212275 .5672399 .5672399
"Afghanistan" 1101290 2015 .377781225 .5421227 .5421227
"Afghanistan" 1101290 2016 .33739715000000003 .3777812 .3777812
"Afghanistan" 1101290 2017 .369859225 .3373972 .3373972
"Afghanistan" 1101290 2018 .174369825 .3698592 .3698592
"Afghanistan" 1101290 2019 .25233217500000005 .17436983 .17436983
"Afghanistan" 1101290 2020 .17920345 .25233218 .25233218
"Afghanistan" 1134770 2005 .32105717500000003 .17920345 .
"Afghanistan" 1134770 2006 .15308987500000001 .3210572 .3210572
"Afghanistan" 1134770 2007 .025807775 .1530899 .1530899
"Afghanistan" 1134770 2008 .151470875 .025807776 .025807776
"Afghanistan" 1134770 2009 .051024324999999995 .15147087 .15147087
"Afghanistan" 1134770 2010 .02869605 .05102433 .05102433
"Afghanistan" 1134770 2011 .0571796 .02869605 .02869605
"Afghanistan" 1134770 2012 .25967055 .0571796 .0571796
"Afghanistan" 1134770 2013 .5672399 .25967056 .25967056
"Afghanistan" 1134770 2014 .54212275 .5672399 .5672399
"Afghanistan" 1134770 2015 .377781225 .5421227 .5421227
"Afghanistan" 1134770 2016 .33739715000000003 .3777812 .3777812
"Afghanistan" 1134770 2017 .369859225 .3373972 .3373972
"Afghanistan" 1134770 2018 .174369825 .3698592 .3698592
"Afghanistan" 1134770 2019 .25233217500000005 .17436983 .17436983
"Afghanistan" 1134770 2020 .17920345 .25233218 .25233218
"Afghanistan" 1183990 2007 .025807775 .17920345 .
"Afghanistan" 1183990 2008 .151470875 .025807776 .025807776
"Afghanistan" 1183990 2009 .051024324999999995 .15147087 .15147087
end
[/CODE]

As the data above show, in row 33 (firm 1183990, year 2007) the lag generated with the subscript method is 0.17920345, which is the value of the independent variable in 2020 (taken from the row above), while the lag generated with the time series operator is missing. However, the correct value should be the value for year 2006 (0.15308). Thus, neither of the generated lag variables gives the correct value when the years are not in consecutive order.

Can someone please let me know the correct code to generate the correct lagged values of the independent variable?

Thank you.
Tags: lagged effect, panel data, subscripts, time series operator
Andrew Musau

Join Date: Oct 2014

Posts: 10213
#2

06 Jul 2022, 04:59

Your panel is identified by firm, so country is irrelevant here. With

xtset firm_id year

the first lag of variable V corresponding to firm X in 2007 is the value of V for firm X in 2006. Thus

Code:

g lagV= L.V

is equivalent to

Code:

bys firm (year): g lagV= V[_n-1]

if and only if the panel has no gaps in time within each firm (no holes). To ensure this, you first need

Code:

xtset firm_id year
tsfill
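To see the difference concretely, here is a minimal sketch on made-up data. The firm IDs and the values of V are hypothetical and chosen only so that firm 1 has no record for 2007; nothing here comes from the original poster's dataset.

```stata
* Toy panel (hypothetical values): firm 1 has no record for 2007
clear
input firm_id year V
1 2005 10
1 2006 20
1 2008 40
2 2005 1
2 2006 2
2 2007 3
end

xtset firm_id year

* Time-series operator: respects the calendar, so firm 1 in 2008 is missing
gen lag_ts = L.V

* Subscript: takes the previous row within the firm, whatever its year,
* so firm 1 in 2008 picks up the 2006 value
bysort firm_id (year): gen lag_sub = V[_n-1]

list, sepby(firm_id)

* tsfill adds an empty 2007 row for firm 1; after that the two agree
tsfill
gen lag_ts2 = L.V
bysort firm_id (year): gen lag_sub2 = V[_n-1]
assert lag_ts2 == lag_sub2
```

Before tsfill the two variables disagree only in the row that follows the gap; after tsfill the final assert passes because the filled-in row carries a missing V, which both methods then pick up.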

Last edited by Andrew Musau; 06 Jul 2022, 05:19.
Ama Perera

Join Date: Mar 2019

Posts: 43
#3

06 Jul 2022, 19:29

Originally posted by Andrew Musau

Your panel is identified by firm, so country is irrelevant here. With

xtset firm_id year

the first lag of variable V corresponding to firm X in 2007 is the value of V for firm X in 2006. Thus

Code:

g lagV= L.V

is equivalent to

Code:

bys firm (year): g lagV= V[_n-1]

if and only if the panel has no gaps in time within each firm (no holes). To ensure this, you first need

Code:

xtset firm_id year
tsfill

Andrew, thanks for your reply.

Variable 'V' is a country-level variable, therefore the first lag of variable V corresponding to firm X in 2007 can be the value of V not only for firm X in 2006, but also the value for any firm in 2006 located in country 'Y'.

Neither of the above commands can do that.

Thanks.

Fei Wang

Join Date: Oct 2021
Posts: 726
#4

06 Jul 2022, 20:55

Ama, you may try the code below.

Code:

bys country (year): gen ind_lag = ind[_n-1] if year-year[_n-1]==1
bys country year (ind_lag): replace ind_lag = ind_lag[1]
sort country firm_id year
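A small walk-through of why these two steps work may help. The sketch below uses made-up data (the firm IDs and the values of ind are hypothetical) in which firm 2 first appears in 2007 while the country's 2006 value is only recorded for firm 1. It assumes, as in the original question, that ind is constant within each country-year.

```stata
* Toy data (hypothetical values): firm 2 enters in 2007
clear
input str11 country firm_id year ind
"Afghanistan" 1 2005 .5
"Afghanistan" 1 2006 .25
"Afghanistan" 1 2007 .125
"Afghanistan" 2 2007 .125
"Afghanistan" 2 2008 .75
end

* Step 1: within country, ordered by year, take the previous row's value,
* but only when that row belongs to the immediately preceding year.
* Rows whose predecessor is in the same year are left missing.
bys country (year): gen ind_lag = ind[_n-1] if year-year[_n-1]==1

* Step 2: within each country-year, sort so the nonmissing lag comes first
* (missing values sort last) and copy it to every firm in that year.
bys country year (ind_lag): replace ind_lag = ind_lag[1]

sort country firm_id year
list, sepby(firm_id)
```

In this toy example firm 2 receives the country's 2006 value in 2007 even though it has no 2006 record of its own. If a whole year is absent for a country, the following year's lag stays missing.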


Ama Perera

Join Date: Mar 2019

Posts: 43
#5

06 Jul 2022, 23:38

Originally posted by Fei Wang

Ama, you may try the code below.

Code:

bys country (year): gen ind_lag = ind[_n-1] if year-year[_n-1]==1
bys country year (ind_lag): replace ind_lag = ind_lag[1]
sort country firm_id year

This worked perfectly. Thanks a lot Fei Wang
Zainab Mariam

Join Date: Jul 2022

Posts: 51
#6

11 Jul 2022, 10:33

Hello,

I am using Stata 14. The data type of my research is panel data (unbalanced); the time period is 22 years; I have annual data for 5084 firms. My model includes 10 explanatory variables (X1, X2, …, X10) where nine of these 10 explanatory variables are lagged one year, while only one explanatory variable is in the current time/present period (i.e., not lagged. It is in time t).

Could you please tell me which command I should use to include the lagged explanatory variables in my regression (where all the explanatory variables are lagged except one)?

Thank you in advance.

Andrew Musau

Join Date: Oct 2014
Posts: 10213
#7

11 Jul 2022, 12:02

See

Code:

help tsvarlist

The lag operator is "L." and can be used once your data is tsset: see

Code:

help tsset

So in the Grunfeld dataset, if I wanted to take the first lag of capital stock and time, but not market value:

Code:

webuse grunfeld, clear
tsset company year
xtreg invest L.(kstock time) mvalue, fe

Res.:

Code:

. xtreg invest L.(kstock time) mvalue, fe

Fixed-effects (within) regression               Number of obs     =        190
Group variable: company                         Number of groups  =         10

R-sq:                                           Obs per group:
     within  = 0.7169                                         min =         19
     between = 0.8140                                         avg =       19.0
     overall = 0.7882                                         max =         19

                                                F(3,177)          =     149.38
corr(u_i, Xb)  = -0.3436                        Prob > F          =     0.0000

------------------------------------------------------------------------------
      invest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      kstock |
         L1. |   .3825641   .0299213    12.79   0.000     .3235156    .4416125
             |
        time |
         L1. |  -2.248589   1.049511    -2.14   0.034    -4.319754    -.177424
             |
      mvalue |   .1246594   .0136119     9.16   0.000     .0977969     .151522
       _cons |  -63.16327   15.98739    -3.95   0.000    -94.71371   -31.61284
-------------+----------------------------------------------------------------
     sigma_u |  94.772595
     sigma_e |   57.99268
         rho |   .7275697   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(9, 177) = 41.06                     Prob > F = 0.0000

.


Zainab Mariam

Join Date: Jul 2022

Posts: 51
#8

17 Jul 2022, 07:21

Dear Professor Andrew,
Thank you for your reply.

To check my understanding, you mean that I need to type in Stata the following:

xtreg y L.(x1 x2 x3 x4 x5 x6 x7 x8 x9) x10, fe

Given y is the dependent variable and my model includes 10 explanatory variables where 9 of these 10 explanatory variables are lagged one year, while only one explanatory variable is in the current time/present period (i.e., not lagged. It is in time t). As the command includes the lag operator "L." for the nine explanatory variables (x1, x2, x3, x4, x5, x6, x7, x8, x9), Stata will regress y on the lagged values of these nine explanatory variables, but on x10 in the current time.
Is my understanding correct?

My second question is: what is the command to get the second lag of a variable i.e., to lag a variable two periods?

My third question is: is there any difference between the different commands of the lagged variable? If so, which is the best command of the lagged variable?

Your help and cooperation are highly appreciated.
Andrew Musau

Join Date: Oct 2014

Posts: 10213
#9

17 Jul 2022, 10:20

Originally posted by Zainab Mariam

To check my understanding, you mean that I need to type in Stata the following:

xtreg y L.(x1 x2 x3 x4 x5 x6 x7 x8 x9) x10, fe

Given y is the dependent variable and my model includes 10 explanatory variables where 9 of these 10 explanatory variables are lagged one year, while only one explanatory variable is in the current time/present period (i.e., not lagged. It is in time t). As the command includes the lag operator "L." for the nine explanatory variables (x1, x2, x3, x4, x5, x6, x7, x8, x9), Stata will regress y on the lagged values of these nine explanatory variables, but on x10 in the current time.
Is my understanding correct?

The assumption is that all variables are in levels, and you want to regress y on the first lag of each of x1-x9 and the level of x10. In such a case, your understanding is correct.

My second question is: what is the command to get the second lag of a variable i.e., to lag a variable two periods?

Code:

L2.var

is the 2nd lag of the variable "var". "L3.var" is the 3rd lag of "var", and so on.
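As a minimal sketch of the notation, the toy series below (hypothetical id and values) generates the first and second lags explicitly. It also uses the range form L(1/2).x, which expands to L1.x L2.x in a variable list.

```stata
* Toy gapless series (hypothetical values)
clear
input id year x
1 2001 5
1 2002 7
1 2003 9
1 2004 11
end

xtset id year

gen x_l1 = L.x      // same as L1.x
gen x_l2 = L2.x     // value of x two years earlier

list

* L(1/2).x is shorthand for L1.x L2.x wherever a varlist is accepted
summarize L(1/2).x
```

The lag operators can be written directly in the estimation command, so generating x_l1 and x_l2 as separate variables is optional.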

My third question is: is there any difference between the different commands of the lagged variable? If so, which is the best command of the lagged variable?

The only requirement with lagging is that there is a time dimension. So the data should be time-series or a panel. The estimator will depend on other considerations, not the lagging per se. Usually, lagging variables is justified by theory or common sense. A mayor of a city may want to reduce crime and goes about doing so by expanding the city's police force. However, the effect of expanding the police force on crime may not be immediate. In this case, there is a lagged effect and a model that relates these two variables may include a lagged variable for this reason.

Last edited by Andrew Musau; 17 Jul 2022, 10:25.
Zainab Mariam

Join Date: Jul 2022

Posts: 51
#10

18 Jul 2022, 10:12

Dear Professor Andrew,

Thank you for your reply.

What I meant by my third question is that I read that there are different commands of the lagged variable. For instance,

gen w1 = L1.var
gen w2 = var[_n-1]

Thus, I asked whether there are any differences between the different commands of the lagged variable, and if so, which command of the lagged variable is the best.

I do appreciate your cooperation.
Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#11

18 Jul 2022, 10:22

Yes, there is a difference between these. Or, more precisely, there can be a difference, depending on your data.

If your time series has no time gaps, then these will produce the same results. But if there are time gaps, then L1.var is always the 1 year (or whatever time unit) lag of var, or missing value if the preceding year's data does not exist in the data set, whereas var[_n-1] will be the value of var in the most recent preceding year found in the data, which could be many years earlier if several years of data are missing.

Since you might have gaps in your data that you are not aware of, it is safer to use the L1.var notation.

As an aside, things like L1.var or var[_n-1] are called expressions, not commands.
Zainab Mariam

Join Date: Jul 2022

Posts: 51
#12

18 Jul 2022, 10:38

Dear Professor Clyde,

Thank you for your swift reply and for correcting my terms.

I do appreciate your cooperation.
Ama Perera

Join Date: Mar 2019

Posts: 43
#13

26 Oct 2022, 18:49

Originally posted by Fei Wang

Ama, you may try the code below.

Code:

bys country (year): gen ind_lag = ind[_n-1] if year-year[_n-1]==1
bys country year (ind_lag): replace ind_lag = ind_lag[1]
sort country firm_id year

Hi Fei Wang

Can you please share with me how to adjust the above code to generate two year lag?

Many thanks.
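One possible way to get a two-year lag, sketched here under assumptions rather than taken from the thread, is to build a country-by-year table, apply the time series operators there, and merge the lags back onto the firm-level data. The toy data, the variable names cid, ind_lag1 and ind_lag2, and the tempfile name lags are all hypothetical; the approach assumes ind is constant within each country-year.

```stata
* Toy data (hypothetical values): firm 2 enters in 2007
clear
input str11 country firm_id year ind
"Afghanistan" 1 2005 .5
"Afghanistan" 1 2006 .25
"Afghanistan" 1 2007 .125
"Afghanistan" 2 2007 .125
"Afghanistan" 2 2008 .75
end

* Build a country-by-year table of the country-level variable
preserve
keep country year ind
duplicates drop                 // one row per country-year if ind is constant within it
isid country year               // stops with an error if that assumption fails
encode country, gen(cid)        // xtset needs a numeric panel identifier
xtset cid year
gen ind_lag1 = L1.ind
gen ind_lag2 = L2.ind           // missing if the country has no record two years back
keep country year ind_lag1 ind_lag2
tempfile lags
save `lags'
restore

* Attach the country-level lags to every firm-year
merge m:1 country year using `lags', nogenerate
sort country firm_id year
list, sepby(firm_id)
```

Because the lags are computed on the country-year table, a firm that enters late still receives the country's earlier values, and any lag length can be obtained by changing the operator (L3., L4., and so on).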