
  • Panel data analysis

    Hello everyone, I am working on monthly panel data of 36 banks from 2011-2019. My data look something like this:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte bankid float months long(month total_assets) float(NIRP assets_lagged Loan_ratio RD_ratio EL_ratio)
    6 21762 201908 18967 1 18751 .7217748 .8713028  .002530711
    6 21793 201909 19006 1 18967 .7266304 .8660949  .002420288
    6 21823 201910 19241 1 19006 .7492897 .8650798 .0022348112
    6 21854 201911 19404 1 19241 .7601995 .8693053 .0023191094
    6 21884 201912 19254 1 19404 .7626263 .8672484  .003272047
    7 18628 201101 12242 0 19254 .4247429 .9005882           0
    7 18659 201102 12145 0 12242 .6738278 .9258131           0
    7 18687 201103 12206 0 12145 .6738576 .9454367           0
    7 18718 201104 12323 0 12206   .67262 .9424653           0
    7 18748 201105 12341 0 12323 .6674511 .9431164           0
    7 18779 201106 12409 0 12341 .6614537 .9466516           0
    end
    format %d months

    1. I want to draw a line graph of some of my variables over time (e.g. Loan_ratio): one line showing the average value of Loan_ratio for all the banks that have RD_ratio (retail deposits to assets) above the median, and another line for banks with RD_ratio below the median. I assume I have to use the -preserve- and -collapse- commands, but I don't know how to restrict this to those specific bank subgroups.

    2. I want to run a regression only on the subgroup of banks with RD_ratio > median, and perhaps on a subperiod from 2011-2017. Is this the correct way to do that:
    Code:
    egen median = median(RD_ratio)
    xtreg Loan_ratio Loan_ratio_lagged c.EL_ratio_lagged#NIRP i.month if RD_ratio > median & month > (date), fe robust
    Here Loan_ratio_lagged is the lagged value of the variable, and NIRP is a dummy equal to 0 for the period before 2015 and 1 afterwards. Thank you in advance.

  • #2
    for all the banks that have RD_ratio (retail deposits to assets) above the median and another line graph for banks with RD_ratio below the median
    What does this mean? A bank could have above median RD_ratio one year and below median RD_ratio another year. And that is true whether you use a single median for all banks in all years, or whether you do a separate median for each year. Please clarify both of these aspects of your question.



    • #3
      Originally posted by Clyde Schechter
      What does this mean? A bank could have above median RD_ratio one year and below median RD_ratio another year. And that is true whether you use a single median for all banks in all years, or whether you do a separate median for each year. Please clarify both of these aspects of your question.
      I would prefer the criterion to be that a bank belongs to the high RD_ratio group if it has RD_ratio > median in that specific month. And as for the second point you raised, I would prefer the median to be the median across all banks in that specific month.

      However, could you also show me how to do this if I want the criterion to be RD_ratio > median, where the median is calculated over all banks in all months?



      • #4
        Code:
        by months, sort: egen median_rd_ratio_this_month = median(RD_ratio)
        gen byte high_RD_ratio = RD_ratio > median_rd_ratio_this_month if !missing(RD_ratio)
        will give you the median RD ratio for all banks separately in each month and an indicator variable distinguishing those above and below the median.

        Code:
        preserve
        collapse (mean) Loan_ratio, by(high_RD_ratio months)
        xtset high_RD_ratio months
        xtline Loan_ratio
        restore
        will give you graphs of Loan_ratio over time in the two groups. If you want the two curves on the same panel, rather than one panel for each group, then add the -overlay- option to the -xtline- command.
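
        That is, the -xtline- line in the block above becomes (a sketch):
        Code:
        xtline Loan_ratio, overlay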

        show me how to do this if I want the criterion to be RD_ratio > median, where the median is calculated over all banks in all months?
        Code:
        //  OTHER CRITERION: MEDIAN FOR ALL BANKS IN ALL MONTHS
        summ RD_ratio, detail
        gen byte high_RD_ratio_all_months = RD_ratio > `r(p50)' if !missing(RD_ratio)
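
        A slightly safer variant (a sketch) captures the percentile in a local right away, since r(p50) would be overwritten by the next r-class command you run:
        Code:
        summ RD_ratio, detail
        local p50 = r(p50)
        gen byte high_RD_ratio_all_months = RD_ratio > `p50' if !missing(RD_ratio)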



        • #5
          Dear Clyde Schechter thank you very much for your help! Regarding the second part of my question, in order to run the regression, do I have to write the code like this?
          For the first case and for a subperiod from 2011-2017:
          Code:
           
           xtreg Loan_ratio Loan_ratio_lagged c.EL_ratio_lagged#NIRP i.month if RD_ratio > high_RD_ratio & months > (e.g 21762), fe robust
          and for the second case:

          Code:
           
           xtreg Loan_ratio Loan_ratio_lagged c.EL_ratio_lagged#NIRP i.month if RD_ratio > high_RD_ratio_all_months & months > (e.g 21762), fe robust



          • #6
            No, that's not quite right. high_RD_ratio (and the all-months version) are not the median values: they are 0/1 indicator variables for whether the bank is above the median value or not.

            Code:
            xtreg Loan_ratio Loan_ratio_lagged c.EL_ratio_lagged#NIRP i.months if high_RD_ratio == 1  & months > td(01aug2019), fe robust
            This is the correct use of the variable high_RD_ratio. Also, while it is legal to use -months > 21762-, it is poor programming style to have numbers with no obvious meaning flying around in code. When you come to look at this code in the future, you won't know where that number came from and whether you had it correct or not. Since that number is intended to be the numeric date value for 1 Aug 2019, use the -td()- notation instead.
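
            You can verify what such a bare number encodes directly from the command line; for example:
            Code:
            display td(01aug2019)   // 21762, the daily date value behind the magic number
            display %td 21762       // 01aug2019, decoding in the other direction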

            There are some other potential improvements to this code.
            1. If, as the name implies, EL_ratio_lagged is the lagged value of EL_ratio, then eliminate that variable and just use Stata's lag operator with the original.
            2. The variable month in your data is both useless and dangerous. It is useless because it cannot be used for any calculations--it is for human eyes only. And it is dangerous because it is too easy to make an error and refer to the variable month where the usable variable months is intended. So drop the variable -month- before you make that mistake.
            3. You are using a daily date variable to encode a date that is actually just a pointer to a month. So better to convert it to a monthly date variable:
            Code:
            gen mdate = mofd(months)
            format mdate %tm
            drop month months
            xtset bankid mdate
            xtreg Loan_ratio L1.EL_ratio  c.L1.EL_ratio#NIRP i.mdate if high_RD_ratio == 1  & mdate> tm(2019m8), fe robust
            Finally, if the variable NIRP is a time-invariant property of bankid, then this code is fine. But if NIRP can vary over time within a bankid, then it is an ill-formed model and needs a separate term for i.NIRP by itself. Actually, without having to worry about that, you can do this in either case:
            Code:
            xtreg Loan_ratio  c.L1.EL_ratio##NIRP i.mdate if high_RD_ratio == 1  & mdate> tm(2019m8), fe robust
            This version will automatically generate a separate term for NIRP if it does vary within bankids, and will omit it if it is time-invariant--so you can't go wrong. (It also automatically generates the separate term for L1.EL_ratio.)
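
            To make the expansion concrete: the ## term is shorthand for both main effects plus the interaction, so (as a sketch) these two specifications fit the same model.
            Code:
            * equivalent: ## expands to the main effects plus the interaction
            xtreg Loan_ratio c.L1.EL_ratio##NIRP i.mdate if high_RD_ratio == 1 & mdate > tm(2019m8), fe robust
            xtreg Loan_ratio c.L1.EL_ratio i.NIRP c.L1.EL_ratio#i.NIRP i.mdate if high_RD_ratio == 1 & mdate > tm(2019m8), fe robust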





            • #7
              Thank you for the things you have pointed out. However, I have some questions about two of the comments. Regarding the lagged variables, I have used
              Code:
              gen Loan_ratio_lagged = Loan_ratio[_n-1]
              gen RD_ratio_lagged = RD_ratio[_n-1]
              gen EL_ratio_lagged = EL_ratio[_n-1]
              in order to generate them. Is this not the best way to do it? Is the lag operator this one:
              Code:
              generate lag_EL_ratio = L.EL_ratio
              Should I use this code for all lagged variables that I want to create? And when there are missing values, is the lag operator preferable to the other method that I have used?

              Also, regarding the NIRP variable: it is a dummy marking the introduction of negative rates after 02/2015, so it is 0 before 02/2015 and 1 afterwards for all banks. In this first baseline model I want to estimate how banks with excess liquidity have managed their lending to consumers after the implementation of negative rates. The model that I am trying to follow is this:
              [Attached image: reg_model.PNG, showing the regression model equation]



              • #8
                Also, sorry for not stating this earlier when you answered all my previous questions, but if I want to adapt this part

                Code:
                // OTHER CRITERION: MEDIAN FOR ALL BANKS IN ALL MONTHS
                summ RD_ratio, detail
                gen byte high_RD_ratio_all_months = RD_ratio > `r(p50)' if !missing(RD_ratio)
                to 3 groups, one < p33, one between p33 and p66, and one > p66, what would the code look like?



                • #9
                  Code:
                  gen Loan_ratio_lagged = Loan_ratio[_n-1]
                  gen RD_ratio_lagged = RD_ratio[_n-1]
                  gen EL_ratio_lagged = EL_ratio[_n-1]
                   in order to generate them. Is this not the best way to do it? Is the lag operator this one:
                  It's a terrible way to do it, for two reasons. First, if your data contain gaps, X[_n-1] will refer to the previous value that happens to appear in the data set, even though it may be many periods earlier. Second, even if your data is perfectly balanced with no gaps, in every case this will calculate a "lagged variable" for the first observation of each bankid that is, in reality, the final observation of the preceding bank!
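
                   A small demonstration of both failure modes, as a sketch using the variable names from this thread:
                   Code:
                   sort bankid mdate
                   gen wrong_lag = EL_ratio[_n-1]   // after the first bank, row 1 of each bank silently inherits the previous bank's last value
                   by bankid (mdate), sort: gen still_wrong = EL_ratio[_n-1]   // stays within bank, but treats the previous row as the previous month even across calendar gaps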

                  Code:
                  generate lag_EL_ratio = L.EL_ratio

                   Should I use this code for all lagged variables that I want to create? And when there are missing values, is the lag operator preferable to the other method that I have used?
                   Yes. Whenever you want a lagged variable, use Stata's lag operator: it always gives you correct results. I should also point out that for most purposes there is no need to create a separate variable for the lag anyway. For example, when you are using lagged variables in a regression, just use L.EL_ratio directly in the regression command. You don't need a separate variable that contains the lagged values for this purpose. You might need a separate variable for some other purposes, but I would always try using the lag operator first, and only resort to creating a separate variable if the command rejects the use of time-series operators when you try it.
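
                   For instance, assuming the panel has already been -xtset- (a sketch):
                   Code:
                   xtset bankid mdate
                   gen EL_ratio_lag = L1.EL_ratio   // missing at each bank's first month and across gaps, as it should be
                   * or skip the new variable entirely and use the operator in the estimation command:
                   xtreg Loan_ratio L1.Loan_ratio c.L1.EL_ratio##NIRP i.mdate, fe robust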

                   Also, regarding the NIRP variable: it is a dummy marking the introduction of negative rates after 02/2015, so it is 0 before 02/2015 and 1 afterwards for all banks. In this first baseline model I want to estimate how banks with excess liquidity have managed their lending to consumers after the implementation of negative rates. The model that I am trying to follow is this:...
                   Since this variable does change over time within banks (specifically, it changes after 2/2015), it needs to be included in the model in its own right, not just as a constituent of an interaction. So you need to use the ## notation I showed in order to have a properly specified model. As for the model equation you show, I don't follow it at all. On the one hand, NIR appears as an exponent, so this looks like you need a non-linear model. Moreover, parameterizing the model with separate terms involving 1-DNIR and DNIR does not make any sense to me. It also involves a number of variables whose names I do not recognize from your code and data. So I'm not sure what to tell you about this.

                   ...if I want to adapt this part

                  Code:
                   // OTHER CRITERION: MEDIAN FOR ALL BANKS IN ALL MONTHS
                   summ RD_ratio, detail
                   gen byte high_RD_ratio_all_months = RD_ratio > `r(p50)' if !missing(RD_ratio)

                   to 3 groups, one < p33, one between p33 and p66, and one > p66, what would the code look like?
                  Well, -summ- does not compute 33rd and 66th percentiles. But the -centile- command does. So:
                  Code:
                  centile RD_ratio, centiles(33 66)
                  local p33 `r(c_1)'
                  local p66 `r(c_2)'
                  gen byte RD_ratio_group_all_months = 0 if RD_ratio < `p33'
                  replace RD_ratio_group_all_months = 1 if inrange(RD_ratio, `p33', `p66')
                  replace RD_ratio_group_all_months = 2 if RD_ratio > `p66' & !missing(RD_ratio)
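
                   A quick sanity check that the three resulting groups are roughly equal-sized (a sketch):
                   Code:
                   tabulate RD_ratio_group_all_months, missing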



                  • #10
                    Dear Clyde Schechter thank you very much for your time and help, and for the extra things you have pointed out to me! They have all been really helpful.

                    As for the model, DNIR is just the indicator dummy. Of the other variables that they use, I don't have data for two of them. However, 1-DNIR and DNIR also troubled me, as I believe one of the terms will be removed automatically due to multicollinearity.



                    • #11
                      Dear Clyde Schechter I have some more questions that came up. I have gotten these results from one of my regressions:

                      Code:
                      . xtreg total_lending c.L1.EL##NIRP##c.L1.Households_dep c.L1.EL#NIRP#c.L1.Households_dep L1.Deposits_Riksbank L1.Debt_sec_ra
                      > tio i.mdate if high_HH_dep_all_months == 1 & mdate< tm(2018m12), fe vce(robust) /* full regression for subgroup 1, and peri
                      > od 2011-2018 */
                      note: 706.mdate omitted because of collinearity
                      
                      Fixed-effects (within) regression               Number of obs     =      1,340
                      Group variable: bankid                          Number of groups  =         20
                      
                      R-sq:                                           Obs per group:
                           within  = 0.3352                                         min =          1
                           between = 0.2820                                         avg =       67.0
                           overall = 0.2604                                         max =         94
                      
                                                                      F(18,19)          =          .
                      corr(u_i, Xb)  = 0.2317                         Prob > F          =          .
                      
                                                                      (Std. Err. adjusted for 20 clusters in bankid)
                      ----------------------------------------------------------------------------------------------
                                                   |               Robust
                                     total_lending |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                      -----------------------------+----------------------------------------------------------------
                                                EL |
                                               L1. |  -2.350226   1.352385    -1.74   0.098    -5.180802    .4803489
                                                   |
                                            1.NIRP |  -3083.538   29063.59    -0.11   0.917    -63914.32    57747.25
                                                   |
                                        NIRP#cL.EL |
                                                1  |   2.022824   1.549003     1.31   0.207    -1.219276    5.264924
                                                   |
                                    Households_dep |
                                               L1. |   .3024698   .8856131     0.34   0.736     -1.55114    2.156079
                                                   |
                           cL.EL#cL.Households_dep |   .0000178   8.43e-06     2.11   0.048     1.63e-07    .0000355
                                                   |
                            NIRP#cL.Households_dep |
                                                1  |   .3856121   .2049208     1.88   0.075    -.0432922    .8145164
                                                   |
                      NIRP#cL.EL#cL.Households_dep |
                                                1  |  -.0000182   .0000104    -1.75   0.096    -.0000398    3.53e-06
                                                   |
                                 Deposits_Riksbank |
                                               L1. |  -8.233764    8.90753    -0.92   0.367    -26.87744    10.40991
                                                   |
                                    Debt_sec_ratio |
                                               L1. |    74953.3   71783.13     1.04   0.310    -75290.52    225197.1
                      Firstly, if I use the lagged value of the dependent variable as a regressor, all my R^2 values reach 0.9+. Should I still use it?
                      Also, my F-test has no values. I read that if all my standard errors appear normal then there may be nothing to worry about, but I wanted to double-check.
                      Also, in some other regressions that I run, whenever I have an interaction term with my dummy NIRP, coefficients appear for both values 0 and 1 of NIRP; however, in these results I don't get two coefficients for the interaction terms with NIRP. Here is an example:

                      Code:
                      . xtreg Loan_ratio Loan_ratio_lagged c.EL_ratio_lagged#NIRP c.EL_ratio_lagged#NIRP#c.RD_ratio_lagged NIRP#c.RD_ratio_lagged i
                      > .bankid i.mdate, fe robust
                      
                      Fixed-effects (within) regression               Number of obs     =      2,770
                      Group variable: bankid                          Number of groups  =         33
                      
                      R-sq:                                           Obs per group:
                           within  = 0.5201                                         min =         11
                           between = 0.9951                                         avg =       83.9
                           overall = 0.9557                                         max =        106
                      
                                                                      F(32,32)          =          .
                      corr(u_i, Xb)  = 0.9421                         Prob > F          =          .
                      
                                                                                  (Std. Err. adjusted for 33 clusters in bankid)
                      ----------------------------------------------------------------------------------------------------------
                                                               |               Robust
                                                    Loan_ratio |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                      -----------------------------------------+----------------------------------------------------------------
                                             Loan_ratio_lagged |    .704402   .0888594     7.93   0.000     .5234013    .8854027
                                                               |
                                        NIRP#c.EL_ratio_lagged |
                                                            0  |  -.1582787   .2575225    -0.61   0.543    -.6828349    .3662774
                                                            1  |   -.038262   .2221044    -0.17   0.864    -.4906739    .4141499
                                                               |
                      NIRP#c.EL_ratio_lagged#c.RD_ratio_lagged |
                                                            0  |   .3007768   .4791904     0.63   0.535    -.6753021    1.276856
                                                            1  |   .0668993   .5672858     0.12   0.907    -1.088624    1.222423
                                                               |
                                        NIRP#c.RD_ratio_lagged |
                                                            0  |  -.0273265   .0360759    -0.76   0.454    -.1008108    .0461577
                                                            1  |  -.0449217    .034182    -1.31   0.198    -.1145482    .0247049
                      Finally, in the first results that I posted I get some strange values for the coefficients and standard errors, like 8.43e-06. Can this be interpreted, or have I done something completely wrong that I need to change?



                      • #12
                        Firstly, if I use the lagged value of the dependent variable as a regressor, all my R^2 values reach 0.9+. Should I still use it?
                        So, the dependent variable is strongly predicted by its lagged variable--that's not unusual or surprising. However, it raises questions such as whether -xtreg- is a suitable estimator for this model. You might need a dynamic panel-data model for this. This kind of situation is not common in my discipline (epidemiology) and I don't know much about it.
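
                        If you do explore that route, one built-in starting point is -xtabond- (Arellano-Bond). A minimal sketch only, with no claim that it is the right specification for your setting:
                        Code:
                        * a sketch -- assumes NIRP and L1.EL are appropriate regressors here
                        xtabond total_lending L1.EL NIRP, lags(1) vce(robust)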

                        Also, my F-test has no values. I read that if all my standard errors appear normal then there may be nothing to worry about, but I wanted to double-check.
                        That is correct--nothing to worry about here.

                        Also, in some other regressions that I run, whenever I have an interaction term with my dummy NIRP, coefficients appear for both values 0 and 1 of NIRP; however, in these results I don't get two coefficients for the interaction terms with NIRP.
                        The model you show that gives you two results for interaction terms looks that way because your model fails to include i.NIRP by itself in the list of predictor variables. This model is ill-formed. If you make a habit of using the ## notation instead of # when you put interaction terms into regressions you won't make that mistake.

                        Finally, in the first results that I posted I get some strange values for the coefficients and standard errors, like 8.43e-06. Can this be interpreted, or have I done something completely wrong that I need to change?
                        Well, the coefficient for that term is also very small. All that tells you is that this interaction is very small, but your data have a lot of information about it, so your estimate of it is rather precise. I wouldn't worry about it.



                        • #13
                          Clyde Schechter However, if I get a number like -9.53e-07 for my coefficients, can it be interpreted? For example, in this regression:

                          Code:
                          . xtreg total_lending L1.total_lending c.L1.EL##NIRP c.L1.EL#NIRP#c.L1.Households_dep NIRP##c.L1.Households_dep i.bankid i.md
                          > ate if high_RD_ratio_all_months == 1 & mdate< tm(2018m12),fe  robust /* full regression for subgroup 1, and period 2011-201
                          > 8 */
                          
                          
                          Fixed-effects (within) regression               Number of obs     =      1,097
                          Group variable: bankid                          Number of groups  =         18
                          
                          R-sq:                                           Obs per group:
                               within  = 0.9671                                         min =          1
                               between = 0.9997                                         avg =       60.9
                               overall = 0.9989                                         max =         94
                          
                                                                          F(16,17)          =          .
                          corr(u_i, Xb)  = 0.9801                         Prob > F          =          .
                          
                                                                          (Std. Err. adjusted for 18 clusters in bankid)
                          ----------------------------------------------------------------------------------------------
                                                       |               Robust
                                         total_lending |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                          -----------------------------+----------------------------------------------------------------
                                         total_lending |
                                                   L1. |   .9445454   .0229484    41.16   0.000     .8961284    .9929623
                                                       |
                                                    EL |
                                                   L1. |   .2332589   .1137809     2.05   0.056    -.0067979    .4733156
                                                       |
                                                1.NIRP |    2310.25   1308.861     1.77   0.096    -451.2046    5071.705
                                                       |
                                            NIRP#cL.EL |
                                                    1  |  -.3934421   .1437637    -2.74   0.014    -.6967571   -.0901272
                                                       |
                          NIRP#cL.EL#cL.Households_dep |
                                                    0  |  -9.53e-07   5.17e-07    -1.84   0.083    -2.04e-06    1.38e-07
                                                    1  |   4.46e-07   2.64e-07     1.69   0.109    -1.11e-07    1.00e-06
                                                       |
                                        Households_dep |
                                                   L1. |  -.0103126   .0181993    -0.57   0.578    -.0487097    .0280846
                                                       |
                                NIRP#cL.Households_dep |
                                                    1  |   .0116478   .0054442     2.14   0.047     .0001617     .023134



                          • #14
                            The two coefficients for NIRP#cL.EL#cL.Households_dep are extremely small--and they are literally 7 orders of magnitude smaller than the rest of the coefficients in your model. So unless the variables EL or Households_dep range up as high as 10^7, the contribution of this three-way interaction is negligible. So you can interpret it as "three way interaction too small to bother with." In fact, if I were you, I would re-estimate the model omitting it, and just having the two-way interactions:

                            Code:
                            xtreg total_lending L1.total_lending i.NIRP##(c.L1.Households_dep c.L1.EL) i.bankid i.mdate if high_RD_ratio_all_months == 1 & mdate< tm(2018m12),fe robust /* full regression for subgroup 1, and period 2011-2018 */



                            • #15
                              I see, that makes sense. Dear Clyde Schechter thank you again very much for all your help!

