  • #16
    Dear Clyde Schechter, I am sorry to bother you again. Some of my banks are not financed through deposits, so my variables RD_ratio and total_deposits are equal to 0. If I want to work only with the banks that have total_deposits > 0, is this the correct way to adjust the above commands?

    Code:
    by mdate, sort: egen median_rd_ratio_this_month = median(RD_ratio) if RD_ratio > 0
    gen byte high_RD_ratio = RD_ratio > median_rd_ratio_this_month if !missing(RD_ratio)
    
    preserve
    collapse (mean) Loan_ratio, by(high_RD_ratio mdate)
    xtset high_RD_ratio mdate
    xtline Loan_ratio
    restore
    and

    Code:
    //  OTHER CRITERION: MEDIAN FOR ALL BANKS IN ALL MONTHS
    summ RD_ratio, detail
    gen byte high_RD_ratio_all_months = RD_ratio > `r(p50)' if !missing(RD_ratio) & RD_ratio > 0
    
    preserve
    collapse (mean) Loan_ratio, by(high_RD_ratio_all_months mdate)
    xtset high_RD_ratio_all_months mdate
    xtline Loan_ratio
    restore
    Code:
    xtreg Loan_ratio Loan_ratio_lagged c.L1.EL_ratio##NIRP##c.L1.RD_ratio L1.Dep_Riks_ratio L1.certificates_ratio i.bankid i.mdate if high_RD_ratio_all_months == 1 & mdate < tm(2018m12) & RD_ratio > 0, fe vce(cluster bankid)
    Also, some of my banks (2-3 banks) have reported values only for a small period and have missing values for the rest of the months. Can this bias my results? What I mean is, e.g., a bank with a high RD_ratio has reported values for the period 2011-2015 and after that only missing values. If I want to examine the effect of negative rates after 2015 on lending, could this bias my results, since, for example, I would have the lending volumes of 20 banks before 2015 and only 19 after 2015, because the missing observations after 2015 won't be taken into account by Stata? Should I just drop these banks since they have so many missing values? I hope this is not a silly or unclear question.



    • #17
      Some of my banks are not financed through deposits, so my variables RD_ratio and total_deposits are equal to 0. If I want to work only with the banks that have total_deposits > 0, is this the correct way to adjust the above commands?
      by mdate, sort: egen median_rd_ratio_this_month = median(RD_ratio) if RD_ratio > 0
      You might do it that way. But I think you are setting yourself up for later problems. You will have missing values in median_rd_ratio_this_month, and those can mess up subsequent -if- conditions because missing values are considered greater than any non-missing value. Yes, you can code around that with more complicated -if- conditions, but you are likely to overlook this somewhere down the line. If you want to exclude these banks from the analysis, it is best to just drop them from the data set in the first place. You will also have to decide, if you have a bank with RD_ratio sometimes 0 and sometimes > 0, whether to drop only the observations where it is 0 or to exclude the bank altogether. This is somewhat related to your second question.
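
      If excluding such banks altogether is what you decide, one way to do it is a sketch along these lines (assuming the bankid and total_deposits variables used in this thread):
      Code:
      // Flag banks that ever report zero deposits, then drop them entirely
      by bankid, sort: egen byte ever_zero_dep = max(total_deposits == 0)
      drop if ever_zero_dep == 1
      drop ever_zero_dep
      The -egen max()- of a true/false expression is 1 for a bank if the condition holds in any of its observations, so the -drop- removes every observation of those banks, not just the zero months.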

      But if you need to retain these banks because they are needed for other aspects of the analysis, then this is a reasonable way to go. But then you have a problem with:
      Code:
      gen byte high_RD_ratio = RD_ratio > median_rd_ratio_this_month if !missing(RD_ratio)
      The problem is that for a bank with RD_ratio = 0, median_rd_ratio_this_month will have a missing value. Then the expression RD_ratio > median_rd_ratio_this_month evaluates 0 > ., which is false, since a missing value in Stata is larger than any non-missing value. Moreover, RD_ratio, being 0, is not missing. So these observations will have high_RD_ratio = 0, and therefore they will be included in analyses that look at the variable high_RD_ratio. So you need to modify this to:
      Code:
      gen byte high_RD_ratio = RD_ratio > median_rd_ratio_this_month if !missing(RD_ratio, median_rd_ratio_this_month)
      That will cause high_RD_ratio to be missing, which is better.
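
      In case it helps, Stata's missing-value ordering can be checked interactively with a quick sketch:
      Code:
      display 0 > .          // 0 (false): missing is larger than any non-missing value
      display . > 1e300      // 1 (true)
      display missing(0, .)  // 1: missing() is true if any of its arguments is missing
      This is why -!missing(RD_ratio, median_rd_ratio_this_month)- is the safe guard: it screens out observations where either variable is missing.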

      Next,
      Code:
      // OTHER CRITERION: MEDIAN FOR ALL BANKS IN ALL MONTHS
      summ RD_ratio, detail
      gen byte high_RD_ratio_all_months = RD_ratio > `r(p50)' if !missing(RD_ratio) & RD_ratio > 0
      will do something different that, I think, is not what you want. In this case, the observations with RD_ratio = 0 are included in the calculation of the median, but then are classified as not having a high RD ratio in all months. Is that what you want? If so, go with it--but it is different from what your this-month-median code was aimed at doing. If what you want is like what you wrote for the this-month-median commands, then it should be:
      Code:
      summ RD_ratio if RD_ratio > 0, detail
      gen byte high_RD_ratio_all_months = RD_ratio > `r(p50)' if !missing(RD_ratio) & RD_ratio > 0
      Also, some of my banks (2-3 banks) have reported values only for a small period and for the rest of the months they have missing values. Can this bias my results?
      ...
      Should I just drop these banks since they have so many missing values?
      This is actually a deep and complicated question. The issue, however, is not having 20 values here and 19 values there. That's not important. What matters is why those values are missing. If the missingness of those values is related to the actual values we would have if they were not missing, then, yes, that is a source of bias in the analysis--but removing those banks may not solve the problem anyway, because perhaps the banks that don't report those values differ in a relevant way from those that do. Anyhow, this is primarily a question about bank data reporting practices, and you need to consult somebody with expertise in that area for an answer. If the missingness is just a random (in the technical, statistical sense of the word) phenomenon, then there is no bias created and you can leave your data set as is.

      If it is not a random phenomenon, however, then you are likely to have a bias problem no matter what you do. Missing data is a problem to which there are no good solutions--one tries to find the least bad solution for one's particular situation. There are a number of approaches in use, all with serious limitations. It is too complicated to go into here. You might take a look at https://statisticalhorizons.com/wp-c...aterials-1.pdf.



      • #18
        Originally posted by Clyde Schechter:


        If what you want to do is like what you wrote for the this month median commands, then it should be:
        Code:
        summ RD_ratio if RD_ratio > 0, detail
        gen byte high_RD_ratio_all_months = RD_ratio > `r(p50)' if !missing(RD_ratio) & RD_ratio > 0

        This is exactly what I wanted. Thank you very much! And I assume that, since we adjust the groups when I form them, the commands for the diagrams remain the same:
        Code:
        preserve
        collapse (mean) loan_ratio, by(high_RD_ratio_all_months mdate)
        xtset high_RD_ratio_all_months mdate
        xtline loan_ratio
        restore
        Thank you also for all the information about the missing values!



        • #19
          ...and I assume that since we adjust the groups when I form them, the commands for the diagrams remain the same
          That's right.



          • #20
            Dear Clyde Schechter, I hope I am not being a bother with all these questions; hopefully these will be my last ones. I came across some papers that use a different method to group the high_RD_ratio banks. What I want is: if a bank's average RD_ratio over the whole of 2014 is larger than the 2014 median of all the banks with RD_ratio > 0, that bank should belong to the group of high_RD_ratio banks for the whole period of my panel. I have used this code, but it doesn't seem to work, and the -egen- I have written seems to produce only missing values.

            Code:
            // Group of high RD_ratio banks if a bank in 2014 has average RD_ratio > median of the banks with RD_ratio > 0 for the year 2014
            bysort bankid: egen meangRD_ratio9 = mean(RD_ratio) if !missing(RD_ratio) & inrange(mdate,tm(2014m1),tm(2014m12))  /* calculating the mean of every bank for the period 01.2014 - 12.2014 */
            
            centile RD_ratio, centile(50) if inrange(mdate,tm(2014m1),tm(2014m12)) & RD_ratio > 0 /* calculating the median for the banks with RD_ratio > 0 for the whole 2014 */
            
            local p50 `r(p_1)' /* storing the median */
            
            gen byte high_RD_ratio_2014 = RD_ratio > `r(p_1)' if !missing(RD_ratio) & inrange(mdate,tm(2014m1),tm(2014m12)) & RD_ratio > 0
            If I want to do the same for p33 and p66 instead of using the median, is this the correct way?

            Code:
            centile RD_ratio if inrange(mdate,tm(2014m1),tm(2014m12)), centile(33 66)
            local p33 `r(c_1)'
            local p66 `r(c_2)'
            gen byte RD_ratio_group = 0 if high_RD_ratio_2014 < `p33'
            replace RD_ratio_group = 1 if inrange(high_RD_ratio_2014, `p33', `p66')
            replace RD_ratio_group = 2 if high_RD_ratio_2014 > `p66' & !missing(EL_ratio)

            Also, you had previously given me this code:
            Code:
            centile EL_ratio, centile(33 66)
            local p33 `r(c_1)'
            local p66 `r(c_2)'
            gen byte EL_ratio_group_all_months = 0 if EL_ratio < `p33'
            replace EL_ratio_group_all_months = 1 if inrange(EL_ratio, `p33', `p66')
            replace EL_ratio_group_all_months = 2 if EL_ratio > `p66' & !missing(EL_ratio)
            If I want to do this only for banks with EL_ratio > 0, do I just add this?
            Code:
            centile EL_ratio if EL_ratio > 0, centile(33 66)
            Lastly, I want to run a regression on the growth rate of the dependent variable, and I have found two ways that give quite similar results. But since I have quite a lot of missing values and zeros, I suppose not using the logarithms is the better option?
            Code:
             // Growth rate calculation. 1st way
             bysort bankid: gen g_RD_ratio = (RD_ratio - L1.RD_ratio) / L1.RD_ratio*100
            
            /* Growth rates. 2nd way */
            gen l_RD_ratio = ln(RD_ratio)
            gen G_RD_ratio2 = D.l_RD_ratio*100



            • #21
              What I want is: if a bank's average RD_ratio over the whole of 2014 is larger than the 2014 median of all the banks with RD_ratio > 0, that bank should belong to the group of high_RD_ratio banks for the whole period of my panel.
              Code:
              summ RD_ratio if year(months) == 2014 & RD_ratio > 0, detail
              local median_RD_ratio `r(p50)'
              
              
              by bankid, sort: egen high_RD_ratio = min(cond(year(months) == 2014, ///
                  (RD_ratio > `median_RD_ratio' & !missing(RD_ratio)), .))
              If I want to do the same for p33 and p66 instead of using the median, is this the correct way?
              It isn't clear to me what doing the same for p33 and p66 means. Your definition for the median is that the high group is one where all of the 2014 RD_ratio values exceed the all-bank median for 2014, and that is a simple yes-no classification. But when you proceed to a three-class grouping, this concept falls apart. If you try to define these groups by all of the observations being < 33rd percentile, all between the 33rd and 66th, and all > 66th, there will be a lot of banks that are unclassifiable, because their observations will fall in different ranges. So please clarify how you want to do this.

              But since I have quite a lot of missing values and zeros, I suppose not using the logarithms is the better option?
              If you have any zeroes at all, even just one, then using logarithms is simply not an option at all. Don't even think about it.

              That said, even if all your observations were positive, this logarithm-based formula for growth is a zombie--I don't understand how it continues to survive and be used. First, it is an approximation only--the other formula you give is exact. Second, the logarithm-based formula is more computationally expensive, by a large margin, than the other formula. I don't understand why anyone wants to expend more resources to get a less accurate answer. I don't even understand how the formula ever came into use in the first place, and it is even more mind-boggling that it is still around at all.
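
              To make the contrast concrete, here is a minimal sketch of the two formulas side by side (assuming the panel is already -xtset bankid mdate-); note the log version is missing wherever RD_ratio is zero or negative:
              Code:
              // Exact growth rate, in percent
              gen g_exact = 100 * (RD_ratio - L.RD_ratio) / L.RD_ratio
              
              // Log-difference approximation, in percent
              gen g_logapprox = 100 * (ln(RD_ratio) - ln(L.RD_ratio))
              For small changes the two are close, but they diverge as the period-to-period change grows, and only g_exact is defined when RD_ratio is 0.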



              • #22
                Originally posted by Clyde Schechter:
                It isn't clear to me what doing the same for p33 and p66 means. Your definition for the median is that the high group is one where all of the 2014 RD_ratio values exceed the all-bank median for 2014, and that is a simple yes-no classification. But when you proceed to a three-class grouping, this concept falls apart. If you try to define these groups by all of the observations being < 33rd percentile, all between the 33rd and 66th, and all > 66th, there will be a lot of banks that are unclassifiable, because their observations will fall in different ranges. So please clarify how you want to do this.
                What I want, for both the median case and the p33/p66 case, is for bank i to be characterized as a high_RD_ratio bank for the whole period of my panel if its average RD_ratio in 2014 is larger than the median (or, in the second case, larger than p66), and as a low_RD_ratio bank if it is below the median (or below p33 in the second case). So even if it had one month above p66 and another month below p66, if its average RD_ratio is above p66 it would be considered a high_RD_ratio bank. Is this what your code does, or does it classify the banks on whether they have RD_ratio > median for every month of 2014?

                Thank you also very much for the information on the calculations of the growth rates!



                • #23
                  OK, I think I misunderstood what you meant in #22 for the median. Here's some code that handles both the median and the terciles.

                  Code:
                  centile RD_ratio if year(months) == 2019, centile(33 50 66)
                  local p33 = `r(c_1)'
                  local p50 = `r(c_2)'
                  local p66 = `r(c_3)'
                  
                  by bankid, sort: egen mean_RD_ratio_2014 = mean(cond(year(months) == 2019, RD_ratio, .))
                  by bankid: gen high_RD_ratio = mean_RD_ratio_2014 > `p50' if !missing(mean_RD_ratio_2014)
                  
                  by bankid: gen RD_ratio_group = 0 if mean_RD_ratio_2014 < `p33'
                  by bankid: replace RD_ratio_group = 1 if inrange(mean_RD_ratio_2014, `p33', `p66')
                  by bankid: replace RD_ratio_group = 2 if mean_RD_ratio_2014 > `p66' & !missing(mean_RD_ratio_2014)



                  • #24
                    Thank you very much! However, when I run the code I get these results:

                    centile RD_ratio if year(mdate) == 2019 & RD_ratio > 0, centile(33 50 66)

                    -- Binom. Interp. --
                    Variable | Obs Percentile Centile [95% Conf. Interval]
                    -------------+-------------------------------------------------------------
                    RD_ratio | 0

                    . local p33 = `r(c_1)'

                    . local p50 = `r(c_2)'

                    . local p66 = `r(c_3)'

                    .
                    . by bankid, sort: egen mean_RD_ratio_2014 = mean(cond(year(mdate) == 2019, RD_ratio, .))
                    (4,536 missing values generated)

                    . by bankid: gen high_RD_ratio = mean_RD_ratio_2014 > `p50' if !missing(mean_RD_ratio_2014)
                    (4,536 missing values generated)

                    .
                    . by bankid: gen RD_ratio_group = 0 if mean_RD_ratio_2014 < `p33'
                    (4,536 missing values generated)

                    . by bankid: replace RD_ratio_group = 1 if inrange(mean_RD_ratio_2014, `p33', `p66')
                    (0 real changes made)

                    . by bankid: replace RD_ratio_group = 2 if mean_RD_ratio_2014 > `p66' & !missing(mean_RD_ratio_2014)
                    (0 real changes made)

                    .
                    end of do-file



                    • #25

                      Sorry, that 2019 in that -centile- command is a typo: it should be 2014!



                      • #26
                        Yes, I noticed that, but even with the correct date it shows zero observations for RD_ratio, and then all the generated values are missing.

                        Code:
                        . centile RD_ratio if year(mdate) == 2014 & RD_ratio > 0, centile(33 50 66)
                        
                                                                               -- Binom. Interp. --
                            Variable |       Obs  Percentile    Centile        [95% Conf. Interval]
                        -------------+-------------------------------------------------------------
                            RD_ratio |       0



                        • #27
                          Please send example data (using -dataex-, of course) that illustrates the difficulty and I will try to troubleshoot. But before that: check to make sure that your RD_ratio variable is numeric and not string. Many Stata calculation commands will show 0 obs when applied to a string variable.
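
                          A quick diagnostic sketch for that check:
                          Code:
                          describe RD_ratio                        // storage type should be numeric (float, double, ...)
                          capture confirm numeric variable RD_ratio
                          if _rc display "RD_ratio is a string -- consider -destring RD_ratio, replace-"
                          -confirm numeric variable- sets a nonzero return code when the variable is string, which is a convenient programmatic test.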



                          • #28
                            I checked, and RD_ratio is float. Also, my data look like this:
                            Code:
                            * Example generated by -dataex-. To install: ssc install dataex
                            clear
                            input byte bankid str48 banks float mdate long total_assets float(NIRP Dep_Riksbank_ratio total_lending total_deposits EL Loan_ratio EL_ratio RD_ratio)
                            36 "Swedbank Sjuhärad AB"           715  24410 1 0  29789  28939  1103 1.2215115  .0451864 1.1855388
                            36 "Swedbank Sjuhärad AB"           716  24399 1 0  29834  29016  1257  1.222204  .0515185  1.189229
                            36 "Swedbank Sjuhärad AB"           717  24756 1 0  29900  29302  1346   1.22546 .05437066 1.1836323
                            36 "Swedbank Sjuhärad AB"           718  24665 1 0  30070  29476  1455  1.214655 .05899047 1.1950537
                            36 "Swedbank Sjuhärad AB"           719  24491 1 0  30142  29202  1375 1.2220556 .05614307 1.1923563
                            37 "Svenska Handelsbanken AB (Publ)" 612 754745 0 0 276184 502038 33698         . .04464819  .6651757
                            37 "Svenska Handelsbanken AB (Publ)" 613 731619 0 0 277695 495085 32618  .3679322 .04458331  .6766979
                            37 "Svenska Handelsbanken AB (Publ)" 614 750352 0 0 278606 506478 47968  .3808075 .06392733  .6749872
                            37 "Svenska Handelsbanken AB (Publ)" 615 730786 0 0 281661 512492 43526  .3753718 .05956053  .7012888
                            37 "Svenska Handelsbanken AB (Publ)" 616 712758 0 0 286975 502550 34965  .3926936 .04905592   .705078
                            37 "Svenska Handelsbanken AB (Publ)" 617 707968 0 0 287125 508277 39087  .4028366 .05521012  .7179378
                            end
                            format %tm mdate



                            • #29
                              OK, there are two different things going on here.

                              The first is that this data is different from what you showed early in this thread: you are now using a monthly date variable, mdate, where you previously had a daily date variable called months. (The dates themselves may have been at 1 month intervals, but they were calculated as daily dates, which is a different numeric encoding from what is used for monthly date variables.) The -year()- function only works properly with daily dates as arguments. Having switched to a monthly date variable, the code which worked with the earlier data example is now broken. The fix, however, is simple: before applying the -year()- function to mdate, apply the -dofm()- function to get the corresponding daily date value.

                              Code:
                              centile RD_ratio if year(dofm(mdate)) == 2014, centile(33 50 66)
                              local p33 = `r(c_1)'
                              local p50 = `r(c_2)'
                              local p66 = `r(c_3)'
                              
                              by bankid, sort: egen mean_RD_ratio_2014 = mean(cond(year(dofm(mdate)) == 2014, RD_ratio, .))
                              by bankid: gen high_RD_ratio = mean_RD_ratio_2014 > `p50' if !missing(mean_RD_ratio_2014)
                              
                              by bankid: gen RD_ratio_group = 0 if mean_RD_ratio_2014 < `p33'
                              by bankid: replace RD_ratio_group = 1 if inrange(mean_RD_ratio_2014, `p33', `p66')
                              by bankid: replace RD_ratio_group = 2 if mean_RD_ratio_2014 > `p66' & !missing(mean_RD_ratio_2014)
                              However, this code still produces only missing values in the data example you give. The second problem is that there is no data from year 2014 in the example. Presumably your real data set has data from year 2014. But you can expect only missing values for any bank that has no 2014 data.



                              • #30
                                I am sorry, I forgot to mention that I had changed my months variable to the one you had suggested. This code works perfectly! Thank you very much! If I want to do this only for the banks with RD_ratio > 0, do I just have to do this?

                                Code:
                                centile RD_ratio if year(dofm(mdate)) == 2014 & RD_ratio > 0, centile(33 50 66)
                                local p33 = `r(c_1)'
                                local p50 = `r(c_2)'
                                local p66 = `r(c_3)'
                                
                                by bankid, sort: egen mean_RD_ratio_2014 = mean(cond(year(dofm(mdate)) == 2014, RD_ratio, .))
                                by bankid: gen high_RD_ratio = mean_RD_ratio_2014 > `p50' if !missing(mean_RD_ratio_2014) & RD_ratio > 0
                                
                                by bankid: gen RD_ratio_group = 0 if mean_RD_ratio_2014 < `p33' & RD_ratio > 0
                                by bankid: replace RD_ratio_group = 1 if inrange(mean_RD_ratio_2014, `p33', `p66') & RD_ratio > 0
                                by bankid: replace RD_ratio_group = 2 if mean_RD_ratio_2014 > `p66' & !missing(mean_RD_ratio_2014) & RD_ratio > 0
                                And one last question: if I use p33 and p66 to group my banks, then for my regression do I have to run two separate regressions, one with if RD_ratio_group == 2 and one with if RD_ratio_group == 1?

