Hello,
For my thesis I am investigating the relationship between bank credit ratings and banks' financial variables. I have quarterly panel data for the period 2001-2014.
There are a few problems regarding this panel data set:
First: I want to analyze the data by quarter, but the dates are stored in the form DM20Y. I read a few posts on this forum and the Stata guide as well. To convert the dates to year-quarter form (e.g. 2002Q2) I ran the following:
generate date= date(Data, "DM20Y")
format %-tq date
Although my observations are now sorted correctly, the newly generated date variable does not display correctly, as the data excerpt at the bottom of this post shows*. (Because it is hard to read in this form, I attached a screenshot.)
DataDate Data date
310301 310301 5726q2
Here Data is the string version of DataDate and is in DM20Y form, so 310301 is the 31st of March 2001.
I know that Stata's internal (SIF) date coding counts from January 1960, so for quarterly data a 1 would represent the second quarter of 1960. It looks like something similar is happening in my data set, even though I formatted the dates to display in %tq format (HRF). What is the problem here? And is it a problem at all if my data are sorted correctly anyway (on the other hand, I xtset my data on ID and date, i.e. on the 5726q2 values)?
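For reference, the displayed value can be traced by hand: date() returns a daily date (days since 01jan1960), which is 15065 for 31 March 2001, and applying %tq to that number reads it as 15065 quarters after 1960q1, i.e. 5726q2. A minimal sketch of one common way to get a true quarterly date, assuming Data is a string such as "310301" (ddate and qdate are hypothetical variable names):

```stata
* Sketch only: convert a DDMMYY string into a quarterly date.
* Assumes Data is a string variable, e.g. "310301" for 31 March 2001.
* ddate and qdate are illustrative names, not variables from the post.

* Step 1: daily date, counted in days since 01jan1960
generate ddate = date(Data, "DM20Y")
format %td ddate

* Step 2: quarterly date, counted in quarters since 1960q1
generate qdate = qofd(ddate)
format %tq qdate

* Inspect the result next to the original string
list Data ddate qdate in 1/5
```

The format command only changes how a number is displayed; the conversion between daily and quarterly units is done by qofd().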
Second: This problem might be related to the first. I want to generate a lagged variable of assets, but when I use the following code, I get a lot of missing values:
xtset gvkey time
gen lagassets= D.AssetsTotal
At first I thought the problem arose because I had gaps in my data set, so I tried tsfill, but I still got only missing values when I generated the lag variable. Then I found another piece of code to take care of gaps:
sort panel date
by panel: gen time = _n
xtset panel time
After I ran these commands, there were no gaps in my data anymore. However, I still got a lot of missing values.
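Two details may matter for the code above. D. is Stata's difference operator, while L. is the lag operator, and if the time variable in xtset is a daily number, consecutive quarter-ends are about 91 units apart, so one-period lags come out missing. A minimal sketch, assuming qdate is a genuine %tq variable built with qofd() and that each gvkey-qdate pair occurs only once (qdate, lagassets and dassets are hypothetical names):

```stata
* Sketch only: declare the panel on a true quarterly date.
* Assumes qdate was created with qofd() and formatted %tq,
* and that gvkey-qdate pairs are unique.
xtset gvkey qdate

* L. is the lag operator: last quarter's value of AssetsTotal
generate lagassets = L.AssetsTotal

* D. is the difference operator: AssetsTotal minus its lag
generate dassets = D.AssetsTotal
```

With this setup the first quarter of each bank, and any quarter that follows a gap, still has a missing lag, which is expected.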
In the end I found a solution to this, which relates to problem 3, but I doubt whether it is valid:
When I looked at my data, I noticed that some observations were not matched: some of the banks' credit ratings were not matched to their financial variables (because these financial variables were not available for the bank).
When I dropped the non-matched observations and kept only the matched ones, the lag command finally worked. I now have lagged assets, but the non-matched observations are gone.
This relates to the third problem:
If you want to model Y = variable of interest + control variables, can you delete unmatched data in Stata? In my case it seems obvious that, for a regression, it is useless to have a credit rating for bank A but no financial variables for bank A, or the other way around.
However, when I compute descriptive statistics, for example, I get results for the whole data set (unmatched observations included). So is it valid to drop the non-matched observations in order to solve the issue described in problem 2?
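One alternative to dropping is to flag the complete observations and restrict each command to them. A minimal sketch, assuming SP (the rating, stored as a string) and AssetsTotal (numeric) are the variables the analysis needs; insample is a hypothetical name:

```stata
* Sketch only: flag observations that have both a rating and assets,
* instead of deleting the others from the data set.
* insample is an illustrative name, not a variable from the post.
generate byte insample = !missing(SP) & !missing(AssetsTotal)

* How many observations are complete?
tabulate insample

* Descriptive statistics for the complete observations only
summarize AssetsTotal if insample
```

This keeps the full data set available while letting the descriptive statistics and the regression use the same observations.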
I hope you can help me out. I know that it is not ideal to post multiple questions in one post, but as you can see these problems are related to each other.
Many thanks in advance for your help.
Yannick
* This is how my data look. gvkey is the ID (banks). As you can see, date is displayed in a strange format and does not correspond to CalendarDataYearandQuarter.
gvkey SP DataDate Data date CalendarDataYearandQuarter
1619 A- 310301 310301 5726q2 2001Q1
1619 A- 300601 300601 5749q1 2001Q2
1619 A- 300901 300901 5772q1 2001Q3
1619 A- 311201 311201 5795q1 2001Q4
1619 A- 310302 310302 5817q3 2002Q1
1619 A- 300602 300602 5840q2 2002Q2
1619 A- 300902 300902 5863q2 2002Q3