Panel data date structure, Lagged values and deleting unmatched observations

YH jordaan

Join Date: Dec 2015

Posts: 46
#1

Panel data date structure, Lagged values and deleting unmatched observations

18 Dec 2015, 09:10

Hello,

For my thesis I investigate the relationship between bank credit ratings and financial variables of banks'. I have (quarterly) panel data of the period 2001-2014.

There are a few problems regarding this panel data set:

First: I want to analyze the data in quarters, however dates are stored in the form MD20Y. I read a few post on this forum and in the stata guide as well. In order to convert the dates to year quarter form (e.g. 2002Q2) I performed the following:

generate date= date(Data, "DM20Y")
format %-tq date

Altough the order of my observations are ordered in a right way now, the new generated data variable is not. As you can see at the end of this post, at the bottom*. (as it is not good readable in this way, I attached a screenshot.)

DataDate Data date
310301 310301 5726q2

Where Data is the string variable of DataDate and is in DM20Y form, so the 31th of March 2001.

I know that the SIF coding of stata dates is from January 1960, so in the case of Quarterly data a 2 would represent second quarter of 1960. It looks like the same thing is happening in my dataset now, while I formatted the dates to display in %tq format (HRF). What is the problem here? And is it a problem if my data is ordered in the right way anayway (on the other hand, i xstet my data on ID date (so the 5726q2))?

Second: This problem might relate to the first problem. I want to generate a lag variable of assets. But when I use the following code, i have a lot of missing values:

xtset gvkey time
gen lagassets= D.AssetsTotal

So first I thought that the problem arose, because I had gaps in my dataset. So i tried to use tsfill, but still I only got missing values when I generated the lag variable. Then i found a other code to take care of gaps and that is the following:

sort panel date
by panel: gen time = _n
xtset panel time

After I used this command,tthere were no gaps in my data anymore. However, I still got a lot of missing values.
In the end I found a solution to this, which relates to problem 3, but I am doubting if this is valid to do:

When I looked at my data, I noticed that some data were not matched. So some of the credit ratings of banks, were not matched to their financial variables (because these financial variables were not available for this bank).
When I dropped the non-matched data, and only kept the matched data, the lagged command finally worked: I have data of lagged assets now, however the data of non-matched variables are dropped.

This relates to the Third problem:

If you want to relate a variable Y= variable of interest + control variable, can you delete unmatched data in Stata? In my case it is obvious that when performing a regression, it is useless to have a credit rating of bank A but no financial variables of bank A or the other way around.
However, when I perform descriptive statistics for example, i get a result of the whole data set (including matched variables). So is it valid to drop non matched variables, in order to solve the issue described in problem 2?

I hope you can help me out. I know that it is not ideal to post multiple questions and problems in one post, but as you can see this problems relate to each other.

Many thanks in advance to help me out.

Yannick

* This is the way how my data looks. Gvkey is the ID (Banks). As you can see date is in a weird format and not corresponding to CalandarDataYearandQuarter.

gvkey SP DataDate Data date CalendarDataYearandQuarter
1619 A- 310301 310301 5726q2 2001Q1
1619 A- 300601 300601 5749q1 2001Q2
1619 A- 300901 300901 5772q1 2001Q3
1619 A- 311201 311201 5795q1 2001Q4
1619 A- 310302 310302 5817q3 2002Q1
1619 A- 300602 300602 5840q2 2002Q2
1619 A- 300902 300902 5863q2 2002Q3

1 Photo

Last edited by YH jordaan; 18 Dec 2015, 09:12.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#2

18 Dec 2015, 09:22

I've only read as far as problems 1 and 2. All of this comes because you have mangled the creation of your date variable. The function date() creates a SIF daily date, not a quarterly date. Applying the %tq format doesn't change that: it just makes Stata display the variable in a way that looks like quarterly dates in the distant future. But then you run into problems like big gaps, etc. So you need to do this:

Code:

generate date= qofd(date(Data, "DM20Y")) format %-tq date

Now you can use -xtset gvkey date- and things should work reasonably well. (At least if you find missing values and gaps they will be actual missing values and gaps in your data, not side effects of miscoding the date variable.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36053
#3

18 Dec 2015, 09:23

There are several questions here. I'll peel off one.

When you create a numeric daily date and give it a quarterly display format, you don't recast it as a quarterly date. Let's look at today's date:

Code:

. di mdy(12, 18, 2015) 20440 . di %td mdy(12, 18, 2015) 18dec2015 . di %tq mdy(12, 18, 2015) 7070q1

Today's date is 20440 days after the start of 1960. A daily display format returns what you gave. But a quarterly display format says that as a quarterly date that is 20440/4 = 5110 years after the start of 1960, or 7070, which is not what you want.

Changing a display format just changes what is displayed, not what is stored.

Code:

. di %tq qofd(mdy(12, 18, 2015)) 2015q4

qofd() maps from daily dates to quarterly dates.

Metatip: When in doubt use display on individual dates where you can work out correct answers.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#4

18 Dec 2015, 09:30

Now I've looked at your third problem, which seems unrelated to the first two.

When you do a regression in Stata, the regression is estimated using only those observations that have non-missing values for all of the variables specified in the regression model. This happens automatically--you don't have to do anything explicit.

My sense is that you are looking to generate other statistics using only those observations that will end up being included in the regression model. One way to do that is to run the regression model first and then apply the -if e(sample)- condition to the other calculations. If you want to review the other statistics before you run the regression, then you can use -if !missing(varlist)- where you replace varlist by a comma-separated list of all the variables that will appear in the regression model.

You can also drop those observations with missing values if you wish. But before doing that, I would compare the descriptive statistics on those observations with the descriptive statistics of the observations you will be keeping. You may find that there is an appreciable difference between the groups, leading to potentially biased results. If that is the case, you will need to investigate the mechanisms that generate the missing data values and figure out what to do about it.
Comment
YH jordaan

Join Date: Dec 2015

Posts: 46
#5

18 Dec 2015, 09:35

Thank you both! Problem 1 has been solved now. I will take a look if problem 2 and 3 are still present.
Comment
YH jordaan

Join Date: Dec 2015

Posts: 46
#6

18 Dec 2015, 10:18

After implementing the first solution, I run into another problem: when i want to declare my dataset as panel data, i get the: repeated time values within panel r(451) error. Normally I know how to solve this, namely by the following code:

Code:
duplicates report gvkey date
duplicates list gvkey date
duplicates tag gvkey date, gen(isdup)
edit if isdup

When in edit modus, I notice that almost my whole dataset has this problem. This is due from the fact that my credit rating data, specified in quartely data, has the form of either: 31-01- 20XX 28- 02- 20XX 29-02-20XX 31-05-20XX .... 31-10-20XX etc. While my dataset of banks variables (capital, profit etc) only have dates according to quarters: DataDate 31-12-20XX 31-03-20XX 30-06-20XX 30-09-20XX. So only these dates have a match with the credit rating data. Let me be clear, that I understand the problem here regarding to the whole dataset: with respect to the credit rating dates, 28-02-20xx and 29-02-XX and 31-03-20XX, STATA threats these variables as the same date, namely 2001 Q1 for the same ID (Bank). This is a problem. However, when I filter only on my matched variables (_merge==3), then I see an Isdup value of 2 as well, while there are no repeated quarter dates per observation.

Attached two screenshot to make it clear:

EDIT:

Again, when I delete the non matched data (so not a match between credit rating and financial variables), then everything works fine and I get the following result:

. xtset gvkey date, quarterly
panel variable: gvkey (unbalanced)
time variable: date, 2001q1 to 2014q4, but with gaps
delta: 1 quarter

But then again, the same problem as decrebed in my OP: can i delete those non matched observations (data)? @ Clyde I am not interested in statistic of the whole dataset or only the dataset used in the regression. I just want to investigate the relationship between credit rating and financial bank variables. So I provide descriptive statistics of my resaarch, the regression results and regression diagnostics tests.

2 Photos

Last edited by YH jordaan; 18 Dec 2015, 10:31.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#7

18 Dec 2015, 10:35

I don't think I fully understand your data set up. But, to the extent I do, it appears that your dataset of bank variables has only one observation per quarter, whereas your credit rating data set has daily observations. It is easy enough to convert the daily dates to quarterly dates in both files, but then when you merge them you get a many to 1 merge, leaving multiple observations per quarter for each bank in your data. That precludes -xtset-ing the data.

It seems to me that there are two ways to proceed. You will need to decide based on your context and research questions which is more appropriate.

1. You can reduce the credit rating data set to one observation per quarter. If the credit rating is always the same throughout a quarter, this is easy: just pick one observation out of each quarter. If the credit rating can change during a quarter, then you need to identify some rule: perhaps you want the rating at the beginning, or the one at the end, or the best rating, or the worst, or the median, or.... In the end, you retain one observation on credit rating for each quarter than you can 1:1 merge with the other quarterly data and -xtset- from there. Evidently this approach involves some loss of information, and you need to decide if that is acceptable in your context.

OR

2. The opposite approach is to decide to treat this as an analysis of daily data. In that case you can work with the data set as you already have it. But you now need to create a daily date variable. -gen daily_date = date(Data, "DM20Y")- and -format daily_date %td-. So this is, except for formatting, the approach you started with. Assuming that there is only one observation per bank per date, you will be able to -xtset gvkey daily_date- and work from there. The drawback is that you don't have true daily observations of the bank data variables, so there is some misspecification involved here.

Aside: going forward please don't post screenshots. Often they are unreadable. Yours happen to be readable, but even so, if I wanted to try to work with this data, I can't copy and paste from a screenshot. The best way to show sample data is to use -dataex- (ssc install -dataex- if you don't have it). Run -dataex- to generate the sample of data you want to show, and the paste the results directly into the forum. (It will already be set up as a code block.)
Comment
YH jordaan

Join Date: Dec 2015

Posts: 46
#8

18 Dec 2015, 11:09

Thank you Clyde Schechter . For my research option 1 suits best. I want the rating at the end of the quarter. This corresponds with the data I have of my bank data set, namely: 31-12-20XX 31-03-20XX 30-06-20XX 30-09-20XX.

In order to go ahead with option 1, is it okay to just delete those observation which are not a match with my bank data? These are for instance the credit ratings of banks on dates 28-02-20XX, 29-02-20XX, etc. ?

Regarding the screenshots: Sorry for this. Next time I do it the way you stated.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#9

18 Dec 2015, 11:28

Let's back up a little bit. You seem to have two data sources that you have -merged- into a single data file. Let's call them credit_ratings.dta and banks.dta. They are linked on gvkey. The banks.dta contains a quarterly date which I'll call b_date. The credit_ratings.dta file contains a daily date which I'll just callcr_ date. Both qdate and date appear as DM20Y strings in their original data sets. So you want to put these together, retaining the last date from each quarter in credit_ratings that has a match in banks.dta. So it would look like this:

Code:

use banks, clear gen qdate = dofq(daily(b_date, "DM20Y")) format qdate %tq tempfile holding save `holding' use credit_ratings,clear gen daily_date = daily(d_date, "DM20Y") gen qdate = dofq(d_date) format daily_date %td format qdate %tq merge m:1 gvkey qdate using `holding' keep if _merge == 3 // DATA THAT LINK UP ONLY by gvkey qdate (daily_date), sort: keep if _n == _N iside gvkey qdate, sort save merged_filtered_data, replace

This will create a data set containing only data that match up between the two file, and it specifically retains the last such observation in each quarter for each bank.

As for the appropriateness of just omitting unmatched data, I repeat my comment in #4. The suitability of doing that depends on why there are records that don't match, and whether those that do are, for your purposes, a biased subsample.
Comment
YH jordaan

Join Date: Dec 2015

Posts: 46
#10

18 Dec 2015, 11:43

Again many thanks Clyde Schechter . However, in your post you said that you call the date variable cr_date, while in the code it is named d_date? I named everything in my files the same as in the code you provided, but there are some errors. Do I have to use cr_date instead? So you did not make a mistake in your text?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#11

18 Dec 2015, 13:01

Sorry about that. Yes, I goofed. I started out with one set of variable names in mind, then switched to another, and forgot to update all of the code. The middle "paragraph" of the code in #9 should look like this:

Code:

use credit_ratings,clear gen daily_date = daily(cr_date, "DM20Y") gen qdate = dofq(daily_date) format daily_date %td format qdate %tq

Note: changes in bold face.
Comment
YH jordaan

Join Date: Dec 2015

Posts: 46
#12

18 Dec 2015, 15:25

Np about the mistake.

Unfortunately, I do not get results with this (updated) code. The error occurred when the data got merged:

merge m:1 gvkey qdate using `holding'
variables gvkey qdate do not uniquely identify observations in the using data
r(459);

I tried m:m merge, or 1:m, but these are not working either.

Do you want a preview of my data as you described at the bottom of post #7?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#13

18 Dec 2015, 15:41

Well, Stata is telling you the problem. You have multiple observations with the same gvkey and qdate in the bank data. The solution to this is definitely not to do an m:m merge.

You need to review your bank data to find out why you have this. You can easily identify those observations to study them:

Code:

use banks, clear duplicates tag gvkey qdate, gen(flag) sort gvkey qdate list if flag, sepby(gvkey) /* IF THE ABOVE LIST COMMAND PRODUCES AN UNMANAGEABLY LARGE AMOUNT OF DATA TRY -browse if flag- INSTEAD. */

You will then have to figure out how to handle these. How did they get there in the first place? Are some of the duplicates the result of earlier data management errors when banks.dta was created? Are these observations also duplicates on all the variables? If this is the case, -duplicates drop gvkey qdate- will reduce the data set to 1 obs per gvkey qdate combination with no loss of information--save that and use it instead of banks.dta If there are genuinely different multiple observations for the same gvkey and qdate, is there some rule by which you can reasonably select just one for your purpose? (Again, the first, or the last, or the best, or the worst, or the median, or the mean, or the total, or.....)

At the end of the day, you need to resolve this underlying problem in your data, and it will rely far more on your knowledge of where the data comes from, what it means, and how you ultimately plan to use it, than on Stata coding issues (though, of course, I am happy to help with the latter as they arise.)
Comment
YH jordaan

Join Date: Dec 2015

Posts: 46
#14

18 Dec 2015, 18:12

I think I solved the problem. Think because of I used b_date instead of qdate, if it was a typo. However, there were 4 observations with the same date, this was an error in the file.

Only the last command of post #9 did not work:
iside gvkey qdate, sort But my data is matched now.

Thanks for the help so far. Great there exists a platform like this, saves me a lot of thesis writing stress
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#15

18 Dec 2015, 19:45

Only the last command of post #9 did not work:
iside gvkey qdate, sort But my data is matched now.

Ouch! That was a typo. It should have been

Code:

isid gvkey qdate, sort

Sorry about that. Glad you've got a workable data set now.
Comment

Announcement

Panel data date structure, Lagged values and deleting unmatched observations

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment