Data structure before using stset

Cooper Felix

Join Date: Sep 2015

Posts: 84
#1

01 Nov 2018, 00:09

I'm planning a survival analysis, but I have several questions about how the data should be structured. Here is what I have so far.

The data cover 2007-2009. The variables are:
id=firm id
current_yr=current fiscal year
founding_yr=year in which the focal firm was founded
bankruptcy_yr=year in which the focal firm went out of business
OOB=out-of-business indicator (1 if the firm goes out of business, 0 otherwise)

Code:

id  current_yr  founding_yr  bankruptcy_yr
1   2007        2005         2011
1   2008        2005         2011
2   2007        2007         2010
2   2008        2007         2010
2   2009        2007         2010

gen yr_since_found=current_yr-founding_yr
gen OOB=bankruptcy_yr==current_yr

Here is my question: when I generate the dummy variable "OOB" to indicate whether a firm filed for bankruptcy (OOB==1) or not (OOB==0), how should I code it for firms whose bankruptcy year falls after the end of the sample period? For instance, in the data set above, firm 1 didn't file for bankruptcy within the sample period, so I guess "OOB" should equal 0 in every observation for firm 1, correct?

Code:

id  current_yr  founding_yr  bankruptcy_yr  OOB
1   2007        2005         2011           0
1   2008        2005         2011           0
2   2007        2007         2010           0
2   2008        2007         2010           0
2   2009        2007         2010           1

stset yr_since_found, failure(OOB==1)

Is there any error in the last command (stset)? Should I specify stset differently in order to conduct survival analysis? Thanks!

Last edited by Cooper Felix; 01 Nov 2018, 00:11.
Tags: data, survival analysis
Cooper Felix

Join Date: Sep 2015

Posts: 84
#2

01 Nov 2018, 08:08

It appears that I have a right-censoring issue, i.e., for some firms the bankruptcy date falls after the end of the sample period. Am I correct? Any comments or help would be appreciated.

Last edited by Cooper Felix; 01 Nov 2018, 08:26.
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#3

02 Nov 2018, 11:54

See https://stats.idre.ucla.edu/stata/se...tata-survival/
Cooper Felix

Join Date: Sep 2015

Posts: 84
#4

04 Nov 2018, 11:26

Thanks, I did review that webpage before posting this question. My main question is whether subjects who experienced the event after the end of the study should be coded as YES or NO. In my example above, I can identify when each firm filed for bankruptcy, and some firms (e.g., Firm 1) did so after the sample period. Is it correct to treat Firm 1 as NOT having experienced the event (i.e., bankruptcy) within the sample period and to set its failure indicator to 0?
Clyde Schechter

Join Date: Apr 2014

Posts: 30104
#5

04 Nov 2018, 20:45

Your data structure looks inappropriate for what you want to do here. From what you describe, there is no reason to have multiple observations per id. You need a single observation for each id containing the following variables: the year it was founded (founding_yr), a final_yr which will be either the year in which it went bankrupt or, if it never went bankrupt, the last year for which you have information about it (probably the last value of current_yr in your present data set), and a variable to indicate whether the firm went bankrupt or not.

How you determine whether the firm went bankrupt or not seems confusing to me. I would ordinarily think that it would be the case that the firm went bankrupt if the year of bankruptcy shown falls within the range of values of current_yr for that id. But your calculation in your example data does not agree with that, and I really don't understand how you decided that id 2 should have OOB = 1.

Anyway, if my proposed interpretation of how to decide whether the firm went bankrupt is correct, then the code would look like this:

Code:

// VERIFY CONSISTENCY OF FOUNDING_YR AND BANKRUPTCY_YR WITHIN ID
by id (founding_yr), sort: assert founding_yr[1] == founding_yr[_N]
by id (bankruptcy_yr), sort: assert bankruptcy_yr[1] == bankruptcy_yr[_N]

// REDUCE TO ONE OBS PER ID
collapse (first) founding_yr bankruptcy_yr (max) final_yr = current_yr, by(id)
gen byte went_bankrupt = inrange(bankruptcy_yr, founding_yr, final_yr)

stset final_yr, failure(went_bankrupt) origin(founding_yr)
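
To make the one-observation-per-firm approach concrete, here is a minimal self-contained sketch. It assumes the five example rows from post #1, keyed in with -input- (nothing here comes from the real data set), and it spells the origin as origin(time founding_yr), which is the long form of the same option.

```stata
// Example rows from post #1, typed in by hand for illustration only
clear
input id current_yr founding_yr bankruptcy_yr
1 2007 2005 2011
1 2008 2005 2011
2 2007 2007 2010
2 2008 2007 2010
2 2009 2007 2010
end

// One observation per firm: last observed year becomes final_yr
collapse (first) founding_yr bankruptcy_yr (max) final_yr = current_yr, by(id)

// Failure only if the bankruptcy year falls inside the observed range
gen byte went_bankrupt = inrange(bankruptcy_yr, founding_yr, final_yr)

stset final_yr, failure(went_bankrupt) origin(time founding_yr)

// Inspect the analysis time (_t) and failure flag (_d) that -stset- created
list id founding_yr final_yr went_bankrupt _t0 _t _d, noobs
```

With these rows, neither firm has a bankruptcy_yr inside its observed range, so both should come out as censored (_d = 0), with _t equal to final_yr minus founding_yr.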
Cooper Felix

Join Date: Sep 2015

Posts: 84
#6

04 Nov 2018, 22:47

Dear Clyde,

Thanks for your response, I appreciate your comments. There was an error in my prior post: the bankruptcy year of Firm 2 should be 2009, not 2010. The corrected data are shown below.

Note that the data cover 2007-2009, and my primary goal is to find out whether a time-varying variable (time_variant_var, e.g., R&D expenses) contributes to firm bankruptcy. That is why I want to keep the data in long panel form. Does that make sense to you? In this case, do I still need to collapse the data as you described? I suspect that if I do so, I will not be able to estimate the impact of time_variant_var.

Code:

id  current_yr  founding_yr  bankruptcy_yr  OOB  time_variant_var
1   2007        2005         2011           0    100
1   2008        2005         2011           0    124
2   2007        2007         2009           0    50
2   2008        2007         2009           0    45
2   2009        2007         2009           1    10

Last edited by Cooper Felix; 04 Nov 2018, 22:50.
Clyde Schechter

Join Date: Apr 2014

Posts: 30104
#7

05 Nov 2018, 00:20

With a time-varying covariate you do need multiple observations per id. In that case, each observation corresponds to a time interval. The time interval ends at the value of the variable that appears directly after -stset-, and it begins at the end of the preceding time interval for the same id (or at the time of origin). Then there is the failure variable, which now denotes the failure status at the end of the interval for that observation.

Code:

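by id (founding_yr), sort: assert founding_yr[1] == founding_yr[_N]
by id (bankruptcy_yr), sort: assert bankruptcy_yr[1] == bankruptcy_yr[_N]

// id() links each firm's records so that every interval starts where the previous one ended
stset current_yr, id(id) failure(OOB) origin(founding_yr)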
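
To see the intervals that -stset- builds from multiple records per firm, a minimal sketch follows. It assumes the corrected example rows from post #6, typed in with -input-, and it assumes that id(id) is specified: without id(), -stset- treats every observation as a separate subject, and the intervals would not chain together as described above.

```stata
// Corrected example rows from post #6, typed in by hand for illustration only
clear
input id current_yr founding_yr bankruptcy_yr OOB time_variant_var
1 2007 2005 2011 0 100
1 2008 2005 2011 0 124
2 2007 2007 2009 0 50
2 2008 2007 2009 0 45
2 2009 2007 2009 1 10
end

// Multiple-record survival data: id() ties the records of one firm together
stset current_yr, id(id) failure(OOB) origin(time founding_yr)

// _t0 and _t are the start and end of each interval, _d is the failure flag,
// and _st marks whether the record is used in the analysis
list id current_yr _t0 _t _d _st, sepby(id) noobs
```

In this listing I would expect firm 2's 2007 record, where current_yr equals founding_yr, to be flagged _st = 0, because its interval has zero length.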
Cooper Felix

Join Date: Sep 2015

Posts: 84
#8

06 Nov 2018, 22:58

Dear Clyde,

You've been a big help to me. Thanks so much for your time and quick response!
Cooper Felix

Join Date: Sep 2015

Posts: 84
#9

07 Nov 2018, 00:15

Hi Clyde,

Sorry to bother you again. I'm trying to interpret the output from Cox regression. Based on the following results, I should interpret that RD is positively related to OOB (out of business) while CAP is negatively related to OOB (meaning CAP helps business to survive), am I correct?

A second question: when I stset the data as you suggested, Stata excludes observations where current_yr equals founding_yr. Can I bypass this by forcing Stata to include the founding year in my subsequent analysis?

Please help at your earliest convenience. Thanks!

Code:

. stcox RD CAP,nohr nolog

         failure _d:  OOB == 1
   analysis time _t:  (year-origin)
             origin:  time founding_yr
                 id:  firm_id

Cox regression -- Breslow method for ties

No. of subjects =       13,832                  Number of obs    =      54,005
No. of failures =        1,130
Time at risk    =        57420
                                                LR chi2(2)       =      691.18
Log likelihood  =   -10006.509                  Prob > chi2      =      0.0000

------------------------------------------------------------------------------
          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          RD |   2.164087   .5232751     4.14   0.000     1.138487    3.189688
         CAP |  -5.092229   .1812892   -28.09   0.000    -5.447549   -4.736909
------------------------------------------------------------------------------

Last edited by Cooper Felix; 07 Nov 2018, 00:35.
Clyde Schechter

Join Date: Apr 2014

Posts: 30104
#10

07 Nov 2018, 08:18

I should interpret that RD is positively related to OOB (out of business) while CAP is negatively related to OOB (meaning CAP helps business to survive), am I correct?

Correct. Or, perhaps better: increasing values of RD are associated with earlier occurrence of OOB, and increasing values of CAP are associated with later occurrence of OOB.
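
If it helps with the wording, the coefficients can also be read as hazard ratios by exponentiating them. A minimal sketch, using only the two point estimates from the output above (the -nohr- option is what makes -stcox- show coefficients instead of hazard ratios):

```stata
// Hazard ratio for a one-unit increase in RD (coefficient taken from the output above)
display exp(2.164087)

// Hazard ratio for a one-unit increase in CAP
display exp(-5.092229)
```

These work out to roughly 8.7 and 0.006: a ratio above 1 means a higher hazard of OOB, and a ratio below 1 means a lower hazard. Rerunning -stcox RD CAP- without -nohr- should report the same quantities directly. Whether a one-unit change is a meaningful contrast depends on how RD and CAP are scaled, which the thread does not say.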

A second question: when I stset the data as you suggested, Stata excludes observations where current_yr equals founding_yr. Can I bypass this by forcing Stata to include the founding year in my subsequent analysis?

Yes, you can, but you shouldn't try. These are cases that have "survived" 0 time at that point: the nature of analyzing a survival function is that, by definition, nobody starts off "dead."

There is one circumstance where you can and should monkey with this, but it requires different data from what you are working with. You may want to argue that the information in the data set describes the state of the business at the end of the year designated in current_yr, and that the founding date is most likely earlier in the year. It would then be reasonable to give the business credit for having survived those first months. But to do that, you need your dates, both founding and current, to be more granular: monthly or daily dates are needed to distinguish these points in time.
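
Purely as an illustration of what more granular dates would look like, here is a minimal sketch with daily dates. Everything in it is hypothetical: the variable names found_date and obs_date, the founding date of 15 March 2007, and the assumption that each record describes the firm on 31 December of its fiscal year. None of this is in the posted data.

```stata
// Hypothetical daily dates for firm 2, made up for illustration only
clear
input id str9 found_str str9 obs_str OOB
2 "15mar2007" "31dec2007" 0
2 "15mar2007" "31dec2008" 0
2 "15mar2007" "31dec2009" 1
end

// Convert the strings to Stata daily dates
gen found_date = date(found_str, "DMY")
gen obs_date   = date(obs_str, "DMY")
format found_date obs_date %td

// scale(365.25) expresses analysis time in years rather than days
stset obs_date, id(id) failure(OOB) origin(time found_date) scale(365.25)

list id obs_date _t0 _t _d _st, noobs
```

Here the first record covers about 0.8 years of survival instead of zero, so it no longer has zero length and should stay in the analysis.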