
  • Should I use cluster robust standard errors in discrete time survival analysis with person-period data

    Hi,

    I am new to this forum and hope that my post follows proper etiquette (please do let me know if it does not, and I will be happy to adjust). I am working with longitudinal complex survey data. All variables are updated every 2 years, providing up to 8 waves/time-points for analysis per person. I am interested in estimating the effect of a time-varying exposure (arthritis) on the time-to-first occurrence of heart disease, controlling for both time-invariant (e.g. sex, race, education...) and time-varying covariates (chronic comorbidities, medication use, etc...). My primary analytic approach is discrete-time survival analysis. I have prepared the person-period dataset with 8 time dummy variables reflecting calendar time from the beginning of the survey (1994/95 through 2010/11) and use a logit link for the model. The baseline hazard for the model with the time dummies as the only predictors is relatively flat, given that the probability of developing incident heart disease in a given 2-year period is fairly constant; however, I keep time as dummies because I see some fluctuations in the models that include covariates.
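    For readers following along, a minimal sketch of generating such period dummies, assuming a person-period variable named period taking values 1 through 8 (the variable name is hypothetical, not from the original dataset):

        tabulate period, generate(_d)    // creates indicators _d1-_d8, one per 2-year period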

    The data contain roughly 12,000 observations, weighted to represent the Canadian population in 1994. Thus far I have been performing all analyses including the 8 time-dummy variables [_d1-_d8], my time-invariant (TI) and time-varying covariates (TVC), the survey weights specified as pweights, and cluster-robust standard errors via vce(cluster id). My Stata code looks something like this:

    logit _Y _d1-_d8 TI TV [pweight=WT64LS], nocons vce(cluster REALUKEY)

    The approach recommended by Statistics Canada for analysis of their surveys is to use survey commands with bootstrap standard errors. Stata of course has these options through the svy commands; however, svy: logit does not permit clustered standard errors.

    Here are my questions:

    My intuition is that although the baseline hazard for the outcome heart disease is relatively constant, my time-varying covariates are highly correlated within person. In particular, my time-varying exposure arthritis is an "absorbing state": once a person reports having arthritis, they are treated as having arthritis until the outcome, death, or censoring, so future values depend on prior values.

    1. Would you recommend using the svy: logit procedure with no clustered standard errors, or proceeding as I did above, specifying the pweight and cluster id?

    2. My other issue: following Dr. Jenkins's lesson plans, I would like to test for unobserved heterogeneity using a multilevel model but have not been successful at generating results, particularly with survey weights. Should I treat the dummy time variables as fixed or random effects? Do I specify the survey weights in both the fixed and random parts of the model? My experience thus far is that I am able to estimate a model using dummy time as fixed effects and id as random with no survey weights; however, estimation breaks down when I try to specify dummy time as random effects with pweights. Any guidance, particularly with Stata code, would be greatly appreciated.

    Thanking you in advance

    Orit



  • #2
    I am afraid I cannot answer your questions but could you tell us who Dr. Jenkins is?

    Comment


    • #3
      Dr. Jenkins is a professor at the London School of Economics with an online course in survival analysis. Here is his website https://www.iser.essex.ac.uk/resourc...sis-with-stata

      Comment


      • #4
        1. Would you recommend using the svy: logit procedure with no clustered standard errors, or proceeding as I did above, specifying the pweight and cluster id?
        Why would you not follow Statistics Canada's recommendations? The help for logit states: "vce(), nocoef, and weights are not allowed with the svy prefix". But what happens if you svyset the data first and then use the svy prefix?

        2. My other issue: following Dr. Jenkins's lesson plans, I would like to test for unobserved heterogeneity using a multilevel model but have not been successful at generating results, particularly with survey weights. Should I treat the dummy time variables as fixed or random effects? Do I specify the survey weights in both the fixed and random parts of the model? My experience thus far is that I am able to estimate a model using dummy time as fixed effects and id as random with no survey weights; however, estimation breaks down when I try to specify dummy time as random effects with pweights. Any guidance, particularly with Stata code, would be greatly appreciated.
        Have you looked at melogit in Stata 14? According to its help file, "by and svy are allowed; see prefix." On your question about how to treat the binary indicator (dummy) variables characterising the baseline hazard, I don't really understand the question -- you are not very clear. It appears that you want to fit a model with a normally distributed random intercept (a common way of specifying unobserved heterogeneity, or "frailty"), and svy: melogit applied to appropriately svyset data will, I think, allow you to do this. [Personally, I would use mecloglog, because I prefer the proportional hazards interpretation of the parameters.] These sorts of models have been around for a long time -- see e.g. Bruce Meyer's paper in Econometrica 1990, which also uses a set of dummies to specify the baseline hazard. To me, the extra complication relative to those papers appears to be how to handle the survey design aspect, but the me suite in Stata 14 appears to be your friend for this.
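        As a rough sketch only, the random-intercept ("frailty") specification described above might look like the following, reusing the variable names from the original post and assuming the data have already been svyset with the appropriate weights:

            svy: melogit _Y _d1-_d8 TI TV, noconstant || REALUKEY:
            * or, for a proportional-hazards interpretation of the parameters:
            svy: mecloglog _Y _d1-_d8 TI TV, noconstant || REALUKEY: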

        Comment


        • #5
          Hi Dr. Jenkins,

          Thank you for your rapid reply to my post. My regrets for not being clear in my first post. I can and have svyset my data as you suggest, specifying the pweight and bootstrap replication weights provided by Statistics Canada.

          My first question is which of the following two models to select:

          1. logit _Y _d1-_d8 TI TV [pweight=WT64LS], nocons vce(cluster REALUKEY)

          vs

          2. svy: logit _Y _d1-_d8 TI TV

          Note: The second model provides design-based point estimates and standard errors but treats every observation in the person-period dataset as independent. There is no way to specify a cluster() option with the svy procedure (that I can see). My question is: when and how important is it to specify cluster(id) in the context of discrete-time survival analysis with data in person-period format? Is the concern time dependence in the outcome only, or do I need to consider time dependence in the predictor variables as well?

          Regarding the use of logit vs. cloglog links, I have actually run the analysis both ways (the survival analysis notes you provide on your website are very clear and easy to follow; thank you for sharing). Given that my outcome is relatively rare (incidence approximately 0.02 per 2-year period), the hazard odds ratios and standard errors (SEs) estimated with logit are a very close approximation to the hazard ratios and SEs estimated with cloglog. I have favoured the logit link as I thought it would be more familiar to readers of medical journals and would also be consistent with using mlogit to examine competing risks. I am also hoping to eventually run a causal mediation analysis using methods by VanderWeele and Valeri (2012) that provide Stata code for estimation using logistic regression. Does this seem reasonable?
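          For concreteness, the side-by-side comparison described could be sketched as follows, reusing the model from the first post (a sketch only, not the exact code that was run):

              logit _Y _d1-_d8 TI TV [pweight=WT64LS], nocons vce(cluster REALUKEY) or
              estimates store m_logit
              cloglog _Y _d1-_d8 TI TV [pweight=WT64LS], nocons vce(cluster REALUKEY) eform
              estimates store m_cloglog
              estimates table m_logit m_cloglog, se    // odds ratios approximate hazard ratios when the event is rare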

          I have tried estimating unobserved heterogeneity with melogit, but did so trying to manually specify options for pweights at level 1 and level 2, which did not work well. I also could not figure out whether I needed to add the dummy time variables to the random part of the model. I did not know that melogit was supported by svy in Stata 14. Thank you for pointing this out; I will definitely try this approach. I assume estimation time will be lengthy...

          Thanks!

          Orit



          Comment


          • #6
            The second model provides design-based point estimates and standard errors but treats every observation in the person-period dataset as independent.
            This statement is not true if you have correctly specified a primary sampling unit in your svyset statement.
            Steve Samuels
            Statistical Consulting
            [email protected]

            Stata 14.2

            Comment


            • #7
              Interesting. I am provided with a single longitudinal sampling weight by Statistics Canada, as well as bootstrap weights based on 500 replications. I do not have variables indicating the various stages of selection, so I use this longitudinal weight as the pweight and the bootstrap weight variables in the svyset command.

              Comment


              • #8
                Quite right -- I was mistaken: you don't need a PSU in your svyset with replicate weight variables. However, the replicates themselves are structured to represent the PSUs, so svyset does account for the clustering.
                Steve Samuels
                Statistical Consulting
                [email protected]

                Stata 14.2

                Comment


                • #9
                  The second model provides design-based point estimates and standard errors but treats every observation in the person-period dataset as independent. There is no way to specify a cluster() option with the svy procedure (that I can see). My question is: when and how important is it to specify cluster(id) in the context of discrete-time survival analysis with data in person-period format? Is the concern time dependence in the outcome only, or do I need to consider time dependence in the predictor variables as well?
                  Usually I think of the psu variable in svyset as playing the role of the clustering variable -- though note Steve Samuels's remarks about this in the case of replicate weights. However, survey design aspects aside, you do not need to account for clustering when you have person-period data. Remember that having the data organised in person-period format is simply a convenient "trick" that ensures that, when binary regression models such as logit or cloglog are applied to such data, they maximize the correct likelihood function. That is, getting the "right" estimates for the model comes from the combination of data (re)organisation and binary regression model. Having repeated observations for each person is irrelevant in this context. There may be other reasons for accounting for clustering -- notably survey design features -- but that is a different issue.
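                  A minimal sketch of the data (re)organisation referred to above, assuming one row per person with a discrete survival time survtime (in periods) and an event indicator event (all variable names here are hypothetical):

                      expand survtime                        // one row per person-period at risk
                      bysort id: generate t = _n             // period counter within person
                      generate _Y = (t == survtime) & event  // outcome is 1 only in the final period, if the event occurred
                      tabulate t, generate(_d)               // period dummies for the baseline hazard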

                  Comment


                  • #10
                    Thanks to all who are taking time to reply to me. I appreciate your help!

                    As per your recommendations, I re-ran my discrete-time survival analysis by svysetting my data, followed by the svy: logit procedure.

                    The Stata code resembles this:

                    svyset id [pweight=pweight], bsrweight(bsw1-bsw500) vce(bootstrap) mse

                    svy: logit Y dtime1 dtime2 dtime3... x1 x2 x3..

                    My question is regarding the Stata output from the svy: logit model. In the output preceding the regression table, the number of observations listed reflects the number of rows in the person-period dataset, not the number of individuals, and the estimated population size, again based on person-period rows rather than individuals, is grossly inflated. Can I just disregard this, because the maximization procedure in the logit model should still be correct as per Dr. Jenkins's comment above?

                    Second, I ran the melogit model to examine unobserved heterogeneity using the same svyset data:

                    svy: melogit Y dtime1 dtime2 dtime3... x1 x2 x3.., noconstant || id:, or

                    I received an error message that svy: melogit does not support vce(bootstrap). Any suggestions on what alternative specification would be appropriate?

                    Thanks!

                    Orit

                    Comment


                    • #11
                      Regarding your first question: "grossly inflated" is the wrong terminology. The likelihood function is correct, as I explained. When reporting their results, researchers often report both the number of person-period observations and the number of persons.

                      Comment


                      • #12
                        Thank you!

                        Comment


                        • #13
                          Just wondering: is "I am interested in estimating the effect of a time-varying exposure (arthritis) controlling for both time-invariant (e.g. sex, race, education...) and time-varying covariates (chronic comorbidities, medication use, etc...) on the time-to-first occurrence of heart disease in the dataset" not prototypical for time-varying confounding? If so, any attempt to adjust using "conventional" approaches, e.g. Cox or logistic regression (including the mixed-models one), will introduce bias; see http://www.ncbi.nlm.nih.gov/pubmed/10955408. One can deal with these types of scenarios using e.g. inverse probability weighting as implemented in marginal structural models (see http://www.stata-journal.com/article...article=st0075). Unfortunately, there is no out-of-the-box Stata command to do so, and the new Stata 14 treatment-effects command for survival data, stteffects, does not allow for time-varying covariates.
                          Cheers

                          Comment


                          • #14
                            Stefan: you've just posted on a different topic to that covered in this thread. Please re-post, starting a new thread.

                            Comment


                            • #15
                              I was quoting this thread's first posting by Orit Schieir -- just a comment, not a question...
                              Regards!

                              Comment
