
  • Should I use cluster robust standard errors in discrete time survival analysis with person-period data

    Hi,

    I am new to this forum and hope that my post follows proper etiquette (please do let me know if it does not, and I will be happy to adjust). I am working with longitudinal complex survey data. All variables are updated every 2 years, providing up to 8 waves/time-points for analysis per person. I am interested in estimating the effect of a time-varying exposure (arthritis) on the time-to-first occurrence of heart disease, controlling for both time-invariant (e.g. sex, race, education...) and time-varying covariates (chronic comorbidities, medication use, etc...). My primary analytic approach is discrete-time survival analysis. I have prepared the person-period dataset with 8 time dummy variables reflecting calendar time from the beginning of the survey (1994/95 through 2010/11) and use a logit link for the model. The baseline hazard for the model with the time dummies as the only predictors is relatively flat, given that the probability of developing incident heart disease in a given 2-year period is fairly constant; however, I keep time as dummies because I see some fluctuations in the models that include covariates.
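    For readers following along, a minimal sketch of generating such period dummies, assuming a person-period variable named period taking values 1 through 8 (the variable name is hypothetical, not from the original dataset):

        tabulate period, generate(_d)    // creates indicators _d1-_d8, one per 2-year period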

    The data contain roughly 12,000 observations, weighted to represent the Canadian population in 1994. Thus far I have been performing all analyses including the 8 time-dummy variables [_d1-_d8], my time-invariant (TI) and time-varying covariates (TVC), the survey weights specified as pweights, and cluster-robust standard errors via vce(cluster id). My Stata code looks something like this:

    logit _Y _d1-_d8 TI TV [pweight=WT64LS], nocons vce(cluster REALUKEY)

    The approach recommended by Statistics Canada for analysis of their surveys is to use survey commands with bootstrap standard errors. Stata of course has these options through the svy commands; however, svy: logit does not permit clustered standard errors.

    Here are my questions:

    My intuition is that although the baseline hazard for the outcome heart disease is relatively constant, my time-varying covariates are highly correlated within person. In particular, my time-varying exposure arthritis is an "absorbing state": once a person reports having arthritis, they are treated as having arthritis until the outcome, death, or censoring, so future values depend on prior values.

    1. Would you recommend using the svy: logit procedure with no clustered standard errors, or proceeding as I did above, specifying the pweight and cluster id?

    2. My other issue: following Dr. Jenkins's lesson plans, I would like to test for unobserved heterogeneity using a multilevel model but have not been successful at generating results, particularly with survey weights. Should I treat the dummy time variables as fixed or random effects? Do I specify the survey weights in both the fixed and random parts of the model? My experience thus far is that I am able to estimate a model using dummy time as fixed effects and id as random with no survey weights; however, estimation breaks down when I try to specify dummy time as random effects with pweights. Any guidance, particularly with Stata code, would be greatly appreciated.

    Thanking you in advance

    Orit



  • #2
    I am afraid I cannot answer your questions but could you tell us who Dr. Jenkins is?

    Comment


    • #3
      Dr. Jenkins is a professor at the London School of Economics with an online course in survival analysis. Here is his website https://www.iser.essex.ac.uk/resourc...sis-with-stata

      Comment


      • #4
        1. Would you recommend using the svy: logit procedure with no clustered standard errors, or proceeding as I did above, specifying the pweight and cluster id?
        Why would you not follow Statistics Canada's recommendations? The help for logit states: "vce(), nocoef, and weights are not allowed with the svy prefix". But what happens if you svyset the data first and then use the svy prefix?

        2. My other issue: following Dr. Jenkins's lesson plans, I would like to test for unobserved heterogeneity using a multilevel model but have not been successful at generating results, particularly with survey weights. Should I treat the dummy time variables as fixed or random effects? Do I specify the survey weights in both the fixed and random parts of the model? My experience thus far is that I am able to estimate a model using dummy time as fixed effects and id as random with no survey weights; however, estimation breaks down when I try to specify dummy time as random effects with pweights. Any guidance, particularly with Stata code, would be greatly appreciated.
        Have you looked at melogit in Stata 14? According to its help file, "by and svy are allowed; see prefix." On your question about how to treat the binary indicator (dummy) variables characterising the baseline hazard, I don't really understand the question -- you are not very clear. It appears that you want to fit a model with a normally distributed random intercept (a common way of specifying unobserved heterogeneity, or "frailty"), and svy: melogit applied to appropriately svyset data will, I think, allow you to do this. [Personally, I would use mecloglog, because I prefer the proportional hazards interpretation of the parameters.] These sorts of models have been around for a long time -- see e.g. Bruce Meyer's paper in Econometrica 1990, which also uses a set of dummies to specify the baseline hazard. To me, the extra complication relative to those papers appears to be how to handle the survey design aspect, but the me suite in Stata 14 appears to be your friend for this.
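        As a rough sketch only, the random-intercept ("frailty") specification described above might look like the following, reusing the variable names from the original post and assuming the data have already been svyset with the appropriate weights:

            svy: melogit _Y _d1-_d8 TI TV, noconstant || REALUKEY:
            * or, for a proportional-hazards interpretation of the parameters:
            svy: mecloglog _Y _d1-_d8 TI TV, noconstant || REALUKEY: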

        Comment


        • #5
          Hi Dr. Jenkins,

          Thank you for your rapid reply to my post. My regrets for not being clear in my first post. I can and have svyset my data as you suggest, specifying the pweight and bootstrap replication weights provided by Statistics Canada.

          My first question is which of the following two models to select:

          1. logit _Y _d1-_d8 TI TV [pweight=WT64LS], nocons vce(cluster REALUKEY)

          vs

          2. svy: logit _Y _d1-_d8 TI TV

          Note: The second model provides design-based point estimates and standard errors but treats every observation in the person-period dataset as independent. There is no way to specify a cluster() option with the svy procedure (that I can see). My question is: when and how important is it to specify cluster(id) in the context of discrete-time survival analysis with data in person-period format? Is the concern time dependence in the outcome only, or do I need to consider time dependence in the predictor variables as well?

          Regarding the use of logit vs. cloglog links, I have actually run the analysis both ways (the survival analysis notes you provide on your website are very clear and easy to follow; thank you for sharing). Given that my outcome is relatively rare (incidence approximately 0.02 per 2-year period), the hazard odds ratios and standard errors (SEs) estimated with logit are a very close approximation to the hazard ratios and SEs estimated with cloglog. I have favoured the logit link as I thought it would be more familiar to readers of medical journals and would also be consistent with using mlogit to examine competing risks. I am also hoping to eventually run a causal mediation analysis using methods by VanderWeele and Valeri (2012) that provide Stata code for estimation using logistic regression. Does this seem reasonable?
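          For concreteness, the side-by-side comparison described could be sketched as follows, reusing the model from the first post (a sketch only, not the exact code that was run):

              logit _Y _d1-_d8 TI TV [pweight=WT64LS], nocons vce(cluster REALUKEY) or
              estimates store m_logit
              cloglog _Y _d1-_d8 TI TV [pweight=WT64LS], nocons vce(cluster REALUKEY) eform
              estimates store m_cloglog
              estimates table m_logit m_cloglog, se    // odds ratios approximate hazard ratios when the event is rare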

          I have tried estimating unobserved heterogeneity with melogit, but did so trying to manually specify options for pweights at level 1 and level 2, which did not work well. I also could not figure out whether I needed to add the dummy time variables to the random part of the model. I did not know that melogit was supported by svy in Stata 14. Thank you for pointing this out; I will definitely try this approach. I assume estimation time will be lengthy...

          Thanks!

          Orit



          Comment


          • #6
            The second model provides design-based point estimates and standard errors but treats every observation in the person-period dataset as independent.
            This statement is not true if you have correctly specified a primary sampling unit in your svyset statement.
            Steve Samuels
            Statistical Consulting
            [email protected]

            Stata 14.2

            Comment


            • #7
              Interesting. I am provided with a single longitudinal sampling weight by Statistics Canada, as well as bootstrap weights based on 500 replications. I do not have variables indicating the various stages of selection, so I use this longitudinal weight as the pweight and the bootstrap weight variables in the svyset command.

              Comment


              • #8
                Quite right -- I was mistaken: you don't need a PSU in your svyset with replicate weight variables. However, the replicates themselves are structured to represent the PSUs, so svyset does account for the clustering.
                Steve Samuels
                Statistical Consulting
                [email protected]

                Stata 14.2

                Comment


                • #9
                  The second model provides design-based point estimates and standard errors but treats every observation in the person-period dataset as independent. There is no way to specify a cluster() option with the svy procedure (that I can see). My question is: when and how important is it to specify cluster(id) in the context of discrete-time survival analysis with data in person-period format? Is the concern time dependence in the outcome only, or do I need to consider time dependence in the predictor variables as well?
                  Usually I think of the psu variable in svyset as playing the role of the clustering variable -- though note Steve Samuels's remarks about this in the case of replicate weights. However, survey design aspects aside, you do not need to account for clustering when you have person-period data. Remember that having the data organised in person-period format is simply a convenient "trick" that ensures that, when binary regression models such as logit or cloglog are applied to such data, they maximize the correct likelihood function. That is, getting the "right" estimates for the model comes from the combination of data (re)organisation and binary regression model. Having repeated observations for each person is irrelevant in this context. There may be other reasons for accounting for clustering -- notably survey design features -- but that is a different issue.
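                  A minimal sketch of the data (re)organisation referred to above, assuming one row per person with a discrete survival time survtime (in periods) and an event indicator event (all variable names here are hypothetical):

                      expand survtime                        // one row per person-period at risk
                      bysort id: generate t = _n             // period counter within person
                      generate _Y = (t == survtime) & event  // outcome is 1 only in the final period, if the event occurred
                      tabulate t, generate(_d)               // period dummies for the baseline hazard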

                  Comment


                  • #10
                    Thanks to all who are taking time to reply to me. I appreciate your help!

                    As per your recommendations, I re-ran my discrete-time survival analysis by svysetting my data, followed by the svy: logit procedure.

                    The Stata code resembles this:

                    svyset id [pweight=pweight], bsrweight(bsw1-bsw500) vce(bootstrap) mse

                    svy: logit Y dtime1 dtime2 dtime3... x1 x2 x3..

                    My question is regarding the Stata output from the svy: logit model. In the output preceding the regression table, the number of observations listed reflects the number of rows in the person-period dataset, not the number of individuals, and the estimated population size, again based on person-period rows rather than individuals, is grossly inflated. Can I just disregard this, because the maximization procedure in the logit model should still be correct as per Dr. Jenkins's comment above?

                    Second, I ran the melogit model to examine unobserved heterogeneity using the same svyset data:

                    svy: melogit Y dtime1 dtime2 dtime3... x1 x2 x3.., noconstant || id:, or

                    I received an error message that svy: melogit does not support vce(bootstrap). Any suggestions on what alternative specification would be appropriate?

                    Thanks!

                    Orit

                    Comment


                    • #11
                      Regarding your first question: "grossly inflated" is the wrong terminology. The likelihood function is correct, as I explained. When reporting their results, researchers often report both the number of person-period observations and the number of persons.

                      Comment


                      • #12
                        Thank you!

                        Comment


                        • #13
                          Just wondering: is "I am interested in estimating the effect of a time-varying exposure (arthritis) controlling for both time-invariant (e.g. sex, race, education...) and time-varying covariates (chronic comorbidities, medication use, etc...) on the time-to-first occurrence of heart disease in the dataset" not prototypical for time-varying confounding? If so, any attempt to adjust using "conventional" approaches, e.g. Cox or logistic regression (including the mixed-models one), will introduce bias; see http://www.ncbi.nlm.nih.gov/pubmed/10955408. One can deal with these types of scenarios using e.g. inverse probability weighting as implemented in marginal structural models (see http://www.stata-journal.com/article...article=st0075). Unfortunately, there is no out-of-the-box Stata command to do so, and the new Stata 14 treatment-effects command for survival data, stteffects, does not allow for time-varying covariates.
                          Cheers

                          Comment


                          • #14
                            Stefan: you've just posted on a different topic to that covered in this thread. Please re-post, starting a new thread.

                            Comment


                            • #15
                              I was quoting this thread's first posting by Orit Schieir -- just a comment, not a question...
                              Regards!

                              Comment
