  • Cox regression segments

    Hello everyone,

    I am no expert on econometrics, so I hope you'll forgive me if I formulate my question a bit strangely. It concerns the Cox proportional hazards model.

    I want to do something similar to Cox-Edwards and Ureta in their paper "International Migration, Remittances, and Schooling: Evidence from El Salvador" (http://www.nber.org/papers/w9766). The proportional hazards assumption is violated, so they use segments for those covariates whose effect is not constant over the entire hazard. For instance, parental schooling has a different hazard ratio for each segment of the hazard: one for enrolling in school, one for primary education, one for secondary education, and so on.

    So far I have only found options like tvc(...) that deal with time-varying covariates, but that basically only multiplies the covariate by a function of time, which is not what I am interested in. The only way I can think of to avoid violating proportional hazards in this case is to censor the data and run a separate Cox regression for each segment of the hazard...

    Is there any way to estimate separate coefficients for segments of a single hazard?

    Thanks.
    Last edited by chrisb; 14 May 2014, 05:37.

  • #2
    I don't quite understand your data description. Do you mean that one of the variables in the data set is parental education (categorical)? If you want a separate baseline hazard for each parental education group, wouldn't you use the strata(parental_education) option in your stcox regression?



    • #3
      Thanks for the quick reply!

      Parental schooling (measured in years) is one of the variables; the others are a male dummy, access to electricity, income, a remittance dummy and the amount of remittances. The time scale runs from starting school to 12 years of education, and the event is dropping out of school.
      Cox-Edwards and Ureta stratified on year of birth since it continued to fail the proportional hazards assumption even after dividing the baseline up (into 4 segments: starting school, 1st to 6th grade, 7th-9th, 10th-12th). Stratifying on remittances, parental education, etc. is not really an option for me, since I am interested in their hazard ratios.

      In their table, male for instance was applied over the whole hazard, since it did not violate the proportional hazards assumption. For parental schooling and remittance amounts, multiple hazard ratios and S.E.s were reported, one for each segment, something like:

      Male: H.R. 1.2, S.E. 0.5
      Parental Schooling; 1st-6th grade: H.R. 0.86, S.E. 0.1
      Parental Schooling; 7th-9th grade: H.R. 0.91, S.E. 0.2
      ....
      Last edited by chrisb; 14 May 2014, 09:10.



      • #4
        Note that I work in biostatistics, so I use terminology from my own field.

        If I've understood correctly, the underlying timescale is school grade and the outcome is dropout. For example, a child who dropped out during the second grade would have a "survival time" of 2.

        I think you can achieve what you want by "episode splitting" using stsplit (the data must first be stset with the id() option). For example,

        stsplit segment, at(0 1 7 10 99)

        will create separate observations for each segment. The variable segment will be created to index these observations. For example, the original data for a child who dropped out in grade 8 would look like this.

        Code:
        _t0 _t _d
          0   8  1
        After splitting, the data for this child will look like this.

        Code:
        _t0 _t _d segment
         0   1  0   0
         1   7  0   1
         7   8  1   7
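
        To try the splitting on a toy example, here is a minimal sketch. The three-child dataset and the variable names (id, grade_left, dropout, male) are made up for illustration, and it assumes stsplit is run on data that were stset with id().

        ```stata
        * Hypothetical toy data: one row per child
        clear
        input id grade_left dropout male
        1  8 1 1
        2 12 0 0
        3  5 1 0
        end

        * Declare the survival data; id() is needed before stsplit
        stset grade_left, failure(dropout) id(id)

        * Split each child's follow-up at grades 1, 7 and 10
        stsplit segment, at(1 7 10)

        * Inspect the episodes; segment holds the lower cut point of each interval
        list id _t0 _t _d segment, sepby(id)
        ```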

        If you fit the model

        Code:
        stcox male parental_schooling
        then you will get the same estimates on the split and unsplit data (assuming PH for both variables). However, with the split data you can do

        Code:
        stcox male c.parental_schooling##i.segment
        The effect of sex is assumed the same across all segments, whereas you now get separate estimates of the HR for parental schooling for each of the segments.

        The estimate for the "main effect of segment" will look odd and can be ignored (it represents the effect of time in a model where time is the underlying time scale so is already adjusted for) but the effects of explanatory variables for each segment can be interpreted.

        I'm not completely clear how the data are structured, e.g., what is time zero and what value of time is assigned to children who never enrol, but I hope this gives you an idea of how you might approach the problem.







        • #5
          Thank you for the reply!

          I also thought about stsplit, but with the method you mentioned it still doesn't quite do what it is supposed to do.

          For instance, following your example and splitting at 3, 6, 9 and 12 years, I get the following result without segments:
          stcox male parental_schooling
                         _t | Haz. Ratio  Std. Err.        z   P>|z|   [95% Conf.   Interval]
                       male |   .8717109   .0472182    -2.53   0.011     .7839083    .9693481
          parentalschooling |   .9338305   .0076813    -8.32   0.000     .9188961    .9490077

          These are the results when I use
          stcox c.male##segment parental_schooling


                         |              Robust
                      _t | Haz. Ratio  Std. Err.        z   P>|z|   [95% Conf.   Interval]
          ---------------+----------------------------------------------------------------
                    male |   .8205875   .0564249    -2.88   0.004     .7171251    .9389768
                         |
                 segment |
                       3 |    16.2627          .        .       .            .           .
                       6 |   5.744286          .        .       .            .           .
                       9 |   8.131226          .        .       .            .           .
                         |
          segment#c.male |
                       3 |   1.427906   .2324393     2.19   0.029     1.037858     1.96454
                       6 |   1.247167   .4276583     0.64   0.519     .6368605    2.442333
                       9 |   .9998442   .1395795    -0.00   0.999     .7605074    1.314502
                         |
               head_educ |   .9339513   .0076874    -8.30   0.000     .9190053    .9491405
          --------------------------------------------------------------------------------


          stcox c.male parental_schooling##segment
          ----------------------------------------------------------------------------------------------
                                       |              Robust
                                    _t | Haz. Ratio  Std. Err.        z   P>|z|   [95% Conf.   Interval]
          -----------------------------+----------------------------------------------------------------
                                  male |   .8729167    .046767    -2.54   0.011     .7859034    .9695639
                    parental_schooling |    .915694   .0106874    -7.55   0.000     .8949849    .9368823
                                       |
                               segment |
                                     3 |   2.803193          .        .       .            .           .
                                     6 |   1.884996          .        .       .            .           .
                                     9 |   19.32501          .        .       .            .           .
                                       |
          segment#c.parental_schooling |
                                     3 |   1.008617   .0203006     0.43   0.670     .9696035    1.049201
                                     6 |   1.082017   .0274309     3.11   0.002     1.029567    1.137139
                                     9 |   1.072762   .0187679     4.01   0.000     1.036601    1.110185





          These results for segment#c.male and segment#c.parental_schooling do not make any sense to me. I think there is still something wrong here, or how should they be interpreted? Is it something like the hazard in a segment compared to the entire hazard (the ones in the first lines)?

          Thanks in advance!
          Last edited by chrisb; 15 May 2014, 08:19.



          • #6
            Let's consider the model with the c.male##segment interaction. You have four segments so you will get 4 separate estimates of the HR for males. The first estimate (labelled male) is for the reference level of segment. That is, the HR for male/female for the first segment is 0.82.

            You then get 3 interaction effects, each of which multiplies the reference HR. The HR for males/females for the second segment is 0.82*1.428=1.17, for the 3rd segment it's 0.82*1.247=1.02, and for the 4th segment it's 0.82*1.000=0.82.

            That is, if we assume proportional hazards then our estimated HR is 0.87. However, if we relax the PH assumption then we get estimates of 0.82, 1.17, 1.02, and 0.82 for the 4 segments.

            If you want Stata to give you HRs for each segment (with CIs) without doing these calculations by hand then you can use the lincom command. I believe you can also do this directly, but it requires a better understanding of Stata's way of specifying factor variables than I possess. I believe it is something like

            stcox i.segment i.segment#i.male

            The key is to use # rather than ## and to specify the model so that it looks like you are putting in the main effect of the modifier. I don't do a lot of programming myself so don't know all the details off the top of my head (e.g., if it also works with c.). Whatever you do, I suggest you verify your results with lincom and also look at the observed data (plots of the survivor functions or empirical hazards).
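
            A minimal sketch of the lincom route, assuming the split data from #5, where segment takes the values 0, 3, 6 and 9 with 0 as the base level. The hr option asks lincom to report the exponentiated combination, i.e. the segment-specific hazard ratio with its confidence interval. The last command is only a possible direct specification and should be checked against the lincom results.

            ```stata
            * Main effect of male plus its interactions; segment 0 is the reference
            stcox c.male##i.segment parental_schooling

            * HR for male vs female within each later segment
            lincom male + 3.segment#c.male, hr
            lincom male + 6.segment#c.male, hr
            lincom male + 9.segment#c.male, hr

            * Possible alternative (verify against lincom): with no main effect of
            * male, each segment should get its own male coefficient directly
            stcox i.segment#c.male parental_schooling
            ```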

            For example, the results from the Cox model suggest that the dropout rate for males is lower than for females in the first and last segments, but higher than for females in the second segment. You should be able to see this in plots of the empirical hazard (sts graph, by(male) hazard) or of the Kaplan-Meier curves.



            • #7
              It is working when I use the i.segment in the model! Thank you so much, I've been looking for this quite a while!

              I still have a few questions regarding the topic, if it's not too much trouble..

              The coefficients for the variables in the simple model stcox male parental_schooling (without using any #segment interactions) changed a little bit after splitting up the data. Can the models after the data splitting be considered a better estimate of the average impact of the variables on the hazard? Or are both models too inaccurate, since the Cox regression generally relies on proportional hazards?

              Do you happen to know how to run specific strata for certain segments? For instance, I want to have primary education stratified on all age groups (6-18 years for example), but the segment secondary education should only be stratified on the age groups that are old enough for it (something like 12-18 years).

              How can this be graphed? I've seen that there is the adjustfor() option for sts graph, where I think you have to centre the covariates you want to adjust for at 0. But I'm not sure how to incorporate segments (Stata says that interactions are not allowed in adjustfor(); do I have to do these by hand!?) and how to choose a stratum...



              • #8
                Originally posted by chrisb

                The coefficients for the variables in the simple model stcox male parental_schooling (without using any #segment interactions) changed a little bit after splitting up the data. Can the models after the data splitting be considered a better estimate of the average impact of the variables on the hazard? Or are both models too inaccurate, since the Cox regression generally relies on proportional hazards?
                There shouldn't be a difference in the estimates for the same model fitted to the split and unsplit data, assuming splitting has just changed the way the data is represented and not the content of the data. It's possible the splitting has changed the content of the data; you could check this superficially by using strate to check that the number of events and person-time at risk (for each combination of male and parental_schooling) haven't changed.
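
                A minimal sketch of that check, assuming the data are already stset with id() and using the cut points mentioned in #5. strate reports the number of events (D) and the person-time (Y) per group, so the two tables can be compared row by row.

                ```stata
                * Unsplit data: note D (events) and Y (person-time) for each group
                strate male parental_schooling

                * Split, then tabulate again; D and Y should be unchanged in every row
                stsplit segment, at(3 6 9 12)
                strate male parental_schooling
                ```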

                Another possibility is that this is an artefact of not having truly continuous data. You probably have a large number of ties in your data and you are splitting at the event times. I don't think this should be a problem, but if there is no evidence the data have changed then maybe you could look into this. For example, do some sensitivity analyses (different approximations for dealing with ties and/or different intervals).
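
                One way such a sensitivity analysis might look, as a sketch only: refit the same model under stcox's different tie-handling options and compare the hazard ratios. Which method suits these data is a judgment call that this sketch does not settle.

                ```stata
                * Breslow approximation (the stcox default)
                stcox male parental_schooling, breslow

                * Efron approximation, usually closer to the exact result with many ties
                stcox male parental_schooling, efron

                * Exact partial likelihood; can be slow when ties are numerous
                stcox male parental_schooling, exactp
                ```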

                I didn't previously bring up the issue of discrete vs continuous time because your aim was to replicate the analysis done by another group and I didn't want to open a Pandora's box. That box is now open and it's probably worth considering the implications of modelling your data assuming time is continuous. Have a look at Professor Jenkins's notes on "Survival Analysis with Stata" (Google should find them). Not only will you find details of continuous versus discrete time models but there's a lot of information that you will find useful (not least how to use stsplit).

                Originally posted by chrisb
                Do you happen to know how to run specific strata for certain segments? For instance, I want to have primary education stratified on all age groups (6-18 years for example), but the segment secondary education should only be stratified on the age groups that are old enough for it (something like 12-18 years).
                Just create your own variables for the exact interaction terms you want to estimate: either build the dummy variables by hand, or create a new variable in which some segments are grouped into a single category.
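
                Both options could be sketched roughly as follows, assuming split data with segment values 0, 3, 6 and 9. The new variable names (male_seg3, male_seg6, segment2) and the particular grouping are hypothetical and only illustrate the idea.

                ```stata
                * Option 1: hand-made interaction dummies for the chosen segments only
                generate byte male_seg3 = male * (segment == 3)
                generate byte male_seg6 = male * (segment == 6)
                stcox male male_seg3 male_seg6 parental_schooling

                * Option 2: group segments into a coarser variable, then interact
                recode segment (0 3 = 0) (6 9 = 6), generate(segment2)
                stcox c.male##i.segment2 parental_schooling
                ```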

                Originally posted by chrisb
                How can this be graphed? I've seen that there is the adjustfor() option for sts graph, where I think you have to centre the covariates you want to adjust for at 0. But I'm not sure how to incorporate segments (Stata says that interactions are not allowed in adjustfor(); do I have to do these by hand!?) and how to choose a stratum.
                What is it you want to graph: the survivor function, hazard function, or hazard ratio? I'm guessing you want the survivor function. My Stata skills are not that strong (compared to others on this list) so it's better I say less rather than mislead you. I'm more comfortable with my knowledge of concepts in modelling than I am with writing the code. That said, -stcurve- is the command for plotting fitted survival curves following stcox.
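
                A minimal sketch for the simplest case, a model without segment interactions. It assumes the at1()/at2() syntax of stcurve from Stata versions of that era; covariates not listed in at#() are held at their means. How best to handle interaction terms here is left open.

                ```stata
                * Fit a model without interactions, then plot fitted survivor curves
                stcox male parental_schooling

                * One curve for females (male=0), one for males (male=1)
                stcurve, survival at1(male=0) at2(male=1)
                ```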



                • #9
                  I think I've got everything I was looking for.

                  Thank you for taking the time to help me, I've really learned a lot!
