  • Log-transforming control variable in Cox PH model (stcox) drastically changes results

    First time posting, please let me know if I'm breaking any rules or if more information is required.

    I am using -stcox- in Stata/BE 17.0 to fit a Cox proportional-hazards model. The dataset is a panel of mid- to large-size U.S. cities from 1977 to 2023, with 7,821 subjects, 46 time periods, and 127 failures.

    Following a reviewer comment, I changed the specification of my model to log-transform two of my control variables: population and the number of program units ("vouchers"). This, unexpectedly, drastically changed the hazard ratios for my independent variable of interest. I am looking for any insight into why this might have occurred, and into whether there is any way to determine if the log transformation is correct. I recognize that the results being this sensitive to such a seemingly innocuous specification change may itself be an indication of spurious results.

    Relevant code below:

    Code:
    gen log_population = ln(total_pop_i + 1)
    gen log_vouchers = ln(total_units_i + 1)

    stset year, origin(time 1977) enter(time 1977) id(jurisdiction) failure(soi_enacted == 1)

    stcox i.LPHA_principal black_diff other_diff total_pop_i total_units_i vacancy partisan_lean_i i.prior_state_law i.prior_state_preemption i.prior_county_law cumlocallaws_unitspct i.jurisdictiontype

    stcox i.LPHA_principal black_diff other_diff log_population log_vouchers vacancy partisan_lean_i i.prior_state_law i.prior_state_preemption i.prior_county_law cumlocallaws_unitspct i.jurisdictiontype

    The first stcox specification returns the following:

    [Screenshot of the first -stcox- output: Screenshot 2024-07-12 at 6.47.12 PM.png]

    The second returns:

    [Screenshot of the second -stcox- output: Screenshot 2024-07-12 at 6.50.41 PM.png]


    As you can see, the hazard ratios on the levels of "LPHA_principal" drop substantially and are no longer statistically significant.

    Any insight into why this occurred and how to determine the correct specification would be greatly appreciated. I would be happy to supply additional data, code, etc.
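
    One note on interpretation: in the levels specification the coefficient on population is per additional person, while in the logged specification it is per log-unit, i.e. per e-fold (roughly 2.72x) increase. A quick sketch (in Python purely for illustration; the populations below are hypothetical, not my actual data) shows how much ln(x + 1) compresses the scale of a covariate:

```python
import math

# Hypothetical city populations, for illustration only -- not the actual panel data.
pops = [50_000, 250_000, 1_000_000, 8_000_000]

# The same ln(x + 1) transform used in the model specification above.
logged = [math.log(p + 1) for p in pops]

# A range of nearly 8 million in levels collapses to about 5 log-units,
# so "one unit" of the covariate means something entirely different in
# the two specifications: one person vs. one e-fold population increase.
print([round(v, 2) for v in logged])      # spans roughly 10.8 to 15.9
print(round(logged[-1] - logged[0], 2))
```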

  • #2
    Henry:
    I do not understand the reason for your concern.
    Two different specifications of -stcox- returned two different sets of coefficients.
    While the difference between the two HRs you seem most interested in apparently narrowed, other predictors became statistically significant.
    Pinpointing the reasons for these different results is probably out of reach, though.
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Appreciate the response, Carlo!

      I have done some additional digging and found that the logged population and units variables violate the proportional-hazards assumption (per -estat phtest, detail-) in ways that the non-transformed variables do not. Based on Box-Steffensmeier and Zorn (2001), I take this to mean that the non-transformed variables are likely preferred. This also helps explain the changes in the coefficients, given that "estimation of Cox's model when hazards do not satisfy the proportionality assumption can result in biased and inefficient estimates of all parameters, not simply those for the covariate(s) in question" (Box-Steffensmeier and Zorn, 2001).

      I still don't fully understand why a simple log transformation would cause a variable to violate the proportionality assumption (and I'd be interested in further speculation!), but I feel I can now give an informed response to my reviewer.



      • #4
        Henry:
        unless you have the time and willingness to start from a simple -stcox- (that is, an -stcox- with one predictor only), add one more predictor at a time (in logged and non-logged form), and run -estat phtest, detail- after each step, it is difficult to delve into the issue.
        That said, on a more practical note, I think your reply to the reviewer's comment is satisfactory (or, better, I would be satisfied if I were the reviewer of your submission).
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          Applied to a variable like population, which ranges widely over mid- to large-sized cities, the log transformation is highly non-linear. So this transformation has far more radical effects than a simple linear transformation such as a change of scale or centering. Cox proportional-hazards models are complicated, and it is hard to develop intuitions for them. But let's focus on the proportional-hazards assumption. The proportional-hazards assumption for regressor X is logically equivalent to the assumption that the interaction effect between X and analysis time (_t) is zero. That's what proportional hazards means.
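
          In symbols (my notation, a standard way to write it), the relaxed model is

          \[
          h(t \mid X) = h_0(t)\exp\{\beta(t)\,X\}, \qquad \beta(t) = \beta + \gamma\, g(t),
          \]

          and proportional hazards holds exactly when \(\gamma = 0\), i.e. when the effect of X does not vary with analysis time. -estat phtest- assesses this by testing for a zero slope in a regression of the scaled Schoenfeld residuals for X on g(t).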

          To see what log transformations can do more clearly, let's look at some simple linear regression in a toy data set:

          Code:
          . clear*
          
          . set obs 100
          Number of observations (_N) was 0, now 100.
          
          . set seed 1234
          
          .
          . gen x = 10*_n
          
          . gen y = 2*log(x) + 5 + rnormal(0, 0.5)
          
          . gen byte z = inrange(x, 20, 80)
          
          .
          .
          . regress y c.x##i.z
          
                Source |       SS           df       MS      Number of obs   =       100
          -------------+----------------------------------   F(3, 96)        =    136.76
                 Model |  296.693411         3  98.8978036   Prob > F        =    0.0000
              Residual |  69.4212208        96  .723137717   R-squared       =    0.8104
          -------------+----------------------------------   Adj R-squared   =    0.8045
                 Total |  366.114632        99  3.69812759   Root MSE        =    .85038
          
          ------------------------------------------------------------------------------
                     y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
                     x |   .0047182   .0003268    14.44   0.000     .0040695    .0053669
                   1.z |  -4.750845   .8875774    -5.35   0.000    -6.512672   -2.989018
                       |
                 z#c.x |
                    1  |   .0491596   .0160739     3.06   0.003     .0172532    .0810661
                       |
                 _cons |   14.66168   .1970521    74.41   0.000     14.27053    15.05282
          ------------------------------------------------------------------------------
          
          .
          . gen log_x = log(x)
          
          .
          . regress y c.log_x##i.z
          
                Source |       SS           df       MS      Number of obs   =       100
          -------------+----------------------------------   F(3, 96)        =    474.80
                 Model |   342.99784         3  114.332613   Prob > F        =    0.0000
              Residual |  23.1167921        96  .240799918   R-squared       =    0.9369
          -------------+----------------------------------   Adj R-squared   =    0.9349
                 Total |  366.114632        99  3.69812759   Root MSE        =    .49071
          
          ------------------------------------------------------------------------------
                     y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
                 log_x |   1.982217   .0694636    28.54   0.000     1.844333    2.120101
                   1.z |  -2.050523   1.621147    -1.26   0.209    -5.268473    1.167428
                       |
             z#c.log_x |
                    1  |   .5168595   .4126781     1.25   0.213       -.3023    1.336019
                       |
                 _cons |   5.114951   .4267539    11.99   0.000     4.267851     5.96205
          ------------------------------------------------------------------------------
          In this case, I created a toy data set where y is in fact strongly linearly related to log x. But if one didn't know that in advance and fitted an interaction model between x and z, one would get a strong interaction effect. With the correct model of y vs. log x, the interaction effect shrinks to insignificance. Notice also that whereas z is strongly significant in the first model, its effect evaporates in the (correct) log-transformed model. If you have time, graphing the data superimposed on the fitted regression lines of both models will make it quite clear what is going on.
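
          The same toy experiment can be re-created outside Stata. Here is a rough Python sketch of it (a different random-number generator, so the exact estimates differ from the output above, but the qualitative pattern is the same):

```python
import numpy as np

def ols_tstats(X, y):
    """OLS coefficients and t-statistics; X must already include a constant column."""
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, beta / se

rng = np.random.default_rng(1234)          # not the same stream as Stata's seed
x = 10.0 * np.arange(1, 101)               # gen x = 10*_n
y = 2 * np.log(x) + 5 + rng.normal(0, 0.5, size=100)
z = ((x >= 20) & (x <= 80)).astype(float)  # inrange(x, 20, 80)

const = np.ones_like(x)
# Levels model: y ~ x + z + x#z  (wrong functional form)
b1, t1 = ols_tstats(np.column_stack([const, x, z, x * z]), y)
# Log model: y ~ log(x) + z + log(x)#z  (correct functional form)
lx = np.log(x)
b2, t2 = ols_tstats(np.column_stack([const, lx, z, lx * z]), y)

print(f"x#z interaction t (levels model):   {t1[3]:.2f}")
print(f"log(x)#z interaction t (log model): {t2[3]:.2f}")
print(f"log(x) slope: {b2[1]:.2f}")  # close to the true value of 2
```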

          I suppose this is the opposite of what you found, whereby a log transformation introduced an interaction (i.e. caused a PH violation). But the principle is the same: log transformations can have drastic effects on models, and inappropriately using, or failing to use, a log transform can lead to spurious results. In your case, the log transform introduced a PH violation, so you cannot use the results of that model, at least not with respect to the implicated variable: it is a mis-specified model. Whether to return to the untransformed variable, or to work around the PH violation by giving population a time-varying effect, is the question you face now. I don't have any advice to offer you on that.
