  • Event History Analysis vs. Logit

    I have a conundrum regarding a policy analysis I'm running. I have a variable, stateorder, that is coded 0 when the policy is not in effect, 1 once it goes into effect, and 0 again after it is rescinded. I ran a logit on the data yesterday and received a message indicating that I might have complete or quasi-complete separation, which I'm not sure how to fix. Here's a data sample:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(stateorder medicaid_expansion demgov) float(div_gov percapita_deaths ideology_diff)
    1 0 0 0  .0002178176 -13.41059
    1 0 0 0 .00022515976 -13.41059
    1 0 0 0 .00022719926 -13.41059
    1 0 0 0  .0002286269 -13.41059
    1 0 0 0 .00022923875 -13.41059
    1 0 0 0  .0002373967 -13.41059
    1 0 0 0 .00024698232 -13.41059
    1 0 0 0 .00025085735 -13.41059
    1 0 0 0 .00025799556 -13.41059
    1 0 0 0  .0002622785 -13.41059
    1 0 0 0 .00026248244 -13.41059
    1 0 0 0 .00026329825 -13.41059
    1 0 0 0 .00026574562 -13.41059
    1 0 0 0 .00027818652 -13.41059
    1 0 0 0 .00028491684 -13.41059
    1 0 0 0 .00029327875 -13.41059
    1 0 0 0 .00029694985 -13.41059
    1 0 0 0   .000300417 -13.41059
    1 0 0 0 .00030408805 -13.41059
    1 0 0 0 .00030408805 -13.41059
    1 0 0 0  .0003136737 -13.41059
    1 0 0 0  .0003191803 -13.41059
    1 0 0 0  .0003222395 -13.41059
    1 0 0 0  .0003269303 -13.41059
    1 0 0 0  .0003318251 -13.41059
    1 0 0 0  .0003330488 -13.41059
    1 0 0 0  .0003397791 -13.41059
    1 0 0 0  .0003456937 -13.41059
    1 0 0 0  .0003495687 -13.41059
    1 0 0 0  .0003538516 -13.41059
    1 0 0 0  .0003579306 -13.41059
    1 0 0 0  .0003605819 -13.41059
    1 0 0 0  .0003664965 -13.41059
    1 0 0 0  .0003766939 -13.41059
    1 0 0 0  .0003838321 -13.41059
    1 0 0 0  .0003854637 -13.41059
    1 0 0 0  .0003860756 -13.41059
    1 0 0 0  .0003866874 -13.41059
    1 0 0 0  .0003870953 -13.41059
    1 0 0 0  .0003926019 -13.41059
    1 0 0 0  .0003948454 -13.41059
    1 0 0 0   .000396477 -13.41059
    1 0 0 0  .0004025955 -13.41059
    1 0 0 0 .00040708235 -13.41059
    1 0 0 0  .0004101416 -13.41059
    1 0 0 0  .0004105495 -13.41059
    1 0 0 0  .0004127929 -13.41059
    1 0 0 0  .0004154443 -13.41059
    1 0 0 0 .00041707585 -13.41059
    1 0 0 0 .00042339825 -13.41059
    1 0 0 0  .0004297207 -13.41059
    1 0 0 0  .0004388984 -13.41059
    1 0 0 0  .0004409379 -13.41059
    1 0 0 0  .0004450169 -13.41059
    1 0 0 0  .0004486879 -13.41059
    1 0 0 0  .0004521551 -13.41059
    1 0 0 0 .00045541825 -13.41059
    1 0 0 0  .0004621486 -13.41059
    1 0 0 0  .0004639841 -13.41059
    1 0 0 0  .0004641881 -13.41059
    1 0 0 0  .0004641881 -13.41059
    end
    ------------------ copy up to and including the previous line ------------------

    I began with a linear regression model, which works just fine (although I understand there are potential issues with linearity, etc.):

    HTML Code:
    reg stateorder medicaid_expansion percapita_deaths ideology_diff prop_neighbors div_go
    > v demgov
    
          Source |       SS           df       MS      Number of obs   =    22,908
    -------------+----------------------------------   F(6, 22901)     =   2075.11
           Model |  1372.63861         6  228.773102   Prob > F        =    0.0000
        Residual |  2524.75269    22,901  .110246395   R-squared       =    0.3522
    -------------+----------------------------------   Adj R-squared   =    0.3520
           Total |   3897.3913    22,907  .170139752   Root MSE        =    .33203
    
    ------------------------------------------------------------------------------------
            stateorder |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------------+----------------------------------------------------------------
    medicaid_expansion |   .1776538   .0035522    50.01   0.000     .1706913    .1846162
      percapita_deaths |   20.83641   3.134149     6.65   0.000     14.69326    26.97955
         ideology_diff |  -.0006289   .0000425   -14.78   0.000    -.0007123   -.0005455
        prop_neighbors |  -.2364633   .0120137   -19.68   0.000    -.2600109   -.2129157
               div_gov |   .1939881   .0053275    36.41   0.000      .183546    .2044303
                demgov |   .2594809   .0049884    52.02   0.000     .2497034    .2692584
                 _cons |    .510375    .007722    66.09   0.000     .4952394    .5255105
    ------------------------------------------------------------------------------------
    The logistic regression (same model) produces this:
    HTML Code:
    logit stateorder medicaid_expansion percapita_deaths ideology_diff prop_neighbors div_
    > gov demgov, nolog
    note: div_gov != 0 predicts success perfectly
          div_gov dropped and 5976 obs not used
    
    note: demgov != 0 predicts success perfectly
          demgov dropped and 6474 obs not used
    
    
    Logistic regression                             Number of obs     =     10,458
                                                    LR chi2(4)        =    1106.02
                                                    Prob > chi2       =     0.0000
    Log likelihood = -6684.0625                     Pseudo R2         =     0.0764
    
    ------------------------------------------------------------------------------------
            stateorder |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------------+----------------------------------------------------------------
    medicaid_expansion |   .7738874   .0273959    28.25   0.000     .7201924    .8275824
      percapita_deaths |    136.772   29.96518     4.56   0.000     78.04132    195.5026
         ideology_diff |   .0088932   .0018851     4.72   0.000     .0051984     .012588
        prop_neighbors |  -2.262627   .1110186   -20.38   0.000     -2.48022   -2.045035
               div_gov |          0  (omitted)
                demgov |          0  (omitted)
                 _cons |   .8999345   .0806817    11.15   0.000     .7418012    1.058068
    ------------------------------------------------------------------------------------
    I suspect this happens because of the underlying data structure:

    HTML Code:
     tabulate stateorder div_gov
    
               |  Divided Government
    stateorder |         0          1 |     Total
    -----------+----------------------+----------
             0 |     5,478          0 |     5,478
             1 |    12,948      6,474 |    19,422
    -----------+----------------------+----------
         Total |    18,426      6,474 |    24,900
    HTML Code:
    tabulate stateorder demgov
    
               | Democratic Governor=1
    stateorder |         0          1 |     Total
    -----------+----------------------+----------
             0 |     5,478          0 |     5,478
             1 |     7,470     11,952 |    19,422
    -----------+----------------------+----------
         Total |    12,948     11,952 |    24,900
    Essentially, it seems that whenever a state had a Democratic governor the policy was in effect, while there is variation on the Republican side. So I ran additional logits restricted to demgov==0, first with div_gov==1 and then with div_gov==0. Results are presented below:
    HTML Code:
    logit stateorder medicaid_expansion percapita_deaths ideology_diff prop_neighbors  if
    > demgov==0 & div_gov==1, nolog
    outcome does not vary; remember:
                                      0 = negative outcome,
            all other nonmissing values = positive outcome
    r(2000);
    
    . logit stateorder medicaid_expansion percapita_deaths ideology_diff prop_neighbors  if
    > demgov==0 & div_gov==0, nolog
    
    Logistic regression                             Number of obs     =     10,458
                                                    LR chi2(4)        =    1106.02
                                                    Prob > chi2       =     0.0000
    Log likelihood = -6684.0625                     Pseudo R2         =     0.0764
    
    ------------------------------------------------------------------------------------
            stateorder |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------------+----------------------------------------------------------------
    medicaid_expansion |   .7738874   .0273959    28.25   0.000     .7201924    .8275824
      percapita_deaths |    136.772   29.96518     4.56   0.000     78.04132    195.5026
         ideology_diff |   .0088932   .0018851     4.72   0.000     .0051984     .012588
        prop_neighbors |  -2.262627   .1110186   -20.38   0.000     -2.48022   -2.045035
                 _cons |   .8999345   .0806817    11.15   0.000     .7418012    1.058068
    ------------------------------------------------------------------------------------
    Is there any way that I can make the logit run with both categories of the binary variables on the right-hand side of the model, or is this something better suited to event history analysis? I believe that for EHA I would need to add a duration variable to the data, but I am not sure how to do this. Any advice or suggestions would be appreciated.
    Last edited by Davia Downey; 28 Jul 2021, 07:50. Reason: logit

  • #2
    When there is perfect separation, as here, the maximum likelihood estimate of the logistic regression coefficient for div_gov or demgov would be infinitely large. Since the estimation cannot converge to infinity, Stata looks ahead for such problems and avoids them by removing the offending predictor(s). You can think of this as an extreme version of the situation where, say, demgov was associated with stateorder = 1 in all but one observation, and with stateorder = 0 in that singleton. In that case, the maximum likelihood estimate of the coefficient would not be infinite, but would be very large, and it can be shown that in rare-outcome settings like this the estimates are biased upward in magnitude.
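
    The divergence can be sketched with the score equation. This is an illustration in notation that does not appear in the thread: d_i stands for the offending dummy (demgov or div_gov), x_i for the remaining covariates, and every observation with d_i = 1 is assumed to have y_i = 1, as in the tabulations above.

    ```latex
    \begin{align}
    \ell(\gamma,\beta) &= \sum_i \bigl[\, y_i \log p_i + (1-y_i)\log(1-p_i) \,\bigr],
      \qquad p_i = \frac{1}{1+\exp\{-(x_i'\gamma + \beta d_i)\}} \\
    \frac{\partial \ell}{\partial \beta} &= \sum_i d_i\,(y_i - p_i)
      = \sum_{i:\,d_i=1} (1 - p_i) \;>\; 0 \quad \text{for every finite } \beta
    \end{align}
    ```

    Because the derivative is strictly positive at every finite value of the coefficient, the log likelihood keeps increasing as the coefficient grows, so no finite maximum exists.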

    One solution to this is to use penalized maximum likelihood estimation, which shrinks the estimates to reduce this kind of bias. Joseph Coveney's -firthlogit- program implements this. It is available from SSC.
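
    A minimal sketch of what that could look like with the variables from post #1 follows. It assumes the basic -firthlogit depvar indepvars- syntax and has not been run on these data; check -help firthlogit- after installing for the options it supports.

    ```stata
    * one-time install from SSC
    ssc install firthlogit

    * Firth (penalized likelihood) logit, keeping both problem dummies in the model
    firthlogit stateorder medicaid_expansion percapita_deaths ideology_diff ///
        prop_neighbors div_gov demgov
    ```

    Unlike -logit-, this should report finite coefficients for div_gov and demgov rather than dropping them along with the perfectly predicted observations.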


    • #3
      Thanks for the suggestion Clyde Schechter. I'll take a look. In terms of the second question (i.e., creating a duration variable), do you have any insight? I see there's a process using tsspell as well as one using xtset (mkduration) but when I follow either coding schema I don't get a variable that indicates the number of days with a policy. (I hope this makes sense).


      • #4
        Well, your example data has no time variable, so I'm not sure what you can do.

        I can demonstrate an approach that will work if each observation represents a single day and consecutive observations are consecutive days, with no gaps. I will also assume that your real data set contains a variable, which I'll call state, that identifies the different states.

        Code:
        //  CREATE A SEQUENTIAL DAY VARIABLE
        sort state, stable
        by state: gen int day = _n
        
        //  CREATE A SPELL VARIABLE AND CALCULATE DURATION OF SPELLS
        by state (day), sort: gen int spell = sum(stateorder != stateorder[_n-1])
        by state spell (day), sort: gen duration = _N
        If you already have a date variable in your real data set, then there is no need to create the variable day, and you also would then replace day by the name of that variable in the other commands. Also, if the data are not really consecutive days, and if you have a date variable, the last command should be -by state spell (date), sort: gen duration = date[_N]-date[1] + 1-.
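
        Put together, the date-based variant described above might look like the sketch below. It assumes a numeric Stata daily date variable called date and a state identifier called state; both names are placeholders, and days_in_spell is a hypothetical extra variable that is not part of the original code. It is untested against the real data.

        ```stata
        * spell counter: increases by 1 each time stateorder changes within a state
        by state (date), sort: gen int spell = sum(stateorder != stateorder[_n-1])

        * total length of each spell in calendar days, robust to gaps in the dates
        by state spell (date), sort: gen int duration = date[_N] - date[1] + 1

        * running count of calendar days elapsed so far within the current spell
        by state spell (date): gen int days_in_spell = date - date[1] + 1
        ```

        Here duration is constant within a spell, while days_in_spell grows day by day, which may be closer to a "number of days with the policy so far" measure.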

        Caveat: as your data example contains no variable distinguishing states, and the variable stateorder is constantly 1 in the example, this code has not been properly tested, though I believe it is correct. When showing example data, it is usually best to pick a subset that exhibits the variability in the data more fully.


        • #5
          I have a daily_cases_date variable that tracks another component of the policy adoption process, and I can use that. It is a daily variable (starting January 22, 2020 and ending June 1, 2021), and the state variable is in the data; I just didn't add it to my dataex (sorry about that!). I hope that the variable starting on Jan 22 and ending the following year won't be an issue, and I suspect it won't. Thanks again for your quick response. If I run into issues, I'll be sure to ping you directly.


          • #6
            What about creating a single variable with 3 values: 1 = Dem, 2 = Rep with stateorder = 0, 3 = Rep with stateorder = 1? I usually discourage such hybrids, but since one possible combination doesn't exist in practice, maybe it is not so bad. It seems a lot simpler than going to firthlogit, which has limited post-estimation options.

            Of course, if this is a good idea, I hardly think I would have been the first to come up with it. So maybe it is a bad idea.
            -------------------------------------------
            Richard Williams, Notre Dame Dept of Sociology
            StataNow Version: 19.5 MP (2 processor)

            WWW: https://www3.nd.edu/~rwilliam


            • #7
              Richard Williams Hmm…that’s actually not a terrible idea. Might try this tomorrow. I’ll let you know what happens! Clyde’s solution works and the results are similar to the OLS model so I think the model itself is solid but this might placate reviewers who aren’t familiar with firthlogit.
