Event History Analysis vs. Logit

Davia Downey

Join Date: Jul 2017
Posts: 131

28 Jul 2021, 07:50

I have a conundrum regarding a policy analysis I'm running. I have a variable stateorder that is coded 0 when the policy is not in effect, 1 once it goes into effect, and 0 again once it is rescinded. I ran a logit on the data yesterday and received a message indicating that I might have complete separation or quasi-separation in the data, which I'm not sure how to fix. Here's a data sample:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(stateorder medicaid_expansion demgov) float(div_gov percapita_deaths ideology_diff)
1 0 0 0  .0002178176 -13.41059
1 0 0 0 .00022515976 -13.41059
1 0 0 0 .00022719926 -13.41059
1 0 0 0  .0002286269 -13.41059
1 0 0 0 .00022923875 -13.41059
1 0 0 0  .0002373967 -13.41059
1 0 0 0 .00024698232 -13.41059
1 0 0 0 .00025085735 -13.41059
1 0 0 0 .00025799556 -13.41059
1 0 0 0  .0002622785 -13.41059
1 0 0 0 .00026248244 -13.41059
1 0 0 0 .00026329825 -13.41059
1 0 0 0 .00026574562 -13.41059
1 0 0 0 .00027818652 -13.41059
1 0 0 0 .00028491684 -13.41059
1 0 0 0 .00029327875 -13.41059
1 0 0 0 .00029694985 -13.41059
1 0 0 0   .000300417 -13.41059
1 0 0 0 .00030408805 -13.41059
1 0 0 0 .00030408805 -13.41059
1 0 0 0  .0003136737 -13.41059
1 0 0 0  .0003191803 -13.41059
1 0 0 0  .0003222395 -13.41059
1 0 0 0  .0003269303 -13.41059
1 0 0 0  .0003318251 -13.41059
1 0 0 0  .0003330488 -13.41059
1 0 0 0  .0003397791 -13.41059
1 0 0 0  .0003456937 -13.41059
1 0 0 0  .0003495687 -13.41059
1 0 0 0  .0003538516 -13.41059
1 0 0 0  .0003579306 -13.41059
1 0 0 0  .0003605819 -13.41059
1 0 0 0  .0003664965 -13.41059
1 0 0 0  .0003766939 -13.41059
1 0 0 0  .0003838321 -13.41059
1 0 0 0  .0003854637 -13.41059
1 0 0 0  .0003860756 -13.41059
1 0 0 0  .0003866874 -13.41059
1 0 0 0  .0003870953 -13.41059
1 0 0 0  .0003926019 -13.41059
1 0 0 0  .0003948454 -13.41059
1 0 0 0   .000396477 -13.41059
1 0 0 0  .0004025955 -13.41059
1 0 0 0 .00040708235 -13.41059
1 0 0 0  .0004101416 -13.41059
1 0 0 0  .0004105495 -13.41059
1 0 0 0  .0004127929 -13.41059
1 0 0 0  .0004154443 -13.41059
1 0 0 0 .00041707585 -13.41059
1 0 0 0 .00042339825 -13.41059
1 0 0 0  .0004297207 -13.41059
1 0 0 0  .0004388984 -13.41059
1 0 0 0  .0004409379 -13.41059
1 0 0 0  .0004450169 -13.41059
1 0 0 0  .0004486879 -13.41059
1 0 0 0  .0004521551 -13.41059
1 0 0 0 .00045541825 -13.41059
1 0 0 0  .0004621486 -13.41059
1 0 0 0  .0004639841 -13.41059
1 0 0 0  .0004641881 -13.41059
1 0 0 0  .0004641881 -13.41059
end

------------------ copy up to and including the previous line ------------------

I begin with a regression model which works just fine (however I understand we have potential issues with linearity, etc.):

HTML Code:

reg stateorder medicaid_expansion percapita_deaths ideology_diff prop_neighbors div_go
> v demgov

      Source |       SS           df       MS      Number of obs   =    22,908
-------------+----------------------------------   F(6, 22901)     =   2075.11
       Model |  1372.63861         6  228.773102   Prob > F        =    0.0000
    Residual |  2524.75269    22,901  .110246395   R-squared       =    0.3522
-------------+----------------------------------   Adj R-squared   =    0.3520
       Total |   3897.3913    22,907  .170139752   Root MSE        =    .33203

------------------------------------------------------------------------------------
        stateorder |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
medicaid_expansion |   .1776538   .0035522    50.01   0.000     .1706913    .1846162
  percapita_deaths |   20.83641   3.134149     6.65   0.000     14.69326    26.97955
     ideology_diff |  -.0006289   .0000425   -14.78   0.000    -.0007123   -.0005455
    prop_neighbors |  -.2364633   .0120137   -19.68   0.000    -.2600109   -.2129157
           div_gov |   .1939881   .0053275    36.41   0.000      .183546    .2044303
            demgov |   .2594809   .0049884    52.02   0.000     .2497034    .2692584
             _cons |    .510375    .007722    66.09   0.000     .4952394    .5255105
------------------------------------------------------------------------------------

The logistic regression (same model) produces this:

HTML Code:

logit stateorder medicaid_expansion percapita_deaths ideology_diff prop_neighbors div_
> gov demgov, nolog
note: div_gov != 0 predicts success perfectly
      div_gov dropped and 5976 obs not used

note: demgov != 0 predicts success perfectly
      demgov dropped and 6474 obs not used


Logistic regression                             Number of obs     =     10,458
                                                LR chi2(4)        =    1106.02
                                                Prob > chi2       =     0.0000
Log likelihood = -6684.0625                     Pseudo R2         =     0.0764

------------------------------------------------------------------------------------
        stateorder |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
medicaid_expansion |   .7738874   .0273959    28.25   0.000     .7201924    .8275824
  percapita_deaths |    136.772   29.96518     4.56   0.000     78.04132    195.5026
     ideology_diff |   .0088932   .0018851     4.72   0.000     .0051984     .012588
    prop_neighbors |  -2.262627   .1110186   -20.38   0.000     -2.48022   -2.045035
           div_gov |          0  (omitted)
            demgov |          0  (omitted)
             _cons |   .8999345   .0806817    11.15   0.000     .7418012    1.058068
------------------------------------------------------------------------------------

I suspect this happens because of the underlying data structure:

HTML Code:

 tabulate stateorder div_gov

           |  Divided Government
stateorder |         0          1 |     Total
-----------+----------------------+----------
         0 |     5,478          0 |     5,478
         1 |    12,948      6,474 |    19,422
-----------+----------------------+----------
     Total |    18,426      6,474 |    24,900

HTML Code:

tabulate stateorder demgov

           | Democratic Governor=1
stateorder |         0          1 |     Total
-----------+----------------------+----------
         0 |     5,478          0 |     5,478
         1 |     7,470     11,952 |    19,422
-----------+----------------------+----------
     Total |    12,948     11,952 |    24,900

Essentially, in this case, it would seem that having a Democratic Governor means state policies were enacted, but there is variation on the Republican side. So I ran additional logits restricted to demgov==0, first with div_gov==1 and then with div_gov==0. Results are presented below:

HTML Code:

logit stateorder medicaid_expansion percapita_deaths ideology_diff prop_neighbors  if
> demgov==0 & div_gov==1, nolog
outcome does not vary; remember:
                                  0 = negative outcome,
        all other nonmissing values = positive outcome
r(2000);

. logit stateorder medicaid_expansion percapita_deaths ideology_diff prop_neighbors  if
> demgov==0 & div_gov==0, nolog

Logistic regression                             Number of obs     =     10,458
                                                LR chi2(4)        =    1106.02
                                                Prob > chi2       =     0.0000
Log likelihood = -6684.0625                     Pseudo R2         =     0.0764

------------------------------------------------------------------------------------
        stateorder |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
medicaid_expansion |   .7738874   .0273959    28.25   0.000     .7201924    .8275824
  percapita_deaths |    136.772   29.96518     4.56   0.000     78.04132    195.5026
     ideology_diff |   .0088932   .0018851     4.72   0.000     .0051984     .012588
    prop_neighbors |  -2.262627   .1110186   -20.38   0.000     -2.48022   -2.045035
             _cons |   .8999345   .0806817    11.15   0.000     .7418012    1.058068
------------------------------------------------------------------------------------

Is there any way that I can make the logit run with both categorical variables on the right-hand side of the model, or is this something better suited for event history analysis? I believe for EHA I would need to add a duration variable somewhere in the data, but I am not sure how to do this. Any advice or suggestions would be appreciated.

Last edited by Davia Downey; 28 Jul 2021, 07:50. Reason: logit

Tags: data, event history analysis, logit

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

28 Jul 2021, 10:51

When there is perfect separation, as here, the maximum likelihood estimate of the logistic regression coefficient for div_gov or demgov would be infinitely large. Since the estimation cannot converge to infinity, Stata looks ahead for such problems and avoids them by removing the offending predictor(s), along with the observations they predict perfectly. You can think of this as an extreme version of the situation where, say, demgov was associated with stateorder = 1 in all but one observation, and with stateorder = 0 in that singleton. In this case, the maximum likelihood estimate of the coefficient would not be infinite, but would be very large, and it can be shown that in rare-outcome settings like this the estimates are biased upward in magnitude.
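To make the contrast concrete, here is a minimal sketch on simulated data (the variables x and y, the seed, and the sample size are made up for illustration and are not taken from the poster's data set). It first fits a logit under perfect separation, then flips a single observation to create the "singleton" case described above.

```stata
* Hypothetical toy data: x == 1 always goes with y == 1
clear
set seed 2021
set obs 500
gen byte x = mod(_n, 2)                         // alternating 1/0 predictor
gen byte y = cond(x == 1, 1, runiform() < 0.5)  // y varies only when x == 0

* Perfect separation: Stata notes that x predicts success perfectly,
* drops x, and does not use the 250 observations with x == 1
logit y x, nolog

* Singleton case: a single x == 1 observation now has y == 0
replace y = 0 in 1                              // observation 1 has x == 1
logit y x, nolog

* The estimate is now finite but large: with one binary predictor it is
* simply the log odds ratio between the x == 1 and x == 0 groups
display "coefficient on x = " _b[x]
```

With 249 successes and 1 failure in the x == 1 group, the log odds there are ln(249), so the coefficient comes out well above 5 on the logit scale even though the model now converges.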

One solution to this is to use penalized maximum likelihood estimation, which shrinks the estimates to reduce this kind of bias. Joseph Coveney's -firthlogit- program implements this. It is available from SSC.
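A minimal sketch of what that looks like, again on simulated data rather than the poster's (x, z, y and the seed are hypothetical). It assumes -firthlogit- can be installed from SSC on the machine running it.

```stata
* Install firthlogit from SSC if it is not already available
capture which firthlogit
if _rc ssc install firthlogit

* Hypothetical toy data: x == 1 predicts y == 1 perfectly
clear
set seed 12345
set obs 200
gen byte x = runiform() < 0.4
gen double z = rnormal()
gen byte y = cond(x == 1, 1, runiform() < invlogit(0.5*z))

* Ordinary maximum likelihood drops x and the observations it predicts
logit y x z, nolog

* Penalized likelihood keeps x and all observations, and returns a
* finite coefficient for x
firthlogit y x z
display "penalized coefficient on x = " _b[x]
```

In the poster's model the equivalent call would simply swap -logit- for -firthlogit- with the same variable list, including div_gov and demgov.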
Davia Downey

Join Date: Jul 2017

Posts: 131
#3

28 Jul 2021, 12:00

Thanks for the suggestion Clyde Schechter. I'll take a look. In terms of the second question (i.e., creating a duration variable), do you have any insight? I see there's a process using tsspell as well as one using xtset (mkduration) but when I follow either coding schema I don't get a variable that indicates the number of days with a policy. (I hope this makes sense).
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#4

28 Jul 2021, 13:40

Well, your example data has no time variable, so I'm not sure what you can do.

I can demonstrate an approach that will work if each observation represents a single day and consecutive observations are consecutive days, with no gaps. I will also assume that your real data set contains a variable, which I'll call state, that identifies the different states.

Code:

// CREATE A SEQUENTIAL DAY VARIABLE
sort state, stable
by state: gen int day = _n

// CREATE A SPELL VARIABLE AND CALCULATE DURATION OF SPELLS
by state (day), sort: gen int spell = sum(stateorder != stateorder[_n-1])
by state spell (day), sort: gen duration = _N

If you already have a date variable in your real data set, then there is no need to create the variable day, and you also would then replace day by the name of that variable in the other commands. Also, if the data are not really consecutive days, and if you have a date variable, the last command should be -by state spell (date), sort: gen duration = date[_N]-date[1] + 1-.

Caveat: as your data example contains no variable distinguishing states, and the variable stateorder is constantly 1 in the example, this code has not been properly tested, though I believe it is correct. When showing example data, it is usually best to pick a subset of the data that exhibits the variability in the data more fully.
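For anyone who wants to check the spell logic, here is a small sketch on a made-up panel (the state codes, dates, and policy values are invented for illustration). It uses the date-based version of the duration command, so it also works when there are gaps between observations, and it adds a running count of days within each spell (days_in_spell, a hypothetical name), which may be closer to "number of days with a policy".

```stata
* Toy panel (hypothetical values): two states, daily observations
clear
input str2 state str10 sdate byte stateorder
"AA" "2020-03-01" 0
"AA" "2020-03-02" 1
"AA" "2020-03-03" 1
"AA" "2020-03-04" 1
"AA" "2020-03-05" 0
"BB" "2020-03-01" 1
"BB" "2020-03-02" 1
"BB" "2020-03-03" 0
end
gen date = daily(sdate, "YMD")
format date %td

* A new spell starts whenever stateorder changes within a state
* (stateorder[_n-1] is missing on each state's first day, so spell starts at 1)
by state (date), sort: gen int spell = sum(stateorder != stateorder[_n-1])

* Total length of each spell in days, from its first to its last date
by state spell (date), sort: gen int duration = date[_N] - date[1] + 1

* Days elapsed so far within the current spell
by state spell (date): gen int days_in_spell = date - date[1] + 1

list state date stateorder spell duration days_in_spell, sepby(state spell)
```

In this toy panel the policy spell in state AA lasts 3 days and the one in state BB lasts 2 days.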
Davia Downey

Join Date: Jul 2017

Posts: 131
#5

28 Jul 2021, 14:16

I have a daily_cases_date variable, which tracks another component of the policy adoption process, that I can use. It is essentially a daily date variable (starting January 22 2020 and ending June 1 2021). The state variable is in the data as well; I just didn't add it to my dataex (sorry about that!). I hope the fact that this variable starts on Jan 22 and ends the following year won't be an issue, and I suspect it won't. Thanks again for your quick response. If I run into issues, I'll be sure to ping you directly.
Richard Williams

Join Date: Apr 2014

Posts: 5008
#6

28 Jul 2021, 16:44

What about creating a single variable with 3 values: 1 = Dem, 2 = Rep with stateorder = 0, 3 = Rep with stateorder = 1? I usually discourage such hybrids, but since one possible combo doesn't exist in practice, maybe it is not so bad. It seems a lot simpler than going to firthlogit, which has limited post-estimation options.

Of course, if this is a good idea, I hardly think I would have been the first to come up with it. So maybe it is a bad idea.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Davia Downey

Join Date: Jul 2017

Posts: 131
#7

28 Jul 2021, 20:41

Richard Williams Hmm…that’s actually not a terrible idea. Might try this tomorrow. I’ll let you know what happens! Clyde’s solution works and the results are similar to the OLS model so I think the model itself is solid but this might placate reviewers who aren’t familiar with firthlogit.