perfect prediction in -heckman, select()-

skolenik

Join Date: Mar 2014

Posts: 100
#1

18 Mar 2016, 15:05

Do heckman / heckprob attempt to identify perfect prediction in the selection equation? From the output, it does not seem like they do; logit or probit output would have said something like "blah predicts success perfectly; it is dropped and this many observations not used". However, heckman does not say that.

I think I am running into this issue, as the Heckman model either fails to converge (without the -difficult- option) or produces coefficients like 5 with a standard error of zero for a dummy variable in the selection equation. As normal(-5) is about the same as c(epsfloat), I suspect the maximizer just sends that parameter to a value large enough that the likelihood stops changing, rather than attempting to remove it the way logit or probit do.
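
The order-of-magnitude comparison above can be checked directly. This is only a minimal sketch using built-in Stata functions; the grid of coefficient values 3 to 8 is an arbitrary choice for illustration and is not taken from any fitted model.

```stata
* Lower-tail normal probability at -5: about 2.87e-07
display normal(-5)

* Single-precision (float) epsilon: about 1.19e-07
display c(epsfloat)

* For a perfectly predicted success, the log-likelihood contribution
* ln(normal(b)) keeps creeping toward 0 as the coefficient b grows,
* so there is no finite maximum, only ever smaller improvements.
forvalues b = 3/8 {
    display "b = `b'   ln(normal(b)) = " %12.0g ln(normal(`b'))
}
```

Once the improvement per step falls below the convergence tolerance, the optimizer can stop at whatever large value it has reached, which would be consistent with a coefficient near 5 and a degenerate standard error.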

-- Stas Kolenikov || http://stas.kolenikov.name
-- Principal Survey Scientist, Abt SRBI
-- Opinions stated in this post are mine only
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

18 Mar 2016, 20:40

While the discussion in the Technical Note section of the heckman documentation in the Stata Reference Manual does not directly address your problem, the following quote is suggestive.

The Heckman selection model can be unstable when the model is not properly specified or if a specific dataset simply does not support the model’s assumptions.

The discussion continues with a simulation example in which, using data generated according to a heckman process, the heckman estimation fails to converge properly.

Panos Markou

Join Date: May 2014
Posts: 19

18 Oct 2017, 10:31

If I may hijack this thread, I have a very similar question to that posed by Stas above.

I am using heckprobit. In the first stage (selection equation), I have a categorical variable which perfectly predicts the outcome. This variable is retained in the model, and I obtain coefficients and standard errors. However, when I estimate the selection equation with probit, the categorical variable is (as expected) omitted due to perfect prediction. Why do heckman and heckprobit retain perfect predictors in the first stage?

Interestingly, this thread over at the old Stata forum asks the same question. Indeed, STB-43 here also states:

(STB-43) heckman
heckman now has the capability to estimate models with variables that perfectly predict selection. Previously heckman would simply drop such variables from the selection equation, which is inappropriate in most cases.

Why is dropping variables which perfectly predict selection inappropriate? What is going on behind the scenes?

Example:

In my own research, I am examining the success of pharmaceutical drugs. My selection equation involves regressing IntoDevelopment on DevelopmentStatus. By definition, I have coded all drugs that have progressed past the Discovery stage as 1, since they have entered clinical development. Drugs in the pre-clinical Discovery phase did not necessarily enter development, so there is variation here.

Code:

DevelopmentStatu | IntoDevelopment    
               s |         0          1 |     Total
-----------------+----------------------+----------
        Clinical |         0         63 |        63
       Discovery |     3,820      3,078 |     6,898
Phase 1 Clinical |         0        535 |       535
Phase 2 Clinical |         0        854 |       854
Phase 3 Clinical |         0        387 |       387
Pre-registration |         0        113 |       113
      Registered |         0         95 |        95
-----------------+----------------------+----------
           Total |     3,820      5,125 |     8,945

With heckprobit:

Code:

. heckprobit ... , select(drugIndicationIntoDevelopment = ... ib2.DevStatus ib2016.Year)

**Iterations Output Omitted**

Probit model with sample selection              Number of obs     =      7,555
                                                Censored obs      =      3,779
                                                Uncensored obs    =      3,776

                                                Wald chi2(52)     =     423.84
Log pseudolikelihood = -3955.044                Prob > chi2       =     0.0000

                                               (Std. Err. adjusted for 5,003 clusters in Drug)
--------------------------------------------------------------------------------------------------
                                 |               Robust
                                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------------------------+----------------------------------------------------------------
**Outcome Equation Omitted**
---------------------------------+----------------------------------------------------------------
**Selection Equation:**
**Extra controls omitted**

drugIndicationIntoDevelopment    |

                       DevStatus |
                       Clinical  |   7.636424   .2723167    28.04   0.000     7.102693    8.170155
               Phase 1 Clinical  |    9.34752   .5423626    17.23   0.000     8.284508    10.41053
               Phase 2 Clinical  |   7.617611   .2027719    37.57   0.000     7.220185    8.015037
               Phase 3 Clinical  |   7.460695   .1892162    39.43   0.000     7.089838    7.831552
               Pre-registration  |   7.800735   .2145818    36.35   0.000     7.380162    8.221307
                     Registered  |   7.485286   .1824119    41.04   0.000     7.127765    7.842806
                                 |
                            Year |
                           2003  |   2.953355   .3820132     7.73   0.000     2.204623    3.702087
                           2004  |   3.044034   .3821105     7.97   0.000     2.295111    3.792956
                           2005  |   3.244613   .3805017     8.53   0.000     2.498844    3.990383
                           2006  |   3.403035   .3775828     9.01   0.000     2.662987    4.143084
                           2007  |   3.278851   .3780417     8.67   0.000     2.537903      4.0198
                           2008  |   3.022322   .3774185     8.01   0.000     2.282596    3.762049
                           2009  |   3.185404   .3832624     8.31   0.000     2.434223    3.936584
                           2010  |   3.003247   .3760539     7.99   0.000     2.266195    3.740299
                           2011  |   2.964679   .3764863     7.87   0.000     2.226779    3.702578
                           2012  |   2.594776   .3786905     6.85   0.000     1.852556    3.336996
                           2013  |   2.464523   .3795631     6.49   0.000     1.720593    3.208453
                           2014  |   2.317342   .3823459     6.06   0.000     1.567958    3.066726
                           2015  |   1.944325   .3852784     5.05   0.000     1.189193    2.699457
                                 |
                           _cons |  -3.467664   .4065917    -8.53   0.000    -4.264569   -2.670759
---------------------------------+----------------------------------------------------------------
                         /athrho |  -.4123434   .1905769    -2.16   0.030    -.7858672   -.0388195
---------------------------------+----------------------------------------------------------------
                             rho |  -.3904606   .1615216                     -.6560615      -.0388
--------------------------------------------------------------------------------------------------
Wald test of indep. eqns. (rho = 0): chi2(1) =     4.68   Prob > chi2 = 0.0305

With probit on the exact same selection equation:

Code:

. probit ... i.DevStatus ib2016.Year

note: 1.DevStatus != 0 predicts success perfectly
      1.DevStatus dropped and 44 obs not used

note: 2.DevStatus != 1 predicts success perfectly
      2.DevStatus dropped and 1833 obs not used

note: 3.DevStatus omitted because of collinearity
note: 4.DevStatus omitted because of collinearity
note: 5.DevStatus omitted because of collinearity
note: 6.DevStatus omitted because of collinearity
note: 7.DevStatus omitted because of collinearity
Iteration 0:   log pseudolikelihood = -5287.0055  
Iteration 1:   log pseudolikelihood = -4413.1199  
Iteration 2:   log pseudolikelihood = -4402.1526  
Iteration 3:   log pseudolikelihood = -4402.1196  
Iteration 4:   log pseudolikelihood = -4402.1196  

Probit regression                               Number of obs     =      7,628
                                                Wald chi2(49)     =     979.95
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -4402.1196               Pseudo R2         =     0.1674

                                               (Std. Err. adjusted for 5,007 clusters in Drug)
--------------------------------------------------------------------------------------------------
                                 |               Robust
   drugIndicationIntoDevelopment |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------------------------+----------------------------------------------------------------
**Extra controls omitted**

                       DevStatus |
                       Clinical  |          0  (empty)
                      Discovery  |          0  (omitted)
               Phase 1 Clinical  |          0  (empty)
               Phase 2 Clinical  |          0  (empty)
               Phase 3 Clinical  |          0  (empty)
               Pre-registration  |          0  (empty)
                     Registered  |          0  (empty)
                                 |
                            Year |
                           2003  |   .5720061   .1156278     4.95   0.000     .3453799    .7986324
                           2004  |   .6048604   .1187959     5.09   0.000     .3720247    .8376961
                           2005  |   .8984012   .1137214     7.90   0.000     .6755113    1.121291
                           2006  |   .9815531   .1058797     9.27   0.000     .7740326    1.189073
                           2007  |   .9060379   .1045031     8.67   0.000     .7012156     1.11086
                           2008  |   .6564245   .1040019     6.31   0.000     .4525845    .8602645
                           2009  |   .8369335   .1161109     7.21   0.000     .6093603    1.064507
                           2010  |   .6413842   .0970183     6.61   0.000     .4512319    .8315365
                           2011  |   .6870169   .0992929     6.92   0.000     .4924064    .8816275
                           2012  |   .4555852   .0986859     4.62   0.000     .2621645     .649006
                           2013  |   .5481051   .0980139     5.59   0.000     .3560015    .7402087
                           2014  |   .6183751   .0982493     6.29   0.000       .42581    .8109401
                           2015  |   .5577375   .0983056     5.67   0.000      .365062    .7504129
                                 |
                           _cons |  -1.101107   .1592578    -6.91   0.000    -1.413246   -.7889672
--------------------------------------------------------------------------------------------------
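
For anyone who wants to reproduce the contrast above without the original data, here is a minimal simulated sketch. Everything in it is an assumption made for illustration: the variable names (x, z, d, sel, y), the seed, the sample size, and the data-generating process are hypothetical and unrelated to the drug data. The dummy d perfectly predicts selection, so probit should print its perfect-prediction note and drop d, while heckman is expected to keep d in select(). Convergence of heckman is not guaranteed in this setup, hence the capture noisily and iterate() safeguards.

```stata
clear
set seed 2016
set obs 5000

* Hypothetical covariates
generate x = rnormal()
generate z = rnormal()

* Hypothetical dummy: every observation with d == 1 is selected
generate byte d = runiform() < 0.3

* Correlated errors (correlation 0.5) for the selection and outcome equations
generate u = rnormal()
generate e = 0.5*u + sqrt(0.75)*rnormal()

* Selection indicator, and an outcome observed only when selected
generate byte sel = (d == 1) | (0.5*z + u > 0)
generate y = 1 + x + e if sel

* The d == 1 row has no unselected observations
tabulate d sel

* probit reports perfect prediction and drops d with its observations
probit sel z d

* heckman keeps d in the selection equation; cap the iterations
capture noisily heckman y x, select(sel = z d) iterate(50)
```

Comparing the number of observations reported by the two commands shows the practical difference: the observations that probit discards still carry outcome data that heckman uses.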
