Problem with Heckman model: Dependent variable never censored because of selection

Giorgio Piccitto

Join Date: Oct 2016
Posts: 238


07 Jul 2022, 05:00

Dear all, I have run into a problem that is strange (to me) when trying to apply the Heckman model to my data. I have a dependent variable (job satisfaction) that is observed only among those who are employed (occupato = 1); otherwise it is missing. Since I want to compare women's and men's job satisfaction, I would like to correct for their different likelihood of being in employment, using marital status (married vs. not married) as the instrumental variable.

When I run

Code:

heckman job_sat i.job_1dgt, select(occupato = i.married i.job_1dgt)

Stata gives me the message "Dependent variable never censored because of selection", but why? Did I make a mistake in preparing the dataset?

My data follow:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(job_1dgt job_sat) float(occupato married)
.  . 0 0
.  . 0 0
3  2 1 0
2 10 1 1
6  9 1 0
8  6 1 1
5 10 1 0
3  8 1 0
5  6 1 0
5 10 1 0
7  9 1 1
4  9 1 0
.  . 0 0
5 10 1 0
2 10 1 0
6  6 1 0
8  7 1 0
.  . 0 0
.  . 0 1
4  8 1 0
end
label values job_1dgt prof1
label def prof1 2 "professioni intellettuali, scientifiche e di elevata special", modify
label def prof1 3 "professioni tecniche", modify
label def prof1 4 "professioni esecutive nel lavoro d'ufficio", modify
label def prof1 5 "professioni qualificate nelle attività commerciali e nei se", modify
label def prof1 6 "artigiani, operai specializzati e agricoltori", modify
label def prof1 7 "conduttori di impianti, operai di macchinari fissi e mobili", modify
label def prof1 8 "professioni non qualificate", modify
label values job_sat c73
label def c73 10 "completamente soddisfatto", modify

I am really in trouble! Thanks a lot in advance for your help.

Best, G.P.


Andrew Musau

Join Date: Oct 2014

Posts: 10298
#2

07 Jul 2022, 11:08

You need variation in your independent variables, otherwise how are you modeling the censoring process?
Giorgio Piccitto

Join Date: Oct 2016

Posts: 238
#3

07 Jul 2022, 11:14

Dear Andrew, thanks for your answer. I do not understand what you mean when you say that I need variation: my dependent variable has variation, but of course it is observed only for employed people and not for people outside employment. Could you be clearer, please?
Andrew Musau

Join Date: Oct 2014

Posts: 10298
#4

07 Jul 2022, 11:25

As I said, the issue is with your independent variables. In selection models, your outcome is 0/continuous or missing/continuous. Therefore, you first model the censoring process (what determines selection to the 0 category) and then use the information that you obtain to model the continuous process. There is no variation in your independent variables to accomplish the former.
Giorgio Piccitto

Join Date: Oct 2016

Posts: 238
#5

07 Jul 2022, 11:35

Sorry, but I do not see the point. Do you mean the independent variables in the selection equation? I have variation in them, don't I? So what should I fix?

Sorry but I am quite new to these models and I really do not see what I am missing.

Thanks, Giorgio
Andrew Musau

Join Date: Oct 2014

Posts: 10298
#6

07 Jul 2022, 11:43

In words, job_1dgt is always missing whenever the selection indicator is zero. As a result of listwise deletion, the remaining sample consists only of the observations for which job_sat is continuous, and you cannot model the censoring process.
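
One way to see this in the data is to count how many non-selected observations survive listwise deletion on the selection-equation covariates. A minimal sketch using the variable names from post #1 (miss_job is a hypothetical helper variable introduced here):

```stata
* Non-selected observations with complete selection-equation covariates
count if occupato == 0 & !missing(married, job_1dgt)

* Cross-tabulate selection status against missingness of job_1dgt
generate byte miss_job = missing(job_1dgt)
tabulate occupato miss_job
```

With the example data in post #1 the count is 0: every row with occupato == 0 has job_1dgt missing, so no censored observations remain in the estimation sample.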
Giorgio Piccitto

Join Date: Oct 2016

Posts: 238
#7

07 Jul 2022, 20:16

Dear Andrew, thank you for the explanation, now it is totally clear.

But at this point I have one question: is there any way to study the effect of some work-related variables (such as the job characteristics in my earlier example) on job satisfaction while controlling for selection into employment? In other words, can I include covariates in the second equation that are systematically missing for individuals with value 0 in the selection equation?

Thanks a lot
Andrew Musau

Join Date: Oct 2014

Posts: 10298
#8

08 Jul 2022, 12:50

Originally posted by Giorgio Piccitto

In other words, can I include covariates in the second equation that are systematically missing for individuals with value 0 in the selection equation?

No - not if they coincide with the outcome - as they will eliminate the observations corresponding to the non-selected sample due to listwise deletion. The maximum likelihood estimator does the estimation in a single step, but you can do it in two stages. The first is a probit where the selection indicator is the outcome - and if your independent variables are available only for the positive category, you cannot estimate the model. The second stage is an augmented regression for the continuous sample (including inverse Mills ratios from stage 1).
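
For reference, the two stages described above can be written out as follows. This is the standard textbook form of the two-step estimator rather than something stated in the post; the notation is assumed here, with s the selection indicator, z the selection covariates, x the outcome covariates, and λ the inverse Mills ratio.

```latex
\begin{align}
\Pr(s = 1 \mid z) &= \Phi(z\gamma) \\
\lambda(z\gamma) &= \frac{\phi(z\gamma)}{\Phi(z\gamma)} \\
E[y \mid x, z, s = 1] &= x\beta + \rho\sigma\,\lambda(z\gamma)
\end{align}
```

The first line is estimated on the full sample, so every variable in z must be observed for both s = 0 and s = 1. A covariate that exists only for the selected cases can therefore not enter z.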

Last edited by Andrew Musau; 08 Jul 2022, 12:53.
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2207
#9

09 Jul 2022, 21:34

There is a way to solve this, but you require an instrumental variable for job_1dgt, and if you're putting in separate dummies for each outcome (except the base) it gets even harder. I discuss this in my 2010 MIT Press book. See Procedure 19.2. The idea is that y2, the missing explanatory variable, cannot be included in the probit, and so you have to instrument for it.
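
The following is only a rough sketch of the general idea in this post (instrument for the regressor that is missing for the non-selected cases, and add the inverse Mills ratio to the outcome equation), run on simulated data. It is not taken from the book and should be checked against it. All variable names, coefficients, and the data-generating process are invented for illustration, y2 is continuous here rather than a set of dummies like i.job_1dgt, and the standard errors reported by ivregress ignore the first-stage probit.

```stata
clear
set seed 12345
set obs 5000

* Simulated data (all hypothetical): x1 is an exogenous control,
* z1 and z2 are exogenous variables excluded from the outcome equation
generate x1 = rnormal()
generate z1 = rnormal()
generate z2 = rnormal()
generate u  = rnormal()    // common shock: creates selection bias and endogeneity of y2
generate byte s = (0.5*x1 + z1 + 0.5*z2 + u + rnormal() > 0)
generate y2 = x1 + 0.5*z1 + z2 + u + rnormal()
generate y1 = 1 + x1 + 2*y2 + u + rnormal()

* y1 and y2 are observed only for the selected cases
replace y1 = . if s == 0
replace y2 = . if s == 0

* Step 1: selection probit on variables observed for everyone (y2 is left out)
probit s x1 z1 z2
predict double zg, xb
generate double imr = normalden(zg)/normal(zg)

* Step 2: 2SLS on the selected sample, instrumenting y2 and including imr
ivregress 2sls y1 x1 imr (y2 = z1 z2) if s == 1
```

As with the manual two-step estimator, valid inference would require resampling both steps together, for example with bootstrap.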
Giorgio Piccitto

Join Date: Oct 2016

Posts: 238
#10

10 Jul 2022, 02:24

Dear Andrew and Jeff,

thanks a lot for your answers.

So Andrew, if I understood correctly, you suggest doing the procedure 'manually' with the two-step method: first estimate a selection equation with occupation as Y, including only variables that are observed for both values of occupation (0 and 1, so basically variables that do not refer to a job characteristic) plus the variable that I choose to meet the exclusion restriction; then save the inverse Mills ratio; and in the final equation add all the variables I had in the selection equation (except for the exclusion restriction variable) plus other variables referring to job characteristics. Did I get your point correctly?

Jeff, thanks a lot for your tip, I'll go and read the procedure you are suggesting.

Many thanks to both of you.

Best, Giorgio

Andrew Musau

Join Date: Oct 2014
Posts: 10298

#11

10 Jul 2022, 06:45

Originally posted by Giorgio Piccitto

So Andrew, if I understood correctly, you suggest doing the procedure 'manually' with the two-step method: first estimate a selection equation with occupation as Y, including only variables that are observed for both values of occupation (0 and 1, so basically variables that do not refer to a job characteristic) plus the variable that I choose to meet the exclusion restriction; then save the inverse Mills ratio; and in the final equation add all the variables I had in the selection equation (except for the exclusion restriction variable) plus other variables referring to job characteristics. Did I get your point correctly?

Yes, but you do not need to do this manually as you can ask for the two-step consistent estimator in place of MLE. In case you have to do it manually, you need to correct the standard errors in the second stage, e.g., by bootstrapping.

Code:

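* Load the example data and flag observations with an observed wage
webuse womenwk
g employed= !missing(wage)
* Built-in two-step consistent estimator
heckman wage educ age, select(employed=married children educ age) twostep

*BY HAND
* Stage 1: selection probit, then the inverse Mills ratio from the linear index
probit employed married children educ age
predict employedhat, xb
gen invMillsratio = normalden(employedhat)/normal(employedhat)

* Wrap both stages in a program so that bootstrap re-estimates the probit in every replication
capture program drop se_correction
program define se_correction, eclass
tempname holding
cap drop employedhat invMillsratio
probit employed married children educ age
predict employedhat, xb
gen invMillsratio = normalden(employedhat)/normal(employedhat)
* Stage 2: wage regression augmented with the inverse Mills ratio
regress wage educ age invMillsratio
matrix `holding'=e(b)
ereturn post `holding'
ereturn local cmd="bootstrap"
end

* Bootstrap standard errors for the second-stage coefficients
bootstrap _b, reps(1000) nowarn nodots: se_correction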

Results:

Code:

. heckman wage educ age, select(employed=married children educ age) twostep

Heckman selection model -- two-step estimates   Number of obs     =      2,000
(regression model with sample selection)              Selected    =      1,343
                                                      Nonselected =        657

                                                Wald chi2(2)      =     442.54
                                                Prob > chi2       =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
wage         |
   education |   .9825259   .0538821    18.23   0.000     .8769189    1.088133
         age |   .2118695   .0220511     9.61   0.000     .1686502    .2550888
       _cons |   .7340391   1.248331     0.59   0.557    -1.712645    3.180723
-------------+----------------------------------------------------------------
employed     |
     married |   .4308575    .074208     5.81   0.000     .2854125    .5763025
    children |   .4473249   .0287417    15.56   0.000     .3909922    .5036576
   education |   .0583645   .0109742     5.32   0.000     .0368555    .0798735
         age |   .0347211   .0042293     8.21   0.000     .0264318    .0430105
       _cons |  -2.467365   .1925635   -12.81   0.000    -2.844782   -2.089948
-------------+----------------------------------------------------------------
/mills       |
      lambda |   4.001615   .6065388     6.60   0.000     2.812821     5.19041
-------------+----------------------------------------------------------------
         rho |    0.67284
       sigma |  5.9473529
------------------------------------------------------------------------------

. 
. 
. 
. *BY HAND

. 
. probit employed married children educ age

Iteration 0:   log likelihood = -1266.2225  
Iteration 1:   log likelihood = -1031.4962  
Iteration 2:   log likelihood = -1027.0625  
Iteration 3:   log likelihood = -1027.0616  
Iteration 4:   log likelihood = -1027.0616  

Probit regression                               Number of obs     =      2,000
                                                LR chi2(4)        =     478.32
                                                Prob > chi2       =     0.0000
Log likelihood = -1027.0616                     Pseudo R2         =     0.1889

------------------------------------------------------------------------------
    employed |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     married |   .4308575    .074208     5.81   0.000     .2854125    .5763025
    children |   .4473249   .0287417    15.56   0.000     .3909922    .5036576
   education |   .0583645   .0109742     5.32   0.000     .0368555    .0798735
         age |   .0347211   .0042293     8.21   0.000     .0264318    .0430105
       _cons |  -2.467365   .1925635   -12.81   0.000    -2.844782   -2.089948
------------------------------------------------------------------------------

. 
. bootstrap _b, reps(1000) nowarn nodots: se_correction

Bootstrap results                               Number of obs     =      2,000
                                                Replications      =      1,000

-------------------------------------------------------------------------------
              |   Observed   Bootstrap                         Normal-based
              |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
    education |   .9825259   .0525306    18.70   0.000     .8795678    1.085484
          age |   .2118695   .0226484     9.35   0.000     .1674794    .2562596
invMillsratio |   4.001616   .6006486     6.66   0.000     2.824366    5.178865
        _cons |   .7340391   1.229088     0.60   0.550    -1.674929    3.143008
-------------------------------------------------------------------------------

.


Giorgio Piccitto

Join Date: Oct 2016

Posts: 238
#12

28 Jul 2022, 04:33

Dear Andrew, I was trying to adopt your procedure to estimate the Heckman model, but there is something I do not get: after running the two-step Heckman, I tried margins VAR, predict(ycond), but Stata replies that predict option yexpected not appropriate with margins.

Is this related to the twostep specification? How can I ask for margins in the proper way?

Thanks a lot, Giorgio
Andrew Musau

Join Date: Oct 2014

Posts: 10298
#13

01 Aug 2022, 13:35

See

Code:

help heckman postestimation

You do not need margins; use the predict command instead.

Code:

predict ycond, ycond
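
As an illustration of the advice above, here is a minimal sketch on the womenwk example from post #11. The new variable names (wage_ycond, p_employed, imr) are hypothetical; the statistics requested are among those listed in help heckman postestimation.

```stata
webuse womenwk, clear
generate byte employed = !missing(wage)
heckman wage educ age, select(employed = married children educ age) twostep

* Expected wage conditional on being selected (includes the selection term)
predict wage_ycond, ycond
* Probability of selection from the selection equation
predict p_employed, psel
* Inverse Mills ratio (nonselection hazard)
predict imr, mills

* Average the predictions within groups instead of calling margins
summarize wage_ycond p_employed imr
tabstat wage_ycond, by(married) statistics(mean n)
```

Averaging the predictions within groups gives point estimates only; unlike margins, it does not produce standard errors.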