Heckman selection model with Blinder-Oaxaca Decomposition

Will Murphy

Join Date: Feb 2020

Posts: 52
#1

03 Apr 2020, 09:05

I am trying to decompose the log wage gap between non-disabled (DISTYPE = 1) and work-limited disabled (DISTYPE = 2) males into explained and unexplained components, using a Blinder-Oaxaca decomposition that accounts for those who are unemployed (GRSSWK = 0) via a Heckman selection correction. I have looked at online resources, including Ben Jann's, but none of the alterations I have tried has allowed me to run this.

I created dummy variables for each value of each categorical variable and then tried applying this (simplified) command:

Code:

oaxaca logGRSSWK1 i.WHITE i.dAGES11 i.dAGES12 i.dAGES13 i.dRES1 i.dRES2 i.dRES3 i.dREGWKR1 i.dREGWKR2 i.dIND1 i.dIND1 if MALE == 1 & inlist(DISTYPE,1,2) model2(heckman, twostep select(lfp = i.WHITE i.dAGES11 i.dAGES12 i.dAGES13 i.dRES1 i.dRES2 i.dRES3 i.MARRIAGE1 i.MARRIAGE2)) pooled

(To note: some dummy variables included in the wage equation are 0 for a categorical variable, and the lfp equation excludes any industry variables.) The command returns 'option by() required'; however, I am not convinced that Stata will run this even once that problem is addressed.

So I was wondering whether anyone could clarify if I am anywhere near the correct code for this? Many thanks.
Tags: categorical, dummy variable, heckman, probit, regression
Will Murphy

Join Date: Feb 2020

Posts: 52
#2

03 Apr 2020, 10:45

Just to clarify, I have been able to run an uncorrected decomposition. For the example above,

Code:

xi: oaxaca logGRSSWK1 i.WHITE i.dAGES11 i.dAGES12 i.dAGES13 i.dRES1 i.dRES2 i.dRES3 i.dREGWKR1 i.dREGWKR2 i.dIND1 i.dIND1 if MALE == 1 & inlist(DISTYPE,1,2), by(DISTYPE) pooled relax

However, I cannot work out how to run the corrected decomposition via Heckman (whereby a labour force participation (lfp) equation is modelled).
Any help would be greatly appreciated regarding its integration into the above code, thank you.
Will Murphy

Join Date: Feb 2020

Posts: 52
#3

04 Apr 2020, 06:14

Following from #2, another question regards time. All my variables (except ethnicity and gender) have 1 or 5 at the end to detail quarter 1 or quarter 5. I ran the code in #2 (that included 'pooled') after dropping observations in quarter 5.

Am I best to keep the quarter 5 observations, reshape my data and specify 'by(quarter)' if I am assessing how the decomposition changes over time using 'pooled'? And, if so, how would I do this if I have already specified 'by(DISTYPE1)' in #2?

Or can I just do Q1 and Q5 separately, using 'pooled' both times, and compare the wage decompositions (especially the explained and unexplained components)?

If anybody could help with #2 and this, it would be greatly appreciated.

Last edited by Will Murphy; 04 Apr 2020, 06:40.

Sven-Kristjan Bormann

Join Date: Jul 2018
Posts: 310
#4

04 Apr 2020, 12:23

Your code in #1 can't work because you are missing a comma after the inlist() condition; the options have to follow a comma.

Code:

 
  oaxaca logGRSSWK1 i.WHITE i.dAGES11 i.dAGES12 i.dAGES13 i.dRES1 i.dRES2 i.dRES3 i.dREGWKR1 i.dREGWKR2 i.dIND1 i.dIND1 if MALE == 1 & inlist(DISTYPE,1,2), model2(heckman, twostep select(lfp = i.WHITE i.dAGES11 i.dAGES12 i.dAGES13 i.dRES1 i.dRES2 i.dRES3 i.MARRIAGE1 i.MARRIAGE2)) pooled
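As a point of reference, here is a minimal sketch of the overall shape such a call takes. It uses the example dataset that accompanies oaxaca rather than the data in this thread, so lnwage, female, lfp and the other variable names are stand-ins, and it assumes oaxaca has been installed from SSC and that the dataset URL is reachable. Note that by() is supplied and that all options follow a single comma.

```stata
* Assumes -ssc install oaxaca- has been run once
use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear

* Threefold decomposition of the log wage gap by sex.
* Group 2 (female == 1) is estimated with a two-step Heckman model,
* group 1 (female == 0) with the default OLS regression.
oaxaca lnwage educ exper tenure, by(female) ///
    model2(heckman, twostep select(lfp = age agesq married divorced kids6 kids714))
```

Using model1() as well would apply the selection model to the first group too, as the later posts in this thread do.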

If you want to compare changes over time, then you should run the Smith-Welch decomposition. In Stata, you can use the smithwelch-command which is also by Ben Jann. The code for the decomposition in your case could look like this:

Code:

ssc install smithwelch // in case you don't have the command yet
heckman logGRSSWK1 i.WHITE i.dAGES11 i.dAGES12 i.dAGES13 i.dRES1 i.dRES2 i.dRES3 i.dREGWKR1 i.dREGWKR2 i.dIND1 i.dIND1 if MALE == 1 & DISTYPE==1 & quarter1 /* assuming this is your time variable */, select(lfp = i.WHITE i.dAGES11 i.dAGES12 i.dAGES13 i.dRES1 i.dRES2 i.dRES3 i.MARRIAGE1 i.MARRIAGE2) twostep
est store dist1q1

heckman logGRSSWK1 i.WHITE i.dAGES11 i.dAGES12 i.dAGES13 i.dRES1 i.dRES2 i.dRES3 i.dREGWKR1 i.dREGWKR2 i.dIND1 i.dIND1 if MALE == 1 & DISTYPE==1 & quarter5 /* assuming this is your time variable */, select(lfp = i.WHITE i.dAGES11 i.dAGES12 i.dAGES13 i.dRES1 i.dRES2 i.dRES3 i.MARRIAGE1 i.MARRIAGE2) twostep
est store dist1q5

heckman logGRSSWK1 i.WHITE i.dAGES11 i.dAGES12 i.dAGES13 i.dRES1 i.dRES2 i.dRES3 i.dREGWKR1 i.dREGWKR2 i.dIND1 i.dIND1 if MALE == 1 & DISTYPE==2 & quarter1 /* assuming this is your time variable */, select(lfp = i.WHITE i.dAGES11 i.dAGES12 i.dAGES13 i.dRES1 i.dRES2 i.dRES3 i.MARRIAGE1 i.MARRIAGE2) twostep
est store dist2q1

heckman logGRSSWK1 i.WHITE i.dAGES11 i.dAGES12 i.dAGES13 i.dRES1 i.dRES2 i.dRES3 i.dREGWKR1 i.dREGWKR2 i.dIND1 i.dIND1 if MALE == 1 & DISTYPE==2 & quarter5 /* assuming this is your time variable */, select(lfp = i.WHITE i.dAGES11 i.dAGES12 i.dAGES13 i.dRES1 i.dRES2 i.dRES3 i.MARRIAGE1 i.MARRIAGE2) twostep
est store dist2q5

smithwelch dist1q1 dist2q1 dist1q5 dist2q5


Will Murphy

Join Date: Feb 2020
Posts: 52
#5

04 Apr 2020, 14:03

Originally posted by Sven-Kristjan Bormann

Your code in #1 can't work because you are missing a comma after the inlist() condition; the options have to follow a comma.


Thank you very much for the helpful reply and code, Sven. Regarding

Code:

 smithwelch dist1q1 dist2q1 dist1q5 dist2q5

it resulted in 'invalid syntax'. I then double checked it wasn't me via :

Code:

 smithwelch _est_dist1q1 _est_dist2q1 _est_dist1q5 _est_dist2q5

which also did not work. I checked the stored estimates and they're 1s and 0s; is this fine?

Could any help be provided to resolve this? It would be greatly appreciated.


Sven-Kristjan Bormann

Join Date: Jul 2018

Posts: 310
#6

04 Apr 2020, 17:59

You did not give an example of your data, so my code is just an idea of what might work. In particular, I have no idea what your quarter variable looks like, so that might be a source of error.

I checked the estimates stored and they're 1s and 0s, is this fine?

I don't understand what you mean.
Will Murphy

Join Date: Feb 2020

Posts: 52
#7

05 Apr 2020, 04:01

Originally posted by Sven-Kristjan Bormann

You did not give an example of your data, so my code is just an idea of what might work. In particular, I have no idea what your quarter variable looks like, so that might be a source of error.
I don't understand what you mean.

My apologies, an example of my dataset is presented below. Variable names originally ended in, e.g., 5 to denote the quarter 5 value, so I had to reshape long.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float id byte quarter float logGRSSWK byte DISTYPE
32 1 6.175867 4
32 5 6.263398 4
38 1 5.749393 4
38 5 5.910797 4
end
label values DISTYPE DISTYPE
label def DISTYPE 4 "Non-disabled", modify

Please disregard the question regarding 1s and 0s.
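For anyone following along, the reshape step mentioned above can be sketched roughly as follows. The wide layout is reconstructed by hand from the four rows in the example, and the wide variable names (logGRSSWK1, logGRSSWK5, DISTYPE1, DISTYPE5) are assumed from the naming pattern described in this thread rather than taken from the actual dataset.

```stata
clear
* Hypothetical wide layout: one row per person, suffix 1 or 5 marks the quarter
input float id float logGRSSWK1 float logGRSSWK5 byte DISTYPE1 byte DISTYPE5
32 6.175867 6.263398 4 4
38 5.749393 5.910797 4 4
end

* Stack the suffixed variables; the suffix becomes the new variable quarter
reshape long logGRSSWK DISTYPE, i(id) j(quarter)

list, sepby(id)
```

After this, quarter takes the values 1 and 5 and can be used in an if condition.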
Will Murphy

Join Date: Feb 2020

Posts: 52
#8

05 Apr 2020, 04:17

If I may, and to avoid taking up any more of your time, I have two final follow-up questions about your oaxaca command, because I wish to compare the results with Smith-Welch and assess discrimination.

1. After taking your advice on board from #4, I also wanted to implement a 'by quarter' command, however the following two commands I tried each proved unsuccessful:

Code:

by quarter DISTYPE, sort: oaxaca $wageeq if inlist(DISTYPE,1,2), model1(heckman, twostep select($seleq)) model2(heckman, twostep select($seleq)) pooled relax
oaxaca $wageeq if inlist(DISTYPE,1,2), by(DISTYPE quarter) model1(heckman, twostep select($seleq)) model2(heckman, twostep select($seleq)) pooled relax

The former said by() required whilst the latter said that by() too many variables specified.
Is there any solution to this problem?

2. I have naively used 'pooled' in my oaxaca commands to ensure the twofold decomposition occurs whereby I can obtain explained and unexplained components. I have read online resources to understand what pooled implies however they are difficult to apply to my current context. I was wondering if any clarity could be provided?

Many thanks.
Sven-Kristjan Bormann

Join Date: Jul 2018

Posts: 310
#9

05 Apr 2020, 15:41

Regarding your first question: I don't understand why you want to avoid the required by() option. The by() option takes exactly one binary variable as the group indicator.
The only way around this restriction is to rewrite the oaxaca command so that it allows more variables in by(), or to remove the by() requirement altogether.
Why don't you want to run oaxaca separately for each quarter, as in the code below?

Code:

oaxaca $wageeq if inlist(DISTYPE,1,2) & quarter==1, by(DISTYPE) model1(heckman, twostep select($seleq)) model2(heckman, twostep select($seleq)) pooled relax
oaxaca $wageeq if inlist(DISTYPE,1,2) & quarter==5, by(DISTYPE) model1(heckman, twostep select($seleq)) model2(heckman, twostep select($seleq)) pooled relax
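The same idea can be written as a loop that stores each period's results for later comparison. This is only a sketch on the example dataset shipped with oaxaca: the quarter variable is generated at random purely for illustration, the variable names are those of the example data rather than of this thread, and it assumes oaxaca is installed from SSC.

```stata
* Assumes -ssc install oaxaca- has been run once
use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear

* Hypothetical period indicator taking the values 1 and 5 (illustration only)
set seed 12345
generate byte quarter = cond(runiform() < 0.5, 1, 5)

* One decomposition per period, each with exactly one binary by() variable
foreach q in 1 5 {
    oaxaca lnwage educ exper tenure if quarter == `q', by(female) ///
        model2(heckman, twostep select(lfp = age agesq married divorced kids6 kids714))
    estimates store oax_q`q'
}

* List the stored results
estimates dir
```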

Regarding your second question:

I have read online resources to understand what pooled implies however they are difficult to apply to my current context.

I don't understand what you mean. Why are they difficult to apply to your current context?
Will Murphy

Join Date: Feb 2020

Posts: 52
#10

06 Apr 2020, 03:34

Originally posted by Sven-Kristjan Bormann

Regarding your second question: I don't understand what you mean. Why are they difficult to apply to your current context?

Thank you for the reply. I wish to assess the explained and unexplained components of the wage gap, which I believe means I would have to run a twofold oaxaca decomposition. I was blindly/ naively attempting to use 'pooled' to enable this without knowing what its inclusion does.

Jann (2008) states: "The pooled option also causes the coefficients from a pooled model to be used, but now the pooled model also contains a group membership indicator". Apologies if I am being foolish, but I don't understand in my context of assessing the male wage gap of Non-disabled (DISTYPE =1) and work-limited disabled (DISTYPE=2) (bearing in mind DISTYPE can also equal 3), what the pooled model would be? Because when I attempt to use this code:

Originally posted by Sven-Kristjan Bormann

oaxaca $wageeq if inlist(DISTYPE,1,2) & quarter==1, by(DISTYPE) model1(heckman, twostep select($seleq)) model2(heckman, twostep select($seleq)) pooled relax

I get the issue of coexistence of factor-variable and time-series operators, despite generating dummy variables for categorical ones and implementing them in my wage and select equations, e.g:

Code:

tab AGES, gen(dAGES)
global wageeq "i.WHITE dAGES1 dAGES2 dAGES3..

So, using 'xi:' at the start but with 'noisily' instead of 'relax' (due to some zero variance coefficients in both models) works BUT only as a regression model when doing a threefold decomposition

Code:

xi: oaxaca logGRSSWK $wageeq if inlist(DISTYPE,1,4) & quarter==1, by(DISTYPE) model1(heckman, twostep select($seleq)) model2(heckman, twostep select($seleq)) noisily

and at the end of the regression coefficient tabulations I get:

Code:

dropped coefficients or zero variances encountered
specify -relax- to ignore
r(499);

When using the same code with 'relax noisily', as well as 'pooled noisily' for the twofold, I get the error:

Code:

Dependent variable never censored because of selection:
model would simplify to OLS regression
r(498);

Is there anything I can do to at least run a valid threefold oaxaca with heckman? At the moment it is being treated like a plain regression model. Perhaps I could merge categories of the categorical variables so that no resulting dummy variable has a zero variance coefficient? Thank you.
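One thing that might be worth checking when r(498) appears is whether each group's estimation sample actually contains observations with a missing outcome, since heckman needs some censored cases to estimate the selection equation. A rough sketch of such a check on the example dataset shipped with oaxaca (variable names are those of the example data, not of this thread):

```stata
use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear

* Cross-tabulate the group variable against the participation indicator
tabulate female lfp, missing

* Count observations without an observed wage in each group
count if female == 0 & missing(lnwage)
display "female == 0: " r(N) " observations with missing lnwage"
count if female == 1 & missing(lnwage)
display "female == 1: " r(N) " observations with missing lnwage"
```

If a count is zero for a group within the if condition actually used, the selection model for that group has nothing to correct, which would be consistent with the message above.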

Last edited by Will Murphy; 06 Apr 2020, 04:01.
Will Murphy

Join Date: Feb 2020

Posts: 52
#11

06 Apr 2020, 04:45

To note: using a threefold oaxaca with heckman and 'relax' instead of 'noisily' does yield the threefold oaxaca tables, but they do not reflect any changes that heckman should induce, and Stata reports that both model 1 and model 2 have zero variance coefficients.
Also, apologies for any confusion: DISTYPE takes the value 1 if work-limited disabled, 2 if daily-activity-limited disabled and 4 if non-disabled. I know my earlier code used different values, but that was to simplify.

Last edited by Will Murphy; 06 Apr 2020, 05:39.

Will Murphy

Join Date: Feb 2020
Posts: 52

#12

06 Apr 2020, 07:27

Originally posted by Sven-Kristjan Bormann

Why don't you want to run oaxaca separate for each quarter like the code below?

Code:

oaxaca $wageeq if inlist(DISTYPE,1,2) & quarter==1, by(DISTYPE) model1(heckman, twostep select($seleq)) model2(heckman, twostep select($seleq)) pooled relax
oaxaca $wageeq if inlist(DISTYPE,1,2) & quarter==5, by(DISTYPE) model1(heckman, twostep select($seleq)) model2(heckman, twostep select($seleq)) pooled relax

Apologies Sven, I have been trying to solve my questions and can now provide a clearer question if ok? Instead of using 'xi' I have used 'normalize' in my wage equation such that by running the following:

Code:

oaxaca logGRSSWK $wageeq if inlist(DISTYPE,1,4) & quarter == 1, by(DISTYPE) ///
    model1(heckman, twostep select($seleq)) model2(heckman, twostep select($seleq)) noisily relax

I obtain a usable threefold decomposition (with heckman). The Heckman estimation does run for models 1 and 2 (only model 1 is shown below):

HTML Code:

Model for group 1
note: dWHITE2 dropped because of collinearity
note: dAGE1 dropped because of collinearity
note: dEDUCATION3 dropped because of collinearity
note: dMARITALSTATUS3 dropped because of collinearity
note: dCHILDREN4 dropped because of collinearity
note: dRESIDENCE3 dropped because of collinearity
note: dWORKREGION2 dropped because of collinearity
note: dWORKREGION6 dropped because of collinearity
note: dEDUCATION2 dropped because of collinearity

Heckman selection model -- two-step estimates   Number of obs     =        299
(regression model with sample selection)              Selected    =         62
                                                      Nonselected =        237

                                                Wald chi2(22)     =      49.38
                                                Prob > chi2       =     0.0007

---------------------------------------------------------------------------------
      logGRSSWK |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
logGRSSWK       |
        dWHITE2 |   1.382135   .5484004     2.52   0.012     .3072904     2.45698
          dAGE2 |   .2769868   .5706989     0.49   0.627    -.8415625    1.395536
          dAGE3 |  -.2430792   .5757544    -0.42   0.673    -1.371537    .8853787
          dAGE4 |    .043933   .5370769     0.08   0.935    -1.008718    1.096584
          dAGE5 |   -.301713   .4878076    -0.62   0.536    -1.257798    .6543723
    dRESIDENCE1 |  -.5625492   .3800708    -1.48   0.139    -1.307474    .1823758
    dRESIDENCE2 |  -1.428457     .71982    -1.98   0.047    -2.839278   -.0176359
    dRESIDENCE4 |   .1827752   .4717347     0.39   0.698    -.7418078    1.107358
    dRESIDENCE5 |  -.5827875   .4025659    -1.45   0.148    -1.371802    .2062272
   dWORKREGION3 |   .6196988   .7701736     0.80   0.421    -.8898136    2.129211
   dWORKREGION4 |  -.8755692   .4103622    -2.13   0.033    -1.679864   -.0712741
   dWORKREGION5 |  -.5952795   .4452076    -1.34   0.181     -1.46787    .2773113
     dINDUSTRY3 |  -.3751538   .2356986    -1.59   0.111    -.8371145    .0868069
     dINDUSTRY4 |  -.2147377   .2406301    -0.89   0.372    -.6863641    .2568887
     dINDUSTRY5 |  -.0851482   .2495602    -0.34   0.733    -.5742771    .4039808
     dINDUSTRY6 |   .1276264   .2239924     0.57   0.569    -.3113906    .5666435
     dINDUSTRY7 |  -.1199129   .2576881    -0.47   0.642    -.6249722    .3851464
     dINDUSTRY8 |  -.6659574   .4706148    -1.42   0.157    -1.588345    .2564308
     dINDUSTRY9 |  -.4692162   .3326219    -1.41   0.158    -1.121143    .1827106
    dEDUCATION1 |   .1984278   .1459934     1.36   0.174    -.0877141    .4845697
    dJOBTENURE2 |  -.1833522   .2226105    -0.82   0.410    -.6196606    .2529563
    dJOBTENURE3 |   .2315784   .1918933     1.21   0.228    -.1445255    .6076823
          _cons |   5.929038   .9664496     6.13   0.000     4.034831    7.823244
----------------+----------------------------------------------------------------
select          |
        dWHITE1 |  -.7267648   .7198301    -1.01   0.313    -2.137606    .6840764
          dAGE2 |   1.473688   .6904321     2.13   0.033     .1204661     2.82691
          dAGE3 |   1.410782   .6285173     2.24   0.025     .1789109    2.642653
          dAGE4 |   .8470128   .6171357     1.37   0.170     -.362551    2.056577
          dAGE5 |   .4545676   .6441714     0.71   0.480     -.807985     1.71712
    dRESIDENCE1 |  -.4479788   .2637305    -1.70   0.089     -.964881    .0689235
    dRESIDENCE2 |  -.2876624   .2768774    -1.04   0.299     -.830332    .2550073
    dRESIDENCE4 |  -.0811826   .4403324    -0.18   0.854    -.9442182     .781853
    dRESIDENCE5 |  -.2821816   .3914729    -0.72   0.471    -1.049454    .4850911
    dRESIDENCE6 |  -.9099623   .3272936    -2.78   0.005    -1.551446   -.2684787
    dEDUCATION1 |  -.0309179   .2009303    -0.15   0.878    -.4247341    .3628983
dMARITALSTATUS1 |   .4888559   .2661098     1.84   0.066    -.0327098    1.010422
dMARITALSTATUS2 |   .4727092   .3014175     1.57   0.117    -.1180582    1.063477
     dCHILDREN1 |   -.772282   .4433357    -1.74   0.082    -1.641204      .09664
     dCHILDREN2 |  -.2712228   .4939069    -0.55   0.583    -1.239262    .6968169
     dCHILDREN3 |   .2120907   .5079802     0.42   0.676    -.7835321    1.207714
     dCHILDREN5 |  -5.346187          .        .       .            .           .
          _cons |   -.868004   .6779183    -1.28   0.200    -2.196699    .4606914
----------------+----------------------------------------------------------------
mills           |
         lambda |  -.4767588   .2459464    -1.94   0.053    -.9588049    .0052873
----------------+----------------------------------------------------------------
            rho |   -0.84768
          sigma |  .56242887
---------------------------------------------------------------------------------
(dRESIDENCE3 dWORKREGION2 dWORKREGION6 dEDUCATION2 dropped from model 1)
(model 1 has zero variance coefficients)

But the results for the oaxaca decomposition part (shown below) I believe only show the results for the selected observations of DISTYPE = 1 (62) and not the total number of observations:

HTML Code:

Blinder-Oaxaca decomposition                    Number of obs     =        594
                                                  Model           =     linear
Group 1: DISTYPE = 1                              N of obs 1      =         62
Group 2: DISTYPE = 4                              N of obs 2      =        532

------------------------------------------------------------------------------
   logGRSSWK |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |   6.568103   .2910431    22.57   0.000     5.997669    7.138537
     group_2 |   6.511034   .0700499    92.95   0.000     6.373738    6.648329
  difference |   .0570693   .2993544     0.19   0.849    -.5296546    .6437932
  endowments |  -.0384598   .0360136    -1.07   0.286    -.1090451    .0321256
coefficients |  -.0064642   .2993338    -0.02   0.983    -.5931478    .5802193
 interaction |   .1019933   .0706072     1.44   0.149    -.0363943    .2403809

I was wondering if this is actually correct or not (I am under the impression it is the latter, perhaps resulting from both models having zero variance coefficients)?
Apologies once again.


Sven-Kristjan Bormann

Join Date: Jul 2018

Posts: 310
#13

06 Apr 2020, 15:02

But the results for the oaxaca decomposition part (shown below) I believe only show the results for the selected observations of DISTYPE = 1 (62) and not the total number of observations:

Which number of observations did you expect? I don't think that you have shown before a description of your dataset (e.g. the output of the describe-command).
Maybe you need to post also the output from model 2 to get a complete picture of the situation.

At the moment, I am also not sure if you still want an answer to your previous questions.
Will Murphy

Join Date: Feb 2020

Posts: 52
#14

06 Apr 2020, 15:33

Originally posted by Sven-Kristjan Bormann

Which number of observations did you expect?

I may be wrong but I was under the impression that the Oaxaca part (the 2nd HTML code of #12) would have said 'N of obs 1 = 299' (the 62 selected and the 237 non selected but accounted for by Heckman (in the 1st HTML code #12))? I may be mistaken but I thought only observing the 62 selected disregards the use of heckman here to provide an estimation of the wages of the non selected (i.e those currently not in work)?

If I am right regarding this please let me know and I will show the description of the dataset/ whatever is needed.

Originally posted by Sven-Kristjan Bormann

At the moment, I am also not sure if you still want an answer to your previous questions.

My apologies, my line of methodology is to compare the uncorrected with the corrected oaxaca (hence my thinking that my oaxaca finding from #12 using heckman will be no different to it without) and therefore was wondering what the use of 'pooled' in a code such as:

Code:

oaxaca $wageeq if inlist(DISTYPE,1,4) & quarter == 1, by(DISTYPE) pooled relax

implies? I understand it uses coefficients from a pooled model over both groups as reference coefficients. Am I right in thinking both groups here implies DISTYPE =1 and DISTYPE =4?

Any clarity regarding these two issues, especially the former, would be greatly appreciated. Thank you.
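For context, the twofold decomposition as described in Jann (2008) can be written relative to a reference coefficient vector. The notation here is an assumption for illustration: group 1 and group 2 are the two values of the by() variable left in the estimation sample (here DISTYPE = 1 and DISTYPE = 4, given the if condition), and the bars denote group means.

```latex
\[
\bar{Y}_1 - \bar{Y}_2
  = \underbrace{(\bar{X}_1 - \bar{X}_2)'\beta^{*}}_{\text{explained}}
  + \underbrace{\bar{X}_1'(\beta_1 - \beta^{*}) + \bar{X}_2'(\beta^{*} - \beta_2)}_{\text{unexplained}},
\qquad
\beta^{*} = w\,\beta_1 + (1 - w)\,\beta_2 \ \text{ under weight}(w).
\]
```

With 'pooled', the reference coefficients instead come from a single regression over the observations of both groups, with the group indicator included as an extra regressor, so observations with other DISTYPE values excluded by the if condition should play no part in it.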

Will Murphy

Join Date: Feb 2020
Posts: 52

#15

07 Apr 2020, 15:11

Sven, please forgive me for continuing to answer my own questions. My former query in #14 can be resolved by comparing the coefficients with and without heckman: whilst the number of observations in the oaxaca part stays the same, the coefficients for both groups change, which I believe is the result of heckman.

Regarding the latter, I am now aware that there is a problem within Stata whereby oaxaca restricts the sample for the pooled model to observations for which wages are not missing- so the pooled model does not work with heckman. Instead I will use e.g weight(0).
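To illustrate how the choice of reference coefficients changes the twofold split, here is a small sketch on the example dataset shipped with oaxaca, without any selection correction. It assumes oaxaca is installed from SSC; the variable names belong to the example data, not to this thread.

```stata
* Assumes -ssc install oaxaca- has been run once
use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear

* Reference coefficients from a pooled model that includes the group indicator
oaxaca lnwage educ exper tenure, by(female) pooled

* Reference coefficients equal to the group 2 coefficients
oaxaca lnwage educ exper tenure, by(female) weight(0)

* Reference coefficients equal to the group 1 coefficients
oaxaca lnwage educ exper tenure, by(female) weight(1)
```

The overall difference is the same in all three runs; only its division into explained and unexplained parts moves with the reference coefficients.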

So my only question in this post (and my final one in this thread) is that nearly all of my p-values are statistically insignificant, including those for the most important terms, e.g.:

Code:

oaxaca logGRSSWK $wageeq if inlist(DISTYPE,1,4) & quarter == 1, by(DISTYPE) ///
    model1(heckman, twostep select($seleq)) model2(heckman, twostep select($seleq)) ///
    weight(0) noisily relax

HTML Code:

Blinder-Oaxaca decomposition                    Number of obs     =        566
                                                  Model           =     linear
Group 1: DISTYPE = 1                              N of obs 1      =         58
Group 2: DISTYPE = 4                              N of obs 2      =        508

------------------------------------------------------------------------------
   logGRSSWK |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
overall      |
     group_1 |   6.452112   .2393385    26.96   0.000     5.983017    6.921207
     group_2 |   6.461553   .0648995    99.56   0.000     6.334352    6.588753
  difference |  -.0094405   .2479816    -0.04   0.970    -.4954755    .4765945
   explained |  -.0314521   .0372299    -0.84   0.398    -.1044212    .0415171
 unexplained |   .0220116   .2491664     0.09   0.930    -.4663457    .5103689

Is there anything I can do to test/ resolve this issue? Thanks.
