Oaxaca Blinder decomposition - different number of observations with svy

Fabiola

Join Date: Aug 2014
Posts: 13

23 Aug 2023, 07:55

Dear all,
I'm using the oaxaca command in Stata 18 with the svy() option and a subpopulation.
I noticed that some observations are being excluded from the models for both Group 1 and Group 2. In addition, the decomposition reports the number of observations in the total sample rather than in the intended subpopulation.
My subpopulation's sample size is 14,367, with Group 1 (lower education) comprising 2,192 observations and Group 2 comprising 12,175 observations.
When I run the decomposition without the svy() option, I get the correct subpopulation sample size in all the models.
I would greatly appreciate your guidance and advice regarding this issue.
Here are the numbers of observations reported by the regression for each group and by the oaxaca decomposition.

Thanks in advance for your help.

Total sample

Code:

svy, subpop(if subpop==1): logistic self age sex income badl visit eat prot_d2 dent

Code:

Survey: Logistic regression

Number of strata =   574                        Number of obs   =       90,846
Number of PSUs   = 8,027                        Population size =  168,426,190
                                                Subpop. no. obs =       14,367
                                                Subpop. size    = 21,722,187.6
                                                Design df       =        7,453
                                                F(8, 7446)      =        75.68
                                                Prob > F        =       0.0000

Group 1

Code:

svy, subpop(if subpop==1 & ses==0): logistic self age sex income badl visit eat prot_d2 dent

Code:

Number of strata =   457                        Number of obs   =       80,899
Number of PSUs   = 7,085                        Population size =  135,901,805
                                                Subpop. no. obs =        2,192
                                                Subpop. size    = 2,505,225.62
                                                Design df       =        6,628
                                                F(8, 6621)      =        22.36
                                                Prob > F        =       0.0000

Group 2

Code:

svy, subpop(if subpop==1 & ses==1): logistic self age sex income badl visit eat prot_d2 dent

Code:

Number of strata =   573                         Number of obs   =      90,789
Number of PSUs   = 8,022                         Population size = 168,374,254
                                                 Subpop. no. obs =      12,175
                                                 Subpop. size    =  19,216,962
                                                 Design df       =       7,449
                                                 F(8, 7442)      =       55.80
                                                 Prob > F        =      0.0000

Here is the code I used for the decomposition

Code:

oaxaca self age sex income badl visit eat prot_d2 dent, ///
by(ses) logit weight(0) svy(,subpop(subpop)) noisily cformat(%4.3f)

Here is the output for the number of observations generated by the decomposition

Code:

Model for group 1
(running logit on estimation sample)

Survey: Logistic regression

Number of strata =   456                        Number of obs   =       80,842
Number of PSUs   = 7,080                        Population size =  135,849,869
                                                Subpop. no. obs =        2,183
                                                Subpop. size    = 2,497,605.89
                                                Design df       =        6,624
                                                F(8, 6617)      =        22.28
                                                Prob > F        =       0.0000

Note: 117 strata omitted because they contain no subpopulation members.

Model for group 2
(running logit on estimation sample)

Survey: Logistic regression

Number of strata =   456                         Number of obs   =      80,842
Number of PSUs   = 7,080                         Population size = 135,849,869
                                                 Subpop. no. obs =      10,365
                                                 Subpop. size    =  14,703,327
                                                 Design df       =       6,624
                                                 F(8, 6617)      =       49.14
                                                 Prob > F        =      0.0000

Blinder-Oaxaca decomposition

Number of strata =   456                       Number of obs     =      80,842
Number of PSUs   = 7,080                       Population size   = 135,849,869
                                               Design df         =       6,624
                                               Model             =       logit
Group 1: ses = 0                               N of obs 1        =       7,353
Group 2: ses = 1                               N of obs 2        =      73,489

    explained: (X1 - X2) * b2
  unexplained: X1 * (b1 - b2)

George Ford

Join Date: Aug 2014

Posts: 3338
#2

23 Aug 2023, 09:07

If you look at ereturn list, you'll see the actual number of observations in the subpop: e(N_sub)
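
As an illustration of where e(N_sub) shows up, here is a minimal sketch on Stata's bundled NHANES II example data, which ships with its survey design already declared. The dataset, the subpopulation (female) and the model variables are stand-ins chosen for the example; they are not the original poster's data.

```stata
* Illustration only: NHANES II example data, not the poster's survey.
webuse nhanes2, clear

* Subpopulation estimation: the full design is kept for variance estimation,
* while the model is fit to the subpopulation members.
svy, subpop(female): logistic highbp age weight

* Everything the command stored, including the subpopulation counts.
ereturn list

* e(N) counts the observations in the estimation sample (the whole design),
* e(N_sub) counts the observations that belong to the subpopulation.
display "Number of obs   = " e(N)
display "Subpop. no. obs = " e(N_sub)
```

With svy, "Number of obs" and "Subpop. no. obs" answer different questions, so a large "Number of obs" is not by itself a sign that the subpopulation was ignored.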
Fabiola

Join Date: Aug 2014

Posts: 13
#3

29 Aug 2023, 06:18

Dear George,

Thank you for your assistance.

I followed your suggestion and examined the ereturn list, and the number of observations in each group matches what is shown in the output. However, the decomposition still appears to use the overall sample instead of the subpopulation, and some observations are being dropped from the groups.

If you have any further guidance, it would be greatly appreciated.

Best regards
George Ford

Join Date: Aug 2014

Posts: 3338
#4

03 Sep 2023, 12:01

I used the auto.dta file.

After a svy estimation, ereturn list shows e(N_sub).

I see that oaxaca does not provide this, though.

I'm not sure why observations are being deleted, or whether they actually are.
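
The outputs in post #1 allow a quick arithmetic check on whether observations really go missing. The sketch below uses only figures copied from those outputs. The reading suggested in the comments (that the decomposition keeps only the strata retained by both separate group models) is an assumption that happens to fit the arithmetic; nothing in this thread confirms that this is how oaxaca builds its estimation sample.

```stata
* All figures are copied from the outputs in post #1.

* What the separate Group 2 model leaves out relative to the full design.
display 574 - 573
display 8027 - 8022
display 90846 - 90789

* Take the same amounts away from the separate Group 1 model: the results
* equal the strata, PSUs and observations that oaxaca reports for both groups.
assert 457   - (574 - 573)     == 456
assert 7085  - (8027 - 8022)   == 7080
assert 80899 - (90846 - 90789) == 80842

* "N of obs 1" and "N of obs 2" in the decomposition header add up to the
* estimation sample, not to the subpopulation counts of the two group models.
assert 7353 + 73489 == 80842
display 2183 + 10365
```

If that reading is right, the header counts describe the whole estimation sample by group, while the subpopulation counts (2,183 and 10,365) are the ones that shrank relative to the separate models (2,192 and 12,175).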

Last edited by George Ford; 03 Sep 2023, 12:08.
George Ford

Join Date: Aug 2014

Posts: 3338
#5

03 Sep 2023, 12:04

What is the definition of "subpop" in the svy, subpop(subpop) part?

g

Fabiola

Join Date: Aug 2014
Posts: 13

04 Sep 2023, 11:34

Dear George,
Thank you for the follow-up.
The "subpop" variable indicates the observations to be included in my analyses.
It appears there was an inconsistency between versions of the command. I reran the analyses using a previous version, which yielded the correct number of observations.
Thank you very much for your assistance.
Best regards,

Code:

Model for group 1
(running logit on estimation sample)
 
Survey: Logistic regression
 
Number of strata =   457                        Number of obs   =       80,899
Number of PSUs   = 7,085                        Population size =  135,901,805
                                                Subpop. no. obs =        2,192
                                                Subpop. size    = 2,505,225.62
                                                Design df       =        6,628
                                                F(8, 6621)      =        22.36
                                                Prob > F        =       0.0000
 
 
Model for group 2
(running logit on estimation sample)
 
Survey: Logistic regression
 
Number of strata =   573                         Number of obs   =      90,789
Number of PSUs   = 8,022                         Population size = 168,374,254
                                                 Subpop. no. obs =      12,175
                                                 Subpop. size    =  19,216,962
                                                 Design df       =       7,449
                                                 F(8, 7442)      =       55.80
                                                 Prob > F        =      0.0000
 
Blinder-Oaxaca decomposition

Number of strata =   574                        Number of obs   =       90,846
Number of PSUs   = 8,027                        Population size =  168,426,190
                                                Design df       =        7,453
                                                Model           =        logit
Group 1: escol2_oaxaca = 0                      N of obs 1      =         2192
Group 2: escol2_oaxaca = 1                      N of obs 2      =        12175
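
Because the post above attributes the difference to the installed version of the command, it can help to record which oaxaca.ado Stata is actually running before comparing outputs. A minimal sketch, assuming nothing about where or how the command was installed:

```stata
* Show the path and the version line of the oaxaca.ado that Stata finds first.
* capture noisily prints the result and keeps a do-file running if it is absent.
capture noisily which oaxaca
if _rc display as text "oaxaca was not found on this machine"
```

Pasting that version line next to the output makes it easier for others to reproduce the numbers.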

Fabiola

Join Date: Aug 2014
Posts: 13

11 Oct 2024, 07:49

Dear @George Ford,

I recently updated the oaxaca command and I'm encountering the same issue as before regarding the number of observations in my analyses. I'm using Stata 18.

My subpopulation has 22,728 observations, with 18,011 in group 1 and 4,717 in group 2.

However, when I run the Oaxaca decomposition, the numbers of observations do not match. Specifically:

In the model for group 1, observations are lost from the subpopulation: the count should be 18,011, but I only see 16,990.
In the decomposition model, the numbers of observations for the two groups are not the subpopulation counts. They should be 18,011 for group 1 and 4,717 for group 2.

Could you please provide guidance on how to resolve this issue?

Thank you for your assistance.

Here are the subpopulation size and the number of individuals in each group I want to analyze.

Code:

 svy, subpop(if v6>=60 & v7==1): tab group, obs percent format (%12.1f)
 
Number of strata =   574                        Number of obs   =       90,846
Number of PSUs   = 8,027                        Population size =  168,426,190
                                                Subpop. no. obs =       22,728
                                                Subpop. size    =   34,398,853
                                                Design df       =        7,453

group                          percentage         obs
Some schooling (group 1)             83.2     18011.0
No schooling   (group 2)             16.8      4717.0
Total                               100.0     22728.0

Code:

   oaxaca outcome v1 v2 v3 v4, by(group) logit weight(1) svy(,subpop(if v6>=60 & v7==1)) noisily

Code:

  
Model for group 1
(running logit on estimation sample)

Survey: Logistic regression

Number of strata =   522                        Number of obs   =       87,235
Number of PSUs   = 7,672                        Population size =  153,627,429
                                                Subpop. no. obs =       16,990
                                                Subpop. size    = 25,504,979.2
                                                Design df       =        7,150
                                                F(4, 7147)      =       191.74
                                                Prob > F        =       0.0000

Model for group 2
(running logit on estimation sample)

Survey: Logistic regression

Number of strata =   522                        Number of obs   =       87,235
Number of PSUs   = 7,672                        Population size =  153,627,429
                                                Subpop. no. obs =        4,717
                                                Subpop. size    = 5,787,182.59
                                                Design df       =        7,150
                                                F(4, 7147)      =         8.87
                                                Prob > F        =       0.0000

Blinder-Oaxaca decomposition

Number of strata =   522                        Number of obs   =       87,235
Number of PSUs   = 7,672                        Population size =  153,627,429
                                                Design df       =        7,150
                                                Model           =        logit
Group 1: group = 0                              N of obs 1      =       79,622
Group 2: group = 1                              N of obs 2      =        7,613

If I run the logit model separately for each group, I get the correct numbers:

Code:

  svy, subpop(if v6>=60 & v7==1 & group==1): logit outcome v1 v2 v3 v4

Code:

  
Number of strata =   522                        Number of obs   =       87,235
Number of PSUs   = 7,672                        Population size =  153,627,429
                                                Subpop. no. obs =        4,717
                                                Subpop. size    = 5,787,182.59
                                                Design df       =        7,150
                                                F(4, 7147)      =         8.87
                                                Prob > F        =       0.0000

Code:

 svy, subpop(if v6>=60 & v7==1 & group==0): logit outcome v1 v2 v3 v4

Code:

Number of strata =   574                        Number of obs   =       90,846
Number of PSUs   = 8,027                        Population size =  168,426,190
                                                Subpop. no. obs =       18,011
                                                Subpop. size    = 28,611,670.4
                                                Design df       =        7,453
                                                F(4, 7450)      =       172.14
                                                Prob > F        =       0.0000

The same is true when I run oaxaca without the svy() option: although the number of observations for group 1 is lower, the subpopulation sample is correct.
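
In the outputs above, the decomposition reports the same 522 strata and 87,235 observations as the separate model for group==1, while the separate model for group==0 keeps all 574 strata. One possible explanation, which the thread does not confirm, is that strata without subpopulation members of one group are dropped for both groups. The sketch below shows one way to look for such strata. It runs on Stata's bundled NHANES II data, and the subpopulation (insub), the group variable (grp) and their definitions are hypothetical stand-ins; with real data, the variables from one's own svyset and oaxaca call would take their place.

```stata
* Illustration only: NHANES II example data with stand-in definitions.
webuse nhanes2, clear
generate byte insub = age >= 60      // hypothetical subpopulation indicator
generate byte grp   = (race != 1)    // hypothetical 0/1 group variable

* Per stratum, count the subpopulation members that fall in each group.
egen n_g0 = total(insub == 1 & grp == 0), by(strata)
egen n_g1 = total(insub == 1 & grp == 1), by(strata)

* Mark one observation per stratum, then list the strata (if any) that
* lack subpopulation members of at least one group.
egen byte first_in_stratum = tag(strata)
list strata n_g0 n_g1 if first_in_stratum & (n_g0 == 0 | n_g1 == 0), noobs

* Subpopulation members by group, restricted to strata where both groups occur.
tabulate grp if insub == 1 & n_g0 > 0 & n_g1 > 0
```

If the last tabulation on the real data reproduced the subpopulation counts shown in the oaxaca output (16,990 and 4,717 here), that would support the explanation; if not, missing values in the model variables would be the next thing to check.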
