subpop() vs. over, when calculating population standard deviations

Jack Worden

Join Date: Mar 2018
Posts: 2

19 Mar 2018, 06:33

Dear all,

I'm working with survey data in Stata 13.1 and trying to calculate weighted population standard deviations (SDs) for subpopulations that are not random samples of the overall survey population (a population-based cohort with oversampling on a specific trait). The svyset is simply:

Code:

svyset _n [pw=weights]

However, I get slightly different results depending on whether and how the subpop() option of svy or the over() option is used. This happens in the following two situations, where some data points are missing and not imputed (assume the sampling weights are never missing; example code below):

1. When the subpopulation status (e.g. disease status, coded as 0/1, variable name z) is unknown for certain individuals (coded as .), and the standard deviation of a continuous variable x (which has no missing values) is calculated for category z=1 in the following three ways:

Code:

* a)
svy: mean x, over(z)
estat sd
* b)
svy, subpop(z): mean x
estat sd
* c)
svy, subpop(if z==1): mean x
estat sd

Options a) and b) automatically drop individuals with missing subpopulation status from the estimation sample and produce the same SD, whereas option c) keeps these individuals in the estimation sample and therefore gives a different SD.

2. When the subpopulation status (again z, coded 0/1) is known for everyone, but variable x has missing values (coded as .). Here, options b) and c) produce the same answer, but a) does not. Side note: unsurprisingly, when situations 1 and 2 are combined (e.g. partially overlapping missing values in both z and x), the three approaches each produce a slightly different SD.
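
To see where the three calls diverge, it may help to compare the estimation-sample sizes that each one stores. The following is only a minimal sketch for situation 1, assuming the nhanes2 example data used further down in this post (already svyset) and assuming that svy: mean leaves e(N) and, when subpop() is used, e(N_sub) behind as stored results:

Code:

use http://www.stata-press.com/data/r13/nhanes2, clear
replace female = . if _n < 2500

* a) over(): group membership taken from female
svy: mean bmi, over(female)
display "a) obs in estimation sample: " e(N)

* b) subpop(varname): members are nonzero, nonmissing values of female
svy, subpop(female): mean bmi
display "b) obs in estimation sample: " e(N) ", subpop obs: " e(N_sub)

* c) subpop(if exp): missing female evaluates as "not female==1"
svy, subpop(if female==1): mean bmi
display "c) obs in estimation sample: " e(N) ", subpop obs: " e(N_sub)

If the subpopulation counts agree while the estimation-sample counts do not, the SD differences would seem to come from which non-members are retained, not from the subpopulation itself.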

While the estimated SDs are quite similar, I'm at a loss as to which approach produces the technically correct answer. I could find other Statalist forum posts that discuss why a difference can arise (e.g. https://www.statalist.org/forums/for...-subpop-is-set, https://www.statalist.org/forums/for...-in-the-manual, https://www.statalist.org/forums/for...subpop-or-over), but these do not discuss which approach is technically preferable. Could anyone shed light on which one this would be for each situation, and for when both situations occur at the same time (i.e. their combination)?

Best, Jack

==
Example code (I see slightly larger differences in my own dataset)
Situation 1

Code:

. use http://www.stata-press.com/data/r13/nhanes2

. replace female=. if _n<2500
(2499 real changes made, 2499 to missing)

. svy: mean bmi, over(female)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      24        Number of obs    =      7852
Number of PSUs   =      48        Population size  =  87791137
                                  Design df        =        24

            0: female = 0
            1: female = 1

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
bmi          |
           0 |   25.56599   .0824397      25.39584    25.73614
           1 |   25.00118   .1103948      24.77333    25.22902
--------------------------------------------------------------

. estat sd

            0: female = 0
            1: female = 1

-------------------------------------
        Over |       Mean   Std. Dev.
-------------+-----------------------
bmi          |
           0 |   25.56599     3.93822
           1 |   25.00118    5.363834
-------------------------------------

. svy, subpop(female): mean bmi
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      24        Number of obs    =      7852
Number of PSUs   =      48        Population size  =  87791137
                                  Subpop. no. obs  =      4115
                                  Subpop. size     =  45674517
                                  Design df        =        24

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         bmi |   25.00118   .1103948      24.77333    25.22902
--------------------------------------------------------------

. estat sd

-------------------------------------
             |       Mean   Std. Dev.
-------------+-----------------------
         bmi |   25.00118    5.363834
-------------------------------------

. svy, subpop(if female==1): mean bmi
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      24        Number of obs    =      7952
Number of PSUs   =      48        Population size  =  89058869
                                  Subpop. no. obs  =      4115
                                  Subpop. size     =  45674517
                                  Design df        =        24

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         bmi |   25.00118   .1103948      24.77333    25.22902
--------------------------------------------------------------
Note: 7 strata omitted because they contain no subpopulation
      members.

. estat sd

-------------------------------------
             |       Mean   Std. Dev.
-------------+-----------------------
         bmi |   25.00118    5.396689
-------------------------------------

Situation 2

Code:

. use http://www.stata-press.com/data/r13/nhanes2

. replace bmi=. if _n<2500
(2499 real changes made, 2499 to missing)

. svy: mean bmi, over(female)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      24        Number of obs    =      7852
Number of PSUs   =      48        Population size  =  87791137
                                  Design df        =        24

            0: female = 0
            1: female = 1

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
bmi          |
           0 |   25.56599   .0824397      25.39584    25.73614
           1 |   25.00118   .1103948      24.77333    25.22902
--------------------------------------------------------------

. estat sd

            0: female = 0
            1: female = 1

-------------------------------------
        Over |       Mean   Std. Dev.
-------------+-----------------------
bmi          |
           0 |   25.56599     3.93822
           1 |   25.00118    5.363834
-------------------------------------

. svy, subpop(female): mean bmi
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      24        Number of obs    =      7898
Number of PSUs   =      48        Population size  =  88372231
                                  Subpop. no. obs  =      4115
                                  Subpop. size     =  45674517
                                  Design df        =        24

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         bmi |   25.00118   .1103948      24.77333    25.22902
--------------------------------------------------------------
Note: 7 strata omitted because they contain no subpopulation
      members.

. estat sd

-------------------------------------
             |       Mean   Std. Dev.
-------------+-----------------------
         bmi |   25.00118    5.386902
-------------------------------------

. svy, subpop(if female==1): mean bmi
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      24        Number of obs    =      7898
Number of PSUs   =      48        Population size  =  88372231
                                  Subpop. no. obs  =      4115
                                  Subpop. size     =  45674517
                                  Design df        =        24

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         bmi |   25.00118   .1103948      24.77333    25.22902
--------------------------------------------------------------
Note: 7 strata omitted because they contain no subpopulation
      members.

. estat sd

-------------------------------------
             |       Mean   Std. Dev.
-------------+-----------------------
         bmi |   25.00118    5.386902
-------------------------------------

Jack Worden

Join Date: Mar 2018
Posts: 2

20 Mar 2018, 03:23

Just to add detail to the example, showing how the differences between the estimated SDs can become more noticeable. Here, situations 1 and 2 are combined, with partial overlap between the individuals who have missing values in the subpopulation status (here female) and those with missing values in the variable for which the SD is to be calculated (here bmi). Any advice on the correct option to take would be highly appreciated.

Code:

. use http://www.stata-press.com/data/r13/nhanes2

. replace bmi=. if age<35
(3213 real changes made, 3213 to missing)

. replace female=. if age>25&age<45
(3151 real changes made, 3151 to missing)

. gen bmim=1 if !missing(bmi)
(3213 missing values generated)

. replace bmim=0 if missing(bmi)
(3213 real changes made)

. gen fem=1 if !missing(female)
(3151 missing values generated)

. replace fem=0 if missing(female)
(3151 real changes made)

. tab fem bmim

           |         bmim
       fem |         0          1 |     Total
-----------+----------------------+----------
         0 |     1,757      1,394 |     3,151 
         1 |     1,456      5,744 |     7,200 
-----------+----------------------+----------
     Total |     3,213      7,138 |    10,351 


. svy: mean bmi, over(female)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      31        Number of obs    =      5744
Number of PSUs   =      62        Population size  =  50396391
                                  Design df        =        31

            0: female = 0
            1: female = 1

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
bmi          |
           0 |   26.03383   .0973661      25.83525    26.23241
           1 |   26.33067   .1571655      26.01013    26.65121
--------------------------------------------------------------

. estat sd

            0: female = 0
            1: female = 1

-------------------------------------
        Over |       Mean   Std. Dev.
-------------+-----------------------
bmi          |
           0 |   26.03383    3.971603
           1 |   26.33067     5.52048
-------------------------------------

. svy, subpop(female): mean bmi
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      31        Number of obs    =      6444
Number of PSUs   =      62        Population size  =  60230083
                                  Subpop. no. obs  =      3029
                                  Subpop. size     =  26689508
                                  Design df        =        31

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         bmi |   26.33067   .1571655      26.01013    26.65121
--------------------------------------------------------------

. estat sd

-------------------------------------
             |       Mean   Std. Dev.
-------------+-----------------------
         bmi |   26.33067    5.697827
-------------------------------------

. svy, subpop(if female==1): mean bmi
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      31       Number of obs    =       9595
Number of PSUs   =      62       Population size  =  106800264
                                 Subpop. no. obs  =       3029
                                 Subpop. size     =   26689508
                                 Design df        =         31

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         bmi |   26.33067   .1571655      26.01013    26.65121
--------------------------------------------------------------

. estat sd

-------------------------------------
             |       Mean   Std. Dev.
-------------+-----------------------
         bmi |   26.33067    6.217745
-------------------------------------
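
One way to take the missing-data handling out of Stata's hands is to build the subpopulation indicator explicitly, so that it is clear which observations are treated as non-members and which are dropped. The following is a sketch under assumptions, not a settled answer: the variable names insub_drop and insub_keep are made up for illustration, the data are assumed to be in the state produced by the replace commands above, and whether the srssubpop option of estat sd (documented in [SVY] estat as basing the SD on SRS sampling within the subpopulation rather than from the entire population) is the appropriate choice here is something to verify rather than take as given.

Code:

* indicator that is missing when female is missing
* (such observations are not usable as members or non-members)
generate byte insub_drop = (female == 1) if !missing(female)

* indicator that is 0 when female is missing
* (such observations are treated as non-members)
generate byte insub_keep = (female == 1)

svy, subpop(insub_drop): mean bmi
estat sd
estat sd, srssubpop

svy, subpop(insub_keep): mean bmi
estat sd
estat sd, srssubpop

Comparing the default and srssubpop results across the two indicators would show whether the discrepancy depends on the estimation sample or on the SRS variance that estat sd uses.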

Megan Bronson

Join Date: Feb 2019

Posts: 2
#3

01 Feb 2019, 08:45

Hi Jack, I'm currently having this same issue. Did you ever find out which approach produces the technically correct answer?
Thanks,
Megan