  • subpop() vs. over, when calculating population standard deviations

    Dear all,

    I'm working with survey data in Stata v13.1 and am trying to calculate weighted population standard deviations (SDs) for subpopulations that are not random samples of the wider survey population (a population-based cohort with oversampling of a specific trait). The svyset declaration is simply:
    Code:
    svyset _n [pw=weights]
    However, I get slightly different results depending on whether and how the subpop() option of svy or the over() option is used. This happens in the following two situations, where some data points are missing and are not imputed (assume that the sample weights are never missing; example code below):

    1. When the subpopulation status (e.g. disease status, coded 0/1, variable name z) is unknown for certain individuals (coded as .). The standard deviation of a continuous variable x (which has no missing values) for category z=1 can be calculated in the following three ways:
    Code:
    a) svy: mean x, over(z)
    b) svy, subpop(z): mean x
    c) svy, subpop(if z==1): mean x
    (each followed by) estat sd
    Options a) and b) automatically exclude individuals with missing subpopulation status when estimating the variance and produce the same SD, whereas option c) includes these individuals and therefore gives a different SD.

    2. When the subpopulation status (again z, coded 0/1) is known for everyone, but variable x has missing values (coded as .). Here, options b) and c) produce the same answer, but a) does not. Side note: unsurprisingly, when situations 1 and 2 are combined (e.g. partially overlapping missing values in both z and x), the three approaches each produce a slightly different SD.
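
    A minimal sketch of how to compare the estimation samples behind the three approaches, using the nhanes2 data from the example code below. The variable name sub_if is purely illustrative, and the sketch assumes that e(N) and e(N_sub) are stored by svy: mean in your Stata version. The expression (female == 1) evaluates to 0 rather than missing when female is missing, which is why an if-expression inside subpop() classifies those observations as non-members instead of dropping them:
    Code:
    use http://www.stata-press.com/data/r13/nhanes2, clear
    replace female = . if _n < 2500
    * (female == 1) is 0 when female is missing, so sub_if is never missing
    generate byte sub_if = (female == 1)
    tabulate female sub_if, missing
    * compare the estimation sample size across the three approaches
    svy: mean bmi, over(female)
    display "over(female):         N = " e(N)
    svy, subpop(female): mean bmi
    display "subpop(female):       N = " e(N) ", subpop N = " e(N_sub)
    svy, subpop(if female==1): mean bmi
    display "subpop(if female==1): N = " e(N) ", subpop N = " e(N_sub)
    Replacing the second line with "replace bmi = . if _n < 2500" gives the same comparison for situation 2.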

    While the estimated SDs are quite similar, I’m at a loss as to which approach produces the technically correct answer. I found other Statalist forum posts that discuss why a difference can arise (e.g. https://www.statalist.org/forums/for...-subpop-is-set, https://www.statalist.org/forums/for...-in-the-manual, https://www.statalist.org/forums/for...subpop-or-over), but these do not discuss which approach is technically preferable. Could anyone shed light on which approach is preferable in each situation, and when both situations occur at the same time (i.e. in combination)?
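
    One comparison that may help narrow this down is the srssubpop option of estat sd. According to the [SVY] estat documentation, estat sd by default bases the SD on an estimate of the SRS variance for sampling from the entire population, whereas srssubpop bases it on sampling within the subpopulation. A hedged sketch on the nhanes2 data (not the cohort described above), which only displays the two versions side by side and does not by itself establish which SD is technically correct:
    Code:
    use http://www.stata-press.com/data/r13/nhanes2, clear
    replace female = . if _n < 2500
    * default SD versus SD based on SRS within the subpopulation
    svy, subpop(female): mean bmi
    estat sd
    estat sd, srssubpop
    svy, subpop(if female==1): mean bmi
    estat sd
    estat sd, srssubpop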

    Best, Jack

    ==
    Example code (I see slightly larger differences in my own dataset)
    Situation 1
    Code:
    . use http://www.stata-press.com/data/r13/nhanes2
    
    . replace female=. if _n<2500
    (2499 real changes made, 2499 to missing)
    
    . svy: mean bmi, over(female)
    (running mean on estimation sample)
    
    Survey: Mean estimation
    
    Number of strata =      24        Number of obs    =      7852
    Number of PSUs   =      48        Population size  =  87791137
                                      Design df        =        24
    
                0: female = 0
                1: female = 1
    
    --------------------------------------------------------------
                 |             Linearized
            Over |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
    bmi          |
               0 |   25.56599   .0824397      25.39584    25.73614
               1 |   25.00118   .1103948      24.77333    25.22902
    --------------------------------------------------------------
    
    . estat sd
    
                0: female = 0
                1: female = 1
    
    -------------------------------------
            Over |       Mean   Std. Dev.
    -------------+-----------------------
    bmi          |
               0 |   25.56599     3.93822
               1 |   25.00118    5.363834
    -------------------------------------
    
    . svy, subpop(female): mean bmi
    (running mean on estimation sample)
    
    Survey: Mean estimation
    
    Number of strata =      24        Number of obs    =      7852
    Number of PSUs   =      48        Population size  =  87791137
                                      Subpop. no. obs  =      4115
                                      Subpop. size     =  45674517
                                      Design df        =        24
    
    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
             bmi |   25.00118   .1103948      24.77333    25.22902
    --------------------------------------------------------------
    
    . estat sd
    
    -------------------------------------
                 |       Mean   Std. Dev.
    -------------+-----------------------
             bmi |   25.00118    5.363834
    -------------------------------------
    
    . svy, subpop(if female==1): mean bmi
    (running mean on estimation sample)
    
    Survey: Mean estimation
    
    Number of strata =      24        Number of obs    =      7952
    Number of PSUs   =      48        Population size  =  89058869
                                      Subpop. no. obs  =      4115
                                      Subpop. size     =  45674517
                                      Design df        =        24
    
    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
             bmi |   25.00118   .1103948      24.77333    25.22902
    --------------------------------------------------------------
    Note: 7 strata omitted because they contain no subpopulation
          members.
    
    . estat sd
    
    -------------------------------------
                 |       Mean   Std. Dev.
    -------------+-----------------------
             bmi |   25.00118    5.396689
    -------------------------------------
    Situation 2
    Code:
    . use http://www.stata-press.com/data/r13/nhanes2
    
    . replace bmi=. if _n<2500
    (2499 real changes made, 2499 to missing)
    
    . svy: mean bmi, over(female)
    (running mean on estimation sample)
    
    Survey: Mean estimation
    
    Number of strata =      24        Number of obs    =      7852
    Number of PSUs   =      48        Population size  =  87791137
                                      Design df        =        24
    
                0: female = 0
                1: female = 1
    
    --------------------------------------------------------------
                 |             Linearized
            Over |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
    bmi          |
               0 |   25.56599   .0824397      25.39584    25.73614
               1 |   25.00118   .1103948      24.77333    25.22902
    --------------------------------------------------------------
    
    . estat sd
    
                0: female = 0
                1: female = 1
    
    -------------------------------------
            Over |       Mean   Std. Dev.
    -------------+-----------------------
    bmi          |
               0 |   25.56599     3.93822
               1 |   25.00118    5.363834
    -------------------------------------
    
    . svy, subpop(female): mean bmi
    (running mean on estimation sample)
    
    Survey: Mean estimation
    
    Number of strata =      24        Number of obs    =      7898
    Number of PSUs   =      48        Population size  =  88372231
                                      Subpop. no. obs  =      4115
                                      Subpop. size     =  45674517
                                      Design df        =        24
    
    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
             bmi |   25.00118   .1103948      24.77333    25.22902
    --------------------------------------------------------------
    Note: 7 strata omitted because they contain no subpopulation
          members.
    
    . estat sd
    
    -------------------------------------
                 |       Mean   Std. Dev.
    -------------+-----------------------
             bmi |   25.00118    5.386902
    -------------------------------------
    
    . svy, subpop(if female==1): mean bmi
    (running mean on estimation sample)
    
    Survey: Mean estimation
    
    Number of strata =      24        Number of obs    =      7898
    Number of PSUs   =      48        Population size  =  88372231
                                      Subpop. no. obs  =      4115
                                      Subpop. size     =  45674517
                                      Design df        =        24
    
    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
             bmi |   25.00118   .1103948      24.77333    25.22902
    --------------------------------------------------------------
    Note: 7 strata omitted because they contain no subpopulation
          members.
    
    . estat sd
    
    -------------------------------------
                 |       Mean   Std. Dev.
    -------------+-----------------------
             bmi |   25.00118    5.386902
    -------------------------------------

  • #2
    Just to add detail to the example, here is how the differences between the estimated SDs can become more noticeable. Situations 1 and 2 are combined, with overlap between the individuals who have missing values in the subpopulation status (here female) and those who have missing values in the variable for which the SD is to be calculated (here bmi). Any advice on the correct option to take would be highly appreciated.

    Code:
    . use http://www.stata-press.com/data/r13/nhanes2
    
    . replace bmi=. if age<35
    (3213 real changes made, 3213 to missing)
    
    . replace female=. if age>25&age<45
    (3151 real changes made, 3151 to missing)
    
    . gen bmim=1 if !missing(bmi)
    (3213 missing values generated)
    
    . replace bmim=0 if missing(bmi)
    (3213 real changes made)
    
    . gen fem=1 if !missing(female)
    (3151 missing values generated)
    
    . replace fem=0 if missing(female)
    (3151 real changes made)
    
    . tab fem bmim
    
               |         bmim
           fem |         0          1 |     Total
    -----------+----------------------+----------
             0 |     1,757      1,394 |     3,151 
             1 |     1,456      5,744 |     7,200 
    -----------+----------------------+----------
         Total |     3,213      7,138 |    10,351 
    
    
    . svy: mean bmi, over(female)
    (running mean on estimation sample)
    
    Survey: Mean estimation
    
    Number of strata =      31        Number of obs    =      5744
    Number of PSUs   =      62        Population size  =  50396391
                                      Design df        =        31
    
                0: female = 0
                1: female = 1
    
    --------------------------------------------------------------
                 |             Linearized
            Over |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
    bmi          |
               0 |   26.03383   .0973661      25.83525    26.23241
               1 |   26.33067   .1571655      26.01013    26.65121
    --------------------------------------------------------------
    
    . estat sd
    
                0: female = 0
                1: female = 1
    
    -------------------------------------
            Over |       Mean   Std. Dev.
    -------------+-----------------------
    bmi          |
               0 |   26.03383    3.971603
               1 |   26.33067     5.52048
    -------------------------------------
    
    . svy, subpop(female): mean bmi
    (running mean on estimation sample)
    
    Survey: Mean estimation
    
    Number of strata =      31        Number of obs    =      6444
    Number of PSUs   =      62        Population size  =  60230083
                                      Subpop. no. obs  =      3029
                                      Subpop. size     =  26689508
                                      Design df        =        31
    
    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
             bmi |   26.33067   .1571655      26.01013    26.65121
    --------------------------------------------------------------
    
    . estat sd
    
    -------------------------------------
                 |       Mean   Std. Dev.
    -------------+-----------------------
             bmi |   26.33067    5.697827
    -------------------------------------
    
    . svy, subpop(if female==1): mean bmi
    (running mean on estimation sample)
    
    Survey: Mean estimation
    
    Number of strata =      31       Number of obs    =       9595
    Number of PSUs   =      62       Population size  =  106800264
                                     Subpop. no. obs  =       3029
                                     Subpop. size     =   26689508
                                     Design df        =         31
    
    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
             bmi |   26.33067   .1571655      26.01013    26.65121
    --------------------------------------------------------------
    
    . estat sd
    
    -------------------------------------
                 |       Mean   Std. Dev.
    -------------+-----------------------
             bmi |   26.33067    6.217745
    -------------------------------------

    • #3
      Hi Jack, I'm currently having this same issue. Did you ever find out which approach produces the technically correct answer?
      Thanks,
      Megan
