Dear all,
I'm working with survey data in Stata v13.1 and trying to calculate weighted population standard deviations (SD) for subpopulations that are not random samples of the greater survey population (a population-based cohort with oversampling of a specific trait), for which svyset is simply:
Code:
svyset _n [pw=weights]
However, I find that slightly different results are obtained depending on whether and how the "svy, subpop()" prefix or the "over()" option is used, in the following two situations where some data points are missing and are not imputed (assume that the sample weights are never missing; example code below):
1. When the subpopulation status (e.g. disease status, coded 0/1, variable name z) is unknown for certain individuals (coded as .), and the standard deviation of a continuous variable x (which has no missing values) is calculated for category z=1 in the following 3 ways:
Code:
a) svy: mean x, over(z)
b) svy, subpop(z): mean x
c) svy, subpop(if z==1): mean x
(each followed by) estat sd
Options a) and b) automatically exclude individuals with missing subpopulation status when estimating the variance and produce the same SD, but option c) includes these individuals and therefore gives a different SD.
2. When the subpopulation status (again z, coded 0/1) is known for all, but variable x has missing values (coded as .). Here, options b) and c) produce the same answer, but a) does not.
Sidenote: unsurprisingly, when situations 1 and 2 are combined (e.g. partially overlapping missing values in both z and x), the three analysis options each produce a slightly different SD.
While the estimated SDs are quite similar, I’m a bit at a loss as to which approach produces the technically correct answer. I could find other Statalist forum posts that discuss why a difference can arise (e.g. https://www.statalist.org/forums/for...-subpop-is-set, https://www.statalist.org/forums/for...-in-the-manual, https://www.statalist.org/forums/for...subpop-or-over), but these do not discuss which approach is technically preferable. Could anyone shed light on which one this would be for each situation, and for the case where both situations occur at the same time (i.e. their combination)?
Best, Jack
==
Example code (I see slightly larger differences in my own dataset)
Situation 1
Code:
. use http://www.stata-press.com/data/r13/nhanes2

. replace female=. if _n<2500
(2499 real changes made, 2499 to missing)

. svy: mean bmi, over(female)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      24      Number of obs     =       7852
Number of PSUs   =      48      Population size   =   87791137
                                Design df         =         24

            0: female = 0
            1: female = 1

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
bmi          |
           0 |   25.56599   .0824397      25.39584    25.73614
           1 |   25.00118   .1103948      24.77333    25.22902
--------------------------------------------------------------

. estat sd

            0: female = 0
            1: female = 1

-------------------------------------
        Over |       Mean   Std. Dev.
-------------+-----------------------
bmi          |
           0 |   25.56599     3.93822
           1 |   25.00118    5.363834
-------------------------------------

. svy, subpop(female): mean bmi
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      24      Number of obs     =       7852
Number of PSUs   =      48      Population size   =   87791137
                                Subpop. no. obs   =       4115
                                Subpop. size      =   45674517
                                Design df         =         24

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         bmi |   25.00118   .1103948      24.77333    25.22902
--------------------------------------------------------------

. estat sd

-------------------------------------
             |       Mean   Std. Dev.
-------------+-----------------------
         bmi |   25.00118    5.363834
-------------------------------------

. svy, subpop(if female==1): mean bmi
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      24      Number of obs     =       7952
Number of PSUs   =      48      Population size   =   89058869
                                Subpop. no. obs   =       4115
                                Subpop. size      =   45674517
                                Design df         =         24

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         bmi |   25.00118   .1103948      24.77333    25.22902
--------------------------------------------------------------
Note: 7 strata omitted because they contain no subpopulation
      members.

. estat sd

-------------------------------------
             |       Mean   Std. Dev.
-------------+-----------------------
         bmi |   25.00118    5.396689
-------------------------------------
Situation 2
Code:
. use http://www.stata-press.com/data/r13/nhanes2

. replace bmi=. if _n<2500
(2499 real changes made, 2499 to missing)

. svy: mean bmi, over(female)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      24      Number of obs     =       7852
Number of PSUs   =      48      Population size   =   87791137
                                Design df         =         24

            0: female = 0
            1: female = 1

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
bmi          |
           0 |   25.56599   .0824397      25.39584    25.73614
           1 |   25.00118   .1103948      24.77333    25.22902
--------------------------------------------------------------

. estat sd

            0: female = 0
            1: female = 1

-------------------------------------
        Over |       Mean   Std. Dev.
-------------+-----------------------
bmi          |
           0 |   25.56599     3.93822
           1 |   25.00118    5.363834
-------------------------------------

. svy, subpop(female): mean bmi
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      24      Number of obs     =       7898
Number of PSUs   =      48      Population size   =   88372231
                                Subpop. no. obs   =       4115
                                Subpop. size      =   45674517
                                Design df         =         24

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         bmi |   25.00118   .1103948      24.77333    25.22902
--------------------------------------------------------------
Note: 7 strata omitted because they contain no subpopulation
      members.

. estat sd

-------------------------------------
             |       Mean   Std. Dev.
-------------+-----------------------
         bmi |   25.00118    5.386902
-------------------------------------

. svy, subpop(if female==1): mean bmi
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      24      Number of obs     =       7898
Number of PSUs   =      48      Population size   =   88372231
                                Subpop. no. obs   =       4115
                                Subpop. size      =   45674517
                                Design df         =         24

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         bmi |   25.00118   .1103948      24.77333    25.22902
--------------------------------------------------------------
Note: 7 strata omitted because they contain no subpopulation
      members.

. estat sd

-------------------------------------
             |       Mean   Std. Dev.
-------------+-----------------------
         bmi |   25.00118    5.386902
-------------------------------------
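A possible follow-up check for Situation 1, offered only as a sketch: since the three calls report different "Number of obs", it may help to list explicitly how many observations with missing female end up in each estimation sample. The snippet assumes that e(N) and e(sample) are available after svy: mean as they are after other estimation commands. The last line assumes that estat sd accepts the srssubpop option in this Stata version; the [SVY] estat entry should be checked before relying on it, and it is shown here only as something to compare against, not as the recommended answer.

```stata
* Sketch only: compare the estimation samples used by the three approaches.
* Situation 1 setup, as in the example above (nhanes2 is already svyset).
use http://www.stata-press.com/data/r13/nhanes2, clear
replace female = . if _n < 2500

* a) over(): how many obs are used, and how many of them have missing female
svy: mean bmi, over(female)
display "a) over(female):           N = " e(N)
count if e(sample) & missing(female)

* b) subpop(varname)
svy, subpop(female): mean bmi
display "b) subpop(female):         N = " e(N)
count if e(sample) & missing(female)

* c) subpop(if exp)
svy, subpop(if female==1): mean bmi
display "c) subpop(if female==1):   N = " e(N)
count if e(sample) & missing(female)

* Assumption: srssubpop is accepted by estat sd here; it is meant to base
* the SD on SRS sampling within the subpopulation rather than the full sample.
estat sd, srssubpop
```

If the counts differ across a), b) and c) in the way the "Number of obs" lines suggest, that would at least confirm that the SD differences track the size of the estimation sample rather than the subpopulation itself.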