Confidence intervals for weighted means: bootstrap, delta-method, or something else?

Stephanie Retz

Join Date: Feb 2018

Posts: 37
#1

Confidence intervals for weighted means: bootstrap, delta-method, or something else?

06 Aug 2018, 07:36

Dear all,

Hopefully, you can help me calculating confidence intervals for weighted means using post-stratification weights.

I am working with cross-sectional individual-level survey data in Stata 15 on Windows.

I have calculated post-stratification weights using Nicholas Winter's -survwgt rake- . Now, I would like to calculate weighted means by age and sex based on these post-stratification weights, including confidence intervals. Actually, I have calculated several sets of post-stratification weights (let's call them poststratificationweight_1 and poststratificationweight_2) and would like to compare means based on them. Hence, I would prefer a method that does not require for the data to be collapsed.

To calculate the weighted means, I simply used the _gwteman package (ssc inst _gwtmean) and typed:

Code:

egen x_mean_1 = wtmean(x), by(sex age) weight(poststratificationweight_1) egen x_mean_2 = wtmean(x), by(sex age) weight(poststratificationweight_2)

Yet I do not know how to calculate the responding confidence intervals. Would it be possible to bootstrap even though I want to use weights? If so, how could I do that in Stata? Also, I read that one could use the delta-method. I read up on this method, but don't know how to apply it to my specific problem.

Thank you for your help!

Best,
Stephanie

Last edited by Stephanie Retz; 06 Aug 2018, 07:40.
Tags: bootstrap, ci, weights
Clyde Schechter

Join Date: Apr 2014

Posts: 30121
#2

06 Aug 2018, 10:28

Well, if you use the -svy: mean- command, the weighted means as well as other statistics, including the confidence interval are displayed to the Results window, and are also saved in a matrix r(table). Then it's just a question of creating variables for the means and confidence limits and populating them with the results from the appropriate columns of r(table).
Comment
Stephanie Retz

Join Date: Feb 2018

Posts: 37
#3

06 Aug 2018, 13:30

Thanks Clyde!

So I would simply type:

Code:

svyset [pweight=poststratificationweight_1] // I do not have information on PSU for all my countries svy: mean x, over(sex age) matrix means1 = e(b) // something to extract confidence intervals and means based on first weight svyset, clear svyset [pweight=poststratificationweight_2] svy: mean x, over(sex age) matrix means2 = e(b) // something to extract confidence intervals and means based on second weight svyset, clear

Could you maybe advise me on how I would get the confidence intervals into a variable upper_ci and lower_ci?

And: I heard that one should use bootstrapping for CIs, if the distribution of the underlying data is not clear. In the state manual for svy it says that the "default method for variance estimation is vce(bootstrap)" (page 78 here https://www.stata.com/manuals13/svy.pdf). So are the CIs I get from - svy: mean - already based on bootstrapping? If so, how can I control the number of bootstrap samples? Or do I have to use - svy bootstrap -?

Thank you, Stephanie
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30121
#4

06 Aug 2018, 13:56

No, e(b) just gives you the means. You want r(table). Here's an example using a Stata web-based data set:

Code:

webuse nhanes2, clear svyset svy: mean age, over(race) matrix M = r(table) matrix list M gen mean = . gen lower_ci = . gen upper_ci = . levelsof race, local(races) foreach r of local races { local racelabel: label (race) `r' local rr: colnumb M "age:`racelabel'" replace mean = M[1, `rr'] if race == `r' replace lower_ci = M[5, `rr'] if race == `r' replace upper_ci = M[6, `rr'] if race == `r' }

(This data set already comes -svyset-. Evidently you have to write your own -svyset- commands, and do the above twice, once for each set of post-stratification weights. Don't forget to use different variable names for the results with each set of post-stratification weights.)

I think you have misunderstood the Stata 13 manual section you quote. In that particular data set and with that particular -svyset- command, bootstrap resampling weights were specified, so under those circumstances, Stata defaults to bootstrapping for the CI's. But in your data, if you do not have bootstrap resampling weights, you will not get bootstrap sampling by default; in fact you will not be able to get it at all. If you do have bootstrap resampling weights, the number of bootstrap samples is the same as the number of bootstrap resampling weight variables.
Comment
Stephanie Retz

Join Date: Feb 2018

Posts: 37
#5

06 Aug 2018, 14:19

Thank you for clarifying all of this - one last question: How are the CIs then calculated? Does - svy: mean - assume that my data is nornmally distributed?
Best,
Stephanie
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30121
#6

06 Aug 2018, 14:26

No, Stata is definitely not assuming normality. You can see that in the results from the code in #4: if you calculate mean + 1.96 standard_error (the result you would get assuming normality) it comes out different from the confidence limits it gives you.

But I cannot find in the manual any statement of what the method of calculating the standard errors actually is. I imagine it is the delta-method, just because that is the usual default with svy-based estimation. But I'm not completely certain.
Comment
Stephanie Retz

Join Date: Feb 2018

Posts: 37
#7

07 Aug 2018, 00:14

Thanks Clyde - all your comments were very helpful.
Comment
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#8

07 Aug 2018, 14:17

Clyde is correct in saying that the default method for calculating survey standard errors is the delta method. In the survey literature (and the manual) the method is called linearization. There is a section in the Survey Manual called "variance estimation", which shows some of the formulas. The important fact is that survey variance estimation, whether by linearization or by replication, is based on variation between primary sampling units (PSUs), not on variation in an assumed theoretical distribution. The name for this approach is "design-based", as opposed to "model-based".

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
1 like
Comment
Stephanie Retz

Join Date: Feb 2018

Posts: 37
#9

08 Aug 2018, 02:04

Thank you Steve!
Comment
Elizabeth Boyle

Join Date: Jun 2019

Posts: 6
#10

11 Jun 2019, 10:30

I'm trying to create a graph of mean age at first marriage with confidence intervals by sample. I think the code Clyde has provided is just what I need, but I'm having trouble getting it to run.

I'm working with survey data that used a two-stage cluster sampling design. Thus, to get correct confidence intervals, I have to use svyset.

I replaced the following variables names: "race" with "sample" and "age" with "agefrstmar".

The code fails at the line: local rr: colnumb. I get the message "colnumb not allowed."

Can someone tell me why my code is failing?

Also, can you help me with the graph syntax? I think what I've got may only work on collapsed data.

Code:

svyset, clear svyset [pw=perweight], psu(idhspsu) strata(idhsstrata) svy: mean agefrstmar, over(sample) matrix M = r(table) matrix list M gen mean = . gen lower_ci = . gen upper_ci = . levelsof sample, local(samples) foreach r of local samples { local samplelabel: label (sample) `r' local rr: colnumb M "agefrstmar:`samplelabel'" replace mean = M[1, `rr'] if sample == `r' replace lower_ci = M[5, `rr'] if sample == `r' replace upper_ci = M[6, `rr'] if sample == `r' } graph twoway (bar mean sample)(rcap upper_ci lower_ci sample)
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30121

#11

11 Jun 2019, 10:50

I cannot reproduce your problem. You did not provide example data that demonstrates it (though I doubt the data is the issue here). So I used one of the Stata on-line survey data sets and renamed some variables to test it:

Code:

. webuse nhanes2, clear

. 
. rename age agefrstmar

. rename race sample

. 
. svy: mean agefrstmar, over(sample)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      31      Number of obs   =       10,351
Number of PSUs   =      62      Population size =  117,157,513
                                Design df       =           31

        White: sample = White
        Black: sample = Black
        Other: sample = Other

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
agefrstmar   |
       White |   42.52564   .3203993      41.87218     43.1791
       Black |   40.21349   .5807288      39.02908    41.39789
       Other |   40.46675    2.36883      35.63549    45.29801
--------------------------------------------------------------

. matrix M = r(table)

. matrix list M

M[9,3]
         agefrstmar:  agefrstmar:  agefrstmar:
              White        Black        Other
     b    42.525637    40.213487    40.466751
    se    .32039929    .58072875    2.3688303
     t      132.727    69.246593     17.08301
pvalue    2.790e-44    1.494e-35    2.514e-17
    ll    41.872179    39.029083     35.63549
    ul    43.179096    41.397891    45.298013
    df           31           31           31
  crit    2.0395134    2.0395134    2.0395134
 eform            0            0            0

. 
. gen mean = .
(10,351 missing values generated)

. gen lower_ci = .
(10,351 missing values generated)

. gen upper_ci = .
(10,351 missing values generated)

. 
. levelsof sample, local(samples)
1 2 3

. foreach r of local samples {
  2.     local samplelabel: label (sample) `r'
  3.     local rr: colnumb M "agefrstmar:`samplelabel'"
  4.     replace mean  = M[1, `rr'] if sample == `r'
  5.     replace lower_ci = M[5, `rr'] if sample == `r'
  6.     replace upper_ci = M[6, `rr'] if sample == `r'
  7. }
(9,065 real changes made)
(9,065 real changes made)
(9,065 real changes made)
(1,086 real changes made)
(1,086 real changes made)
(1,086 real changes made)
(200 real changes made)
(200 real changes made)
(200 real changes made)

. 
. graph twoway (bar mean sample)(rcap upper_ci lower_ci sample)

Are you sure that the code you show is exactly what you ran?

Another possibility comes to my mind. When you copy code from web-based sources (including Statalist) or word processing documents, sometimes it gets contaminated with non-printing characters that our eyes do not see, but Stata does. Thus what you and I are reading as colnumb Stata may be seeing as something different. If so, deleting that line of code and re-typing it in by hand from the keyboard will solve the problem. (But if this is what has happened, don't be surprised if the same problem pops up again later in the code. The solution will be the same.)

Comment

Elizabeth Boyle

Join Date: Jun 2019
Posts: 6

#12

11 Jun 2019, 14:53

Thanks for the response, Clyde.

I tried retyping the code--that did not make any difference.

Here's what my output looks like:

Code:

. svyset, clear

.
. svyset [pw=perweight], psu(idhspsu) strata(idhsstrata)

      pweight: perweight
          VCE: linearized
  Single unit: missing
     Strata 1: idhsstrata
         SU 1: idhspsu
        FPC 1: <zero>

.
. svy: mean agefrstmar, over(sample)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =     840        Number of obs   =     34,633
Number of PSUs   =   2,086        Population size = 35,163.791
                                  Design df       =      1,246

    _subpop_1: sample = rwanda 1992
    _subpop_2: sample = rwanda 2000
    _subpop_3: sample = rwanda 2005
    _subpop_4: sample = rwanda 2010
    _subpop_5: sample = rwanda 2014

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
agefrstmar   |
   _subpop_1 |   19.45068   .0857405      19.28246    19.61889
   _subpop_2 |   19.84861   .0501684      19.75019    19.94704
   _subpop_3 |   19.99498   .0487283      19.89938    20.09058
   _subpop_4 |   20.50262   .0528505      20.39894    20.60631
   _subpop_5 |   20.92382   .0505139      20.82472    21.02292
--------------------------------------------------------------

. matrix M = r(table)

. matrix list M

M[9,5]
         agefrstmar:  agefrstmar:  agefrstmar:  agefrstmar:  agefrstmar:
          _subpop_1    _subpop_2    _subpop_3    _subpop_4    _subpop_5
     b    19.450675    19.848611    19.994977    20.502623    20.923818
    se    .08574054    .05016842    .04872834    .05285046    .05051393
     t    226.85506    395.63953    410.33569    387.93652    414.21876
pvalue            0            0            0            0            0
    ll    19.282463    19.750188    19.899378    20.398937    20.824717
    ul    19.618887    19.947035    20.090575    20.606308     21.02292
    df         1246         1246         1246         1246         1246
  crit    1.9618697    1.9618697    1.9618697    1.9618697    1.9618697
 eform            0            0            0            0            0

.
. gen mean = .
(55,461 missing values generated)

. gen lower_ci = .
(55,461 missing values generated)

. gen upper_ci = .
(55,461 missing values generated)

.
. levelsof sample, local(samples)
64601 64602 64603 64604 64605

. foreach r of local samples {
  2.     local samplelabel: label (sample) `r'
  3.         local rr: colnumb M "agefrstmar:`samplelabel'"
  4.     replace mean  = M[1, `rr'] if sample == `r'
  5.     replace lower_ci = M[5, `rr'] if sample == `r'
  6.     replace upper_ci = M[6, `rr'] if sample == `r'
  7. }
colnumb not allowed
r(101);

Perhaps it does have something to do with the data. One thing I notice is that your output displays the race labels, White, Black, Other, while my output names each sample _subpop_1, etc. Could that be a clue into the problem?

I notice my "sample" variable is "long" type, which makes it unique from the other variables in the file, which are mostly "byte" types. Stata won't allow me to convert it to byte, which isn't surprising. Not clear to me why this would matter though. It still has discrete values.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30121
#13

11 Jun 2019, 17:38

Oh, I think I see the problem. Your sample variable is very different from the variable that plays the analogous role in my code, because its labeled values are not legal names. I think the problem here is that Stata is seeing things like

Code:

local rr: colnumb M "agefrstmar:64601"

and instead of telling you that there is no column in the data named agefrstmar:64601 it is giving you a confusing error message that -colunumb- is not allowed (which his not true!)

So you need to change the code a bit to accommodate the actual column labeling in your matrix M.

So replace everything from -levelsof sample, local(samples)- on with:

Code:

local ncols: colsof M forvalues rr = 1/`ncols' { replace mean = M[1, `rr'] if sample == 6640`rr' replace lower_ci = M[5, `rr'] if sample == 6640`rr' replace upper_ci = M[6, `rr'] if sample == 6640`rr' }

and I think it will run.
Comment
Elizabeth Boyle

Join Date: Jun 2019

Posts: 6
#14

12 Jun 2019, 11:10

Unfortunately, that also did not work. I get essentially the same message, this time that colsof not allowed. Just to check, this is how I revised the code:

Code:

levelsof sample, local(samples) local ncols: colsof M forvalues rr = 1/`ncols' { replace mean = M[1, `rr'] if sample == 6640`rr' replace lower_ci = M[5, `rr'] if sample == 6640`rr' replace upper_ci = M[6, `rr'] if sample == 6640`rr' }

The value labels for sample are country years, e.g., "Rwanda 1992", "Rwanda 2000". What makes labels illegal? I've tried generating a new sample variable with new labels (no spaces, underscores instead of spaces, etc.) That's not working either. I continue to get _subpop_# in the column headings instead of the value labels.

I very much appreciate your help with this.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30121

#15

12 Jun 2019, 12:38

There's nothing illegal about value labels. But "Rwanda 1992" can't be a column label in a matrix because it contains a blank space."

I have found the problem--but not the solution. The following code (which changes the last three lines from what was shown in #14, but those changes are after the error message you are getting) runs without error, on my computer as you can see:

Code:

. about

Stata/MP 15.1 for Windows (64-bit x86-64)
Revision 21 Mar 2019
Copyright 1985-2017 StataCorp LLC

Total physical memory:     8269900 KB
Available physical memory: 2174316 KB

Single-user 2-core Stata perpetual license:
       Serial number:  REDACTED
         Licensed to:  Clyde Schechter
                       Albert Einstein College of Medicine



. webuse nhanes2, clear

. 
. clonevar sample = race

. label define sample     1       "Rwanda 1992" ///
>                                         2       "Rwanda 2000"   ///
>                                         3       "Rwanda 2005"

. label values sample sample

. tab sample

   1=white, |
   2=black, |
    3=other |      Freq.     Percent        Cum.
------------+-----------------------------------
Rwanda 1992 |      9,065       87.58       87.58
Rwanda 2000 |      1,086       10.49       98.07
Rwanda 2005 |        200        1.93      100.00
------------+-----------------------------------
      Total |     10,351      100.00

. 
. rename age agefrstmar

. 
. svy: mean agefrstmar, over(sample)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      31      Number of obs   =       10,351
Number of PSUs   =      62      Population size =  117,157,513
                                Design df       =           31

    _subpop_1: sample = Rwanda 1992
    _subpop_2: sample = Rwanda 2000
    _subpop_3: sample = Rwanda 2005

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
agefrstmar   |
   _subpop_1 |   42.52564   .3203993      41.87218     43.1791
   _subpop_2 |   40.21349   .5807288      39.02908    41.39789
   _subpop_3 |   40.46675    2.36883      35.63549    45.29801
--------------------------------------------------------------

. 
. matrix M = r(table)

. matrix list M

M[9,3]
         agefrstmar:  agefrstmar:  agefrstmar:
          _subpop_1    _subpop_2    _subpop_3
     b    42.525637    40.213487    40.466751
    se    .32039929    .58072875    2.3688303
     t      132.727    69.246593     17.08301
pvalue    2.790e-44    1.494e-35    2.514e-17
    ll    41.872179    39.029083     35.63549
    ul    43.179096    41.397891    45.298013
    df           31           31           31
  crit    2.0395134    2.0395134    2.0395134
 eform            0            0            0

. 
. local ncols: colsof M

. gen mean = .
(10,351 missing values generated)

. gen lower_ci = .
(10,351 missing values generated)

. gen upper_ci = .
(10,351 missing values generated)

. forvalues rr = 1/`ncols' {
  2.     replace mean  = M[1, `rr'] if sample == `rr'
  3.     replace lower_ci = M[5, `rr'] if sample == `rr'
  4.     replace upper_ci = M[6, `rr'] if sample == `rr'
  5. }
(9,065 real changes made)
(9,065 real changes made)
(9,065 real changes made)
(1,086 real changes made)
(1,086 real changes made)
(1,086 real changes made)
(200 real changes made)
(200 real changes made)
(200 real changes made)

.

But I ported the code to another machine that runs version 13, and I get the same error that you do. Apparently local macro extended functions colsof and colnumb were not implemented back then. I'm sure there was some way to do the same things, but I don't remember what they were. My best guess is that either these macro extended functions existed under other names, or perhaps there were matrix functions that returned these things back then. I suggest you review -help extended_fcn- and -help matrix functions- to see if you can find what the version 13 analogs of these were.

Announcement