Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Confidence intervals for weighted means: bootstrap, delta-method, or something else?

    Dear all,

    Hopefully, you can help me calculating confidence intervals for weighted means using post-stratification weights.

    I am working with cross-sectional individual-level survey data in Stata 15 on Windows.

    I have calculated post-stratification weights using Nicholas Winter's -survwgt rake- . Now, I would like to calculate weighted means by age and sex based on these post-stratification weights, including confidence intervals. Actually, I have calculated several sets of post-stratification weights (let's call them poststratificationweight_1 and poststratificationweight_2) and would like to compare means based on them. Hence, I would prefer a method that does not require for the data to be collapsed.

    To calculate the weighted means, I simply used the _gwteman package (ssc inst _gwtmean) and typed:

    Code:
    egen x_mean_1 = wtmean(x), by(sex age) weight(poststratificationweight_1)
    egen x_mean_2 = wtmean(x), by(sex age) weight(poststratificationweight_2)
    Yet I do not know how to calculate the responding confidence intervals. Would it be possible to bootstrap even though I want to use weights? If so, how could I do that in Stata? Also, I read that one could use the delta-method. I read up on this method, but don't know how to apply it to my specific problem.

    Thank you for your help!

    Best,
    Stephanie
    Last edited by Stephanie Retz; 06 Aug 2018, 07:40.

  • #2
    Well, if you use the -svy: mean- command, the weighted means as well as other statistics, including the confidence interval are displayed to the Results window, and are also saved in a matrix r(table). Then it's just a question of creating variables for the means and confidence limits and populating them with the results from the appropriate columns of r(table).

    Comment


    • #3
      Thanks Clyde!

      So I would simply type:

      Code:
      ​​​svyset [pweight=poststratificationweight_1]   // I do not have information on PSU for all my countries
      
      svy: mean x, over(sex age)
      
       matrix means1 = e(b)
      
      // something to extract confidence intervals and means based on first weight
      
      svyset, clear
      
      svyset [pweight=poststratificationweight_2]
      
      svy: mean x, over(sex age)
      
      matrix means2 = e(b)
      
      // something to extract confidence intervals and means based on second weight
      
      svyset, clear
      Could you maybe advise me on how I would get the confidence intervals into a variable upper_ci and lower_ci?

      And: I heard that one should use bootstrapping for CIs, if the distribution of the underlying data is not clear. In the state manual for svy it says that the "default method for variance estimation is vce(bootstrap)" (page 78 here https://www.stata.com/manuals13/svy.pdf). So are the CIs I get from - svy: mean - already based on bootstrapping? If so, how can I control the number of bootstrap samples? Or do I have to use - svy bootstrap -?

      Thank you, Stephanie

      Comment


      • #4
        No, e(b) just gives you the means. You want r(table). Here's an example using a Stata web-based data set:

        Code:
        webuse nhanes2, clear
        
        svyset
        
        svy: mean age, over(race)
        matrix M = r(table)
        matrix list M
        
        gen mean = .
        gen lower_ci = .
        gen upper_ci = .
        
        levelsof race, local(races)
        foreach r of local races {
            local racelabel: label (race) `r'
            local rr: colnumb M "age:`racelabel'"
            replace mean  = M[1, `rr'] if race == `r'
            replace lower_ci = M[5, `rr'] if race == `r'
            replace upper_ci = M[6, `rr'] if race == `r'
        }
        (This data set already comes -svyset-. Evidently you have to write your own -svyset- commands, and do the above twice, once for each set of post-stratification weights. Don't forget to use different variable names for the results with each set of post-stratification weights.)

        I think you have misunderstood the Stata 13 manual section you quote. In that particular data set and with that particular -svyset- command, bootstrap resampling weights were specified, so under those circumstances, Stata defaults to bootstrapping for the CI's. But in your data, if you do not have bootstrap resampling weights, you will not get bootstrap sampling by default; in fact you will not be able to get it at all. If you do have bootstrap resampling weights, the number of bootstrap samples is the same as the number of bootstrap resampling weight variables.



        Comment


        • #5
          Thank you for clarifying all of this - one last question: How are the CIs then calculated? Does - svy: mean - assume that my data is nornmally distributed?
          Best,
          Stephanie

          Comment


          • #6
            No, Stata is definitely not assuming normality. You can see that in the results from the code in #4: if you calculate mean + 1.96 standard_error (the result you would get assuming normality) it comes out different from the confidence limits it gives you.

            But I cannot find in the manual any statement of what the method of calculating the standard errors actually is. I imagine it is the delta-method, just because that is the usual default with svy-based estimation. But I'm not completely certain.

            Comment


            • #7
              Thanks Clyde - all your comments were very helpful.

              Comment


              • #8
                Clyde is correct in saying that the default method for calculating survey standard errors is the delta method. In the survey literature (and the manual) the method is called linearization. There is a section in the Survey Manual called "variance estimation", which shows some of the formulas. The important fact is that survey variance estimation, whether by linearization or by replication, is based on variation between primary sampling units (PSUs), not on variation in an assumed theoretical distribution. The name for this approach is "design-based", as opposed to "model-based".
                Steve Samuels
                Statistical Consulting
                [email protected]

                Stata 14.2

                Comment


                • #9
                  Thank you Steve!

                  Comment


                  • #10
                    I'm trying to create a graph of mean age at first marriage with confidence intervals by sample. I think the code Clyde has provided is just what I need, but I'm having trouble getting it to run.

                    I'm working with survey data that used a two-stage cluster sampling design. Thus, to get correct confidence intervals, I have to use svyset.

                    I replaced the following variables names: "race" with "sample" and "age" with "agefrstmar".

                    The code fails at the line: local rr: colnumb. I get the message "colnumb not allowed."

                    Can someone tell me why my code is failing?

                    Also, can you help me with the graph syntax? I think what I've got may only work on collapsed data.

                    Code:
                    svyset, clear
                    
                    svyset [pw=perweight], psu(idhspsu) strata(idhsstrata)
                    
                    svy: mean agefrstmar, over(sample)
                    matrix M = r(table)
                    matrix list M
                    
                    gen mean = .
                    gen lower_ci = .
                    gen upper_ci = .
                    
                    levelsof sample, local(samples)
                    foreach r of local samples {
                        local samplelabel: label (sample) `r'
                        local rr: colnumb M "agefrstmar:`samplelabel'"
                        replace mean  = M[1, `rr'] if sample == `r'
                        replace lower_ci = M[5, `rr'] if sample == `r'
                        replace upper_ci = M[6, `rr'] if sample == `r'
                    }
                    
                    graph twoway (bar mean sample)(rcap upper_ci lower_ci sample)

                    Comment


                    • #11
                      I cannot reproduce your problem. You did not provide example data that demonstrates it (though I doubt the data is the issue here). So I used one of the Stata on-line survey data sets and renamed some variables to test it:

                      Code:
                      . webuse nhanes2, clear
                      
                      . 
                      . rename age agefrstmar
                      
                      . rename race sample
                      
                      . 
                      . svy: mean agefrstmar, over(sample)
                      (running mean on estimation sample)
                      
                      Survey: Mean estimation
                      
                      Number of strata =      31      Number of obs   =       10,351
                      Number of PSUs   =      62      Population size =  117,157,513
                                                      Design df       =           31
                      
                              White: sample = White
                              Black: sample = Black
                              Other: sample = Other
                      
                      --------------------------------------------------------------
                                   |             Linearized
                              Over |       Mean   Std. Err.     [95% Conf. Interval]
                      -------------+------------------------------------------------
                      agefrstmar   |
                             White |   42.52564   .3203993      41.87218     43.1791
                             Black |   40.21349   .5807288      39.02908    41.39789
                             Other |   40.46675    2.36883      35.63549    45.29801
                      --------------------------------------------------------------
                      
                      . matrix M = r(table)
                      
                      . matrix list M
                      
                      M[9,3]
                               agefrstmar:  agefrstmar:  agefrstmar:
                                    White        Black        Other
                           b    42.525637    40.213487    40.466751
                          se    .32039929    .58072875    2.3688303
                           t      132.727    69.246593     17.08301
                      pvalue    2.790e-44    1.494e-35    2.514e-17
                          ll    41.872179    39.029083     35.63549
                          ul    43.179096    41.397891    45.298013
                          df           31           31           31
                        crit    2.0395134    2.0395134    2.0395134
                       eform            0            0            0
                      
                      . 
                      . gen mean = .
                      (10,351 missing values generated)
                      
                      . gen lower_ci = .
                      (10,351 missing values generated)
                      
                      . gen upper_ci = .
                      (10,351 missing values generated)
                      
                      . 
                      . levelsof sample, local(samples)
                      1 2 3
                      
                      . foreach r of local samples {
                        2.     local samplelabel: label (sample) `r'
                        3.     local rr: colnumb M "agefrstmar:`samplelabel'"
                        4.     replace mean  = M[1, `rr'] if sample == `r'
                        5.     replace lower_ci = M[5, `rr'] if sample == `r'
                        6.     replace upper_ci = M[6, `rr'] if sample == `r'
                        7. }
                      (9,065 real changes made)
                      (9,065 real changes made)
                      (9,065 real changes made)
                      (1,086 real changes made)
                      (1,086 real changes made)
                      (1,086 real changes made)
                      (200 real changes made)
                      (200 real changes made)
                      (200 real changes made)
                      
                      . 
                      . graph twoway (bar mean sample)(rcap upper_ci lower_ci sample)
                      Are you sure that the code you show is exactly what you ran?

                      Another possibility comes to my mind. When you copy code from web-based sources (including Statalist) or word processing documents, sometimes it gets contaminated with non-printing characters that our eyes do not see, but Stata does. Thus what you and I are reading as colnumb Stata may be seeing as something different. If so, deleting that line of code and re-typing it in by hand from the keyboard will solve the problem. (But if this is what has happened, don't be surprised if the same problem pops up again later in the code. The solution will be the same.)

                      Comment


                      • #12
                        Thanks for the response, Clyde.

                        I tried retyping the code--that did not make any difference.

                        Here's what my output looks like:
                        Code:
                        . svyset, clear
                        
                        .
                        . svyset [pw=perweight], psu(idhspsu) strata(idhsstrata)
                        
                              pweight: perweight
                                  VCE: linearized
                          Single unit: missing
                             Strata 1: idhsstrata
                                 SU 1: idhspsu
                                FPC 1: <zero>
                        
                        .
                        . svy: mean agefrstmar, over(sample)
                        (running mean on estimation sample)
                        
                        Survey: Mean estimation
                        
                        Number of strata =     840        Number of obs   =     34,633
                        Number of PSUs   =   2,086        Population size = 35,163.791
                                                          Design df       =      1,246
                        
                            _subpop_1: sample = rwanda 1992
                            _subpop_2: sample = rwanda 2000
                            _subpop_3: sample = rwanda 2005
                            _subpop_4: sample = rwanda 2010
                            _subpop_5: sample = rwanda 2014
                        
                        --------------------------------------------------------------
                                     |             Linearized
                                Over |       Mean   Std. Err.     [95% Conf. Interval]
                        -------------+------------------------------------------------
                        agefrstmar   |
                           _subpop_1 |   19.45068   .0857405      19.28246    19.61889
                           _subpop_2 |   19.84861   .0501684      19.75019    19.94704
                           _subpop_3 |   19.99498   .0487283      19.89938    20.09058
                           _subpop_4 |   20.50262   .0528505      20.39894    20.60631
                           _subpop_5 |   20.92382   .0505139      20.82472    21.02292
                        --------------------------------------------------------------
                        
                        . matrix M = r(table)
                        
                        . matrix list M
                        
                        M[9,5]
                                 agefrstmar:  agefrstmar:  agefrstmar:  agefrstmar:  agefrstmar:
                                  _subpop_1    _subpop_2    _subpop_3    _subpop_4    _subpop_5
                             b    19.450675    19.848611    19.994977    20.502623    20.923818
                            se    .08574054    .05016842    .04872834    .05285046    .05051393
                             t    226.85506    395.63953    410.33569    387.93652    414.21876
                        pvalue            0            0            0            0            0
                            ll    19.282463    19.750188    19.899378    20.398937    20.824717
                            ul    19.618887    19.947035    20.090575    20.606308     21.02292
                            df         1246         1246         1246         1246         1246
                          crit    1.9618697    1.9618697    1.9618697    1.9618697    1.9618697
                         eform            0            0            0            0            0
                        
                        .
                        . gen mean = .
                        (55,461 missing values generated)
                        
                        . gen lower_ci = .
                        (55,461 missing values generated)
                        
                        . gen upper_ci = .
                        (55,461 missing values generated)
                        
                        .
                        . levelsof sample, local(samples)
                        64601 64602 64603 64604 64605
                        
                        . foreach r of local samples {
                          2.     local samplelabel: label (sample) `r'
                          3.         local rr: colnumb M "agefrstmar:`samplelabel'"
                          4.     replace mean  = M[1, `rr'] if sample == `r'
                          5.     replace lower_ci = M[5, `rr'] if sample == `r'
                          6.     replace upper_ci = M[6, `rr'] if sample == `r'
                          7. }
                        colnumb not allowed
                        r(101);
                        Perhaps it does have something to do with the data. One thing I notice is that your output displays the race labels, White, Black, Other, while my output names each sample _subpop_1, etc. Could that be a clue into the problem?

                        I notice my "sample" variable is "long" type, which makes it unique from the other variables in the file, which are mostly "byte" types. Stata won't allow me to convert it to byte, which isn't surprising. Not clear to me why this would matter though. It still has discrete values.

                        Comment


                        • #13
                          Oh, I think I see the problem. Your sample variable is very different from the variable that plays the analogous role in my code, because its labeled values are not legal names. I think the problem here is that Stata is seeing things like
                          Code:
                          local rr: colnumb M "agefrstmar:64601"
                          and instead of telling you that there is no column in the data named agefrstmar:64601 it is giving you a confusing error message that -colunumb- is not allowed (which his not true!)

                          So you need to change the code a bit to accommodate the actual column labeling in your matrix M.

                          So replace everything from -levelsof sample, local(samples)- on with:
                          Code:
                          local ncols: colsof M
                          forvalues rr = 1/`ncols' {
                              replace mean  = M[1, `rr'] if sample == 6640`rr'
                              replace lower_ci = M[5, `rr'] if sample == 6640`rr'
                              replace upper_ci = M[6, `rr'] if sample == 6640`rr'
                          }
                          and I think it will run.

                          Comment


                          • #14
                            Unfortunately, that also did not work. I get essentially the same message, this time that colsof not allowed. Just to check, this is how I revised the code:

                            Code:
                            levelsof sample, local(samples)
                            local ncols: colsof M
                            forvalues rr = 1/`ncols' {
                                replace mean  = M[1, `rr'] if sample == 6640`rr'
                                replace lower_ci = M[5, `rr'] if sample == 6640`rr'
                                replace upper_ci = M[6, `rr'] if sample == 6640`rr'
                            }
                            The value labels for sample are country years, e.g., "Rwanda 1992", "Rwanda 2000". What makes labels illegal? I've tried generating a new sample variable with new labels (no spaces, underscores instead of spaces, etc.) That's not working either. I continue to get _subpop_# in the column headings instead of the value labels.

                            I very much appreciate your help with this.

                            Comment


                            • #15
                              There's nothing illegal about value labels. But "Rwanda 1992" can't be a column label in a matrix because it contains a blank space."

                              I have found the problem--but not the solution. The following code (which changes the last three lines from what was shown in #14, but those changes are after the error message you are getting) runs without error, on my computer as you can see:

                              Code:
                              . about
                              
                              Stata/MP 15.1 for Windows (64-bit x86-64)
                              Revision 21 Mar 2019
                              Copyright 1985-2017 StataCorp LLC
                              
                              Total physical memory:     8269900 KB
                              Available physical memory: 2174316 KB
                              
                              Single-user 2-core Stata perpetual license:
                                     Serial number:  REDACTED
                                       Licensed to:  Clyde Schechter
                                                     Albert Einstein College of Medicine
                              
                              
                              
                              . webuse nhanes2, clear
                              
                              . 
                              . clonevar sample = race
                              
                              . label define sample     1       "Rwanda 1992" ///
                              >                                         2       "Rwanda 2000"   ///
                              >                                         3       "Rwanda 2005"
                              
                              . label values sample sample
                              
                              . tab sample
                              
                                 1=white, |
                                 2=black, |
                                  3=other |      Freq.     Percent        Cum.
                              ------------+-----------------------------------
                              Rwanda 1992 |      9,065       87.58       87.58
                              Rwanda 2000 |      1,086       10.49       98.07
                              Rwanda 2005 |        200        1.93      100.00
                              ------------+-----------------------------------
                                    Total |     10,351      100.00
                              
                              . 
                              . rename age agefrstmar
                              
                              . 
                              . svy: mean agefrstmar, over(sample)
                              (running mean on estimation sample)
                              
                              Survey: Mean estimation
                              
                              Number of strata =      31      Number of obs   =       10,351
                              Number of PSUs   =      62      Population size =  117,157,513
                                                              Design df       =           31
                              
                                  _subpop_1: sample = Rwanda 1992
                                  _subpop_2: sample = Rwanda 2000
                                  _subpop_3: sample = Rwanda 2005
                              
                              --------------------------------------------------------------
                                           |             Linearized
                                      Over |       Mean   Std. Err.     [95% Conf. Interval]
                              -------------+------------------------------------------------
                              agefrstmar   |
                                 _subpop_1 |   42.52564   .3203993      41.87218     43.1791
                                 _subpop_2 |   40.21349   .5807288      39.02908    41.39789
                                 _subpop_3 |   40.46675    2.36883      35.63549    45.29801
                              --------------------------------------------------------------
                              
                              . 
                              . matrix M = r(table)
                              
                              . matrix list M
                              
                              M[9,3]
                                       agefrstmar:  agefrstmar:  agefrstmar:
                                        _subpop_1    _subpop_2    _subpop_3
                                   b    42.525637    40.213487    40.466751
                                  se    .32039929    .58072875    2.3688303
                                   t      132.727    69.246593     17.08301
                              pvalue    2.790e-44    1.494e-35    2.514e-17
                                  ll    41.872179    39.029083     35.63549
                                  ul    43.179096    41.397891    45.298013
                                  df           31           31           31
                                crit    2.0395134    2.0395134    2.0395134
                               eform            0            0            0
                              
                              . 
                              . local ncols: colsof M
                              
                              . gen mean = .
                              (10,351 missing values generated)
                              
                              . gen lower_ci = .
                              (10,351 missing values generated)
                              
                              . gen upper_ci = .
                              (10,351 missing values generated)
                              
                              . forvalues rr = 1/`ncols' {
                                2.     replace mean  = M[1, `rr'] if sample == `rr'
                                3.     replace lower_ci = M[5, `rr'] if sample == `rr'
                                4.     replace upper_ci = M[6, `rr'] if sample == `rr'
                                5. }
                              (9,065 real changes made)
                              (9,065 real changes made)
                              (9,065 real changes made)
                              (1,086 real changes made)
                              (1,086 real changes made)
                              (1,086 real changes made)
                              (200 real changes made)
                              (200 real changes made)
                              (200 real changes made)
                              
                              .
                              But I ported the code to another machine that runs version 13, and I get the same error that you do. Apparently local macro extended functions colsof and colnumb were not implemented back then. I'm sure there was some way to do the same things, but I don't remember what they were. My best guess is that either these macro extended functions existed under other names, or perhaps there were matrix functions that returned these things back then. I suggest you review -help extended_fcn- and -help matrix functions- to see if you can find what the version 13 analogs of these were.

                              Comment

                              Working...
                              X