
  • svy: versus [pw = ] when only pweights are available

    Suppose your data set has pweights but no stratification or clustering. Is there any compelling reason to svyset the data and use svy:, as opposed to just adding [pw = whatever] to each estimation command?

    It seems like the latter will let you do a few more things, e.g. lrtest will work with the force option. I am not sure that is a good thing though. I sometimes wonder if [pw=whatever] is too permissive or if svy: is too strict when only pweights are used.

    Also, I know that with svy: you are supposed to use the subpop option rather than an if qualifier when analyzing subgroups -- so I assume that would be a reason for using svy: rather than using [pw = whatever].

    My own inclination is to svyset and then use svy: just because it is less typing. But, are there stronger arguments, either way, for using one or the other?

    I have a feeling this has been asked before (or worse yet, that I have asked it) but I am not finding an answer easily.
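    For concreteness, the two approaches being compared would look something like this (a sketch only; y, x1, x2, and the pweight variable wt are placeholders):

    Code:
    . svyset [pweight=wt]
    . svy: regress y x1 x2

    versus

    Code:
    . regress y x1 x2 [pw=wt]

    With no strata or clusters declared, the point estimates are identical either way, and the standard errors differ only through a small degrees-of-freedom factor (compare the two outputs in #13 below).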
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    Stata Version: 17.0 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam

  • #2
    My understanding is that if the survey design does not involve stratification and clustering, i.e., weights are the only design characteristic, then the subpop option versus if qualifier issue is moot. The issue with subpop is to recognize the full survey design when only a subset of psu's enter the analysis but in this case there are no psu's or strata. See West, B. T., P. A. Berglund, and S. G. Heeringa. 2008. "A closer examination of subpopulation analysis of complex-sample survey data." Stata Journal 8: 520–531.
    Richard T. Campbell
    Emeritus Professor of Biostatistics and Sociology
    University of Illinois at Chicago



    • #3
      Thanks Dick. Steve Samuels and Austin Nichols have been nice enough to explain this svy subpop thing to me probably 10 times and I still don't quite get it. But what you say seems reasonable. If so, it seems more and more like the choice between svy: and [pw = whatever] is just a matter of personal preference. Unless there is some other complication I am overlooking.
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology



      • #4
        Here is a simple, if a little silly, example to show that -subpop()- is different from -if-
        even when there are no strata or clusters.

        First we load the auto dataset and use svyset to identify individual observations
        as the PSUs. Thus we are not weighted, not stratified, and not clustered.

        Code:
        . sysuse auto
        (1978 Automobile Data)
        
        . svyset _n
        
              pweight: <none>
                  VCE: linearized
          Single unit: missing
             Strata 1: <one>
                 SU 1: <observations>
                FPC 1: <zero>
        Here is a survey linear regression using an if clause.

        Code:
        . svy: regress mpg turn if for
        (running regress on estimation sample)
        
        Survey: Linear regression
        
        Number of strata   =         1                  Number of obs      =        22
        Number of PSUs     =        22                  Population size    =        22
                                                        Design df          =        21
                                                        F(   1,     21)    =     36.41
                                                        Prob > F           =    0.0000
                                                        R-squared          =    0.3888
        
        ------------------------------------------------------------------------------
                     |             Linearized
                 mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                turn |  -2.746398   .4551706    -6.03   0.000    -3.692977   -1.799819
               _cons |   122.0202   16.05044     7.60   0.000     88.64146    155.3989
        ------------------------------------------------------------------------------
        Now we use the same subpopulation identifier within the subpop() option.

        Code:
        . svy, subpop(for) : regress mpg turn
        (running regress on estimation sample)
        
        Survey: Linear regression
        
        Number of strata   =         1                  Number of obs      =        74
        Number of PSUs     =        74                  Population size    =        74
                                                        Subpop. no. of obs =        22
                                                        Subpop. size       =        22
                                                        Design df          =        73
                                                        F(   1,     73)    =     37.62
                                                        Prob > F           =    0.0000
                                                        R-squared          =    0.3888
        
        ------------------------------------------------------------------------------
                     |             Linearized
                 mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                turn |  -2.746398   .4477411    -6.13   0.000    -3.638744   -1.854051
               _cons |   122.0202   15.78846     7.73   0.000     90.55382    153.4865
        ------------------------------------------------------------------------------
        Notice that the point estimates are the same, as we expect.

        The standard errors, however, are different, and that difference carries
        through to the value of the test statistic.

        The design degrees of freedom are also different. This is a big deal
        because the degrees of freedom affect the p-values and confidence limits
        over and above the effect of the different standard errors.

        These two scenarios do not yield the same results, so choose your analysis with
        this in mind.



        • #5
          Thanks Jeff. But this is making my head hurt again. The way I would normally do this is

          Code:
          . regress mpg turn if foreign
          
                Source |       SS       df       MS              Number of obs =      22
          -------------+------------------------------           F(  1,    20) =   12.72
                 Model |  356.906864     1  356.906864           Prob > F      =  0.0019
              Residual |  560.956772    20  28.0478386           R-squared     =  0.3888
          -------------+------------------------------           Adj R-squared =  0.3583
                 Total |  917.863636    21  43.7077922           Root MSE      =   5.296
          
          ------------------------------------------------------------------------------
                   mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                  turn |  -2.746398   .7699024    -3.57   0.002    -4.352386   -1.140409
                 _cons |   122.0202   27.28492     4.47   0.000     65.10483    178.9355
          ------------------------------------------------------------------------------
          But that doesn't match up with either svy command you gave, although I come close (but not quite) with

          Code:
          . regress mpg turn if foreign, vce(robust)
          
          Linear regression                                      Number of obs =      22
                                                                 F(  1,    20) =   34.67
                                                                 Prob > F      =  0.0000
                                                                 R-squared     =  0.3888
                                                                 Root MSE      =   5.296
          
          ------------------------------------------------------------------------------
                       |               Robust
                   mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                  turn |  -2.746398   .4664111    -5.89   0.000    -3.719314   -1.773481
                 _cons |   122.0202   16.44681     7.42   0.000     87.71274    156.3276
          ------------------------------------------------------------------------------
          Anyway, I am not sure what to do. If I have svyset the data, then I am still semi-confident that I should use subpop rather than if for subsample selection. But I am not sure what the answer is to my original question about using [pw = whatever] vs svy:
          -------------------------------------------
          Richard Williams, Notre Dame Dept of Sociology



          • #6
            Maybe this helps a bit: below is how to replicate the survey linear regression using suest. I think the reason plain regress, vce(robust) doesn't match is that it automatically applies the standard OLS small-sample degrees-of-freedom adjustment.

            Code:
            . svy: regress mpg turn if foreign
            (running regress on estimation sample)
            
            Survey: Linear regression
            
            Number of strata   =         1                  Number of obs      =        22
            Number of PSUs     =        22                  Population size    =        22
                                                            Design df          =        21
                                                            F(   1,     21)    =     36.41
                                                            Prob > F           =    0.0000
                                                            R-squared          =    0.3888
            
            ------------------------------------------------------------------------------
                         |             Linearized
                     mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                    turn |  -2.746398   .4551706    -6.03   0.000    -3.692977   -1.799819
                   _cons |   122.0202   16.05044     7.60   0.000     88.64146    155.3989
            ------------------------------------------------------------------------------

            Code:
            . qui regress mpg turn if foreign
            
            . suest .
            
            Robust results for .                              Number of obs   =         22
            
            ------------------------------------------------------------------------------
                         |               Robust
                         |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
            mean         |
                    turn |  -2.746398   .4551706    -6.03   0.000    -3.638516    -1.85428
                   _cons |   122.0202   16.05044     7.60   0.000     90.56189    153.4785
            -------------+----------------------------------------------------------------
            lnvar        |
                   _cons |   3.333912   .4099026     8.13   0.000     2.530517    4.137306
            ------------------------------------------------------------------------------



            • #7
              Actually, maybe this is more transparent. In both cases, multiplying the regress, vce(robust) SE by sqrt(20/22) removes the aforementioned small-sample degrees-of-freedom adjustment. The first example then replicates the SE of the survey linear regression; the second replicates the SE obtained with the subpop() option.

              Code:
              . qui svy: regress mpg turn if foreign
              
              . 
              . di _se[turn]
              .45517062
              
              . 
              . qui regress mpg turn if foreign, rob
              
              . di _se[turn] * sqrt(20/22) * sqrt(22/21)
              .45517062

              Code:
              . qui svy, subpop(for) : regress mpg turn
              
              . 
              . di _se[turn]
              .44774109
              
              . 
              . qui regress mpg turn if foreign, rob
              
              . di _se[turn] * sqrt(20/22) * sqrt(74/73)
              .44774109



              • #8
                Thanks Mark. For what I assume are similar reasons, glm will also produce the same results:

                Code:
                . glm mpg turn if foreign, vce(robust)
                
                Iteration 0:   log pseudolikelihood = -66.841263  
                
                Generalized linear models                          No. of obs      =        22
                Optimization     : ML                              Residual df     =        20
                                                                   Scale parameter =  28.04784
                Deviance         =  560.9567723                    (1/df) Deviance =  28.04784
                Pearson          =  560.9567723                    (1/df) Pearson  =  28.04784
                
                Variance function: V(u) = 1                        [Gaussian]
                Link function    : g(u) = u                        [Identity]
                
                                                                   AIC             =  6.258297
                Log pseudolikelihood = -66.84126307                BIC             =  499.1359
                
                ------------------------------------------------------------------------------
                             |               Robust
                         mpg |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                        turn |  -2.746398   .4551706    -6.03   0.000    -3.638516    -1.85428
                       _cons |   122.0202   16.05044     7.60   0.000     90.56189    153.4785
                ------------------------------------------------------------------------------
                
                . test turn
                
                 ( 1)  [mpg]turn = 0
                
                           chi2(  1) =   36.41
                         Prob > chi2 =    0.0000
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology



                • #9
                  To address Richard's original question:

                  If I have svyset the data, then I am still semi-confident that I should use subpop rather than if for subsample selection.
                  But I am not sure what the answer is to my original question about using [pw = whatever] vs svy:
                  In the Remarks section of the manual entry [SVY] svy estimation there is an
                  overview of survey analysis that details what it means when you use the
                  svy prefix. For example, it details the difference between regress
                  and svy: regress. See page 92 of the Survey Data manual in Stata 13.


                  Subpopulation estimation in a nutshell:

                  The subpop() option is there to handle the common occurrence of subpopulation estimation with survey data.
                  The basic principle of subpopulation estimation is that the survey design is fixed, so in order to account for
                  sample-to-sample variation, we must assume that future samples will be taken using the current survey design.
                  Even if the overall sample size is fixed in repeated samples, this method accounts for the fact that the
                  subpopulation sample size is a random quantity.

                  With the if qualifier, we assume that we can ignore the out-of-subpopulation units when estimating the standard
                  errors. This breaks the assumption that the survey design is fixed. A fundamental consequence of using this
                  method is that you are treating the subpopulation sample size as fixed for variance estimation.
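
                  As a syntax note, subpop() accepts an if qualifier as well as an indicator variable, so the subpop() command in #4 could also have been written as (a sketch, equivalent because foreign is a 0/1 indicator):

                  Code:
                  . svy, subpop(if foreign): regress mpg turn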



                  • #10
                    Thanks to everybody for this very interesting discussion! Just to make sure I understand correctly:

                    I thought the basic reason for using subpop was this: "If the data set is subset, meaning that observations not to be included in the subpopulation are deleted from the data set, the standard errors of the estimates cannot be calculated correctly. When the subpopulation option(s) is used, only the cases defined by the subpopulation are used in the calculation of the estimate, but all cases are used in the calculation of the standard errors." (source: https://stats.idre.ucla.edu/stata/fa...data-in-stata/ )
                    I have to admit I'm not a hundred percent sure what "cannot be calculated correctly" amounts to, but I guess the regressions above are good examples.

                    So, how does that tie in with what Jeff writes, i.e., that this method "accounts for the fact that the subpop sample size is now a random entity" and that a "fundamental consequence of using this method [the if clause] is that you are fixing the subpop sample size for variance estimation"? I am just not a hundred percent sure what that means with regard to survey data. (Sorry if this is basic; I would appreciate some help/explanation nonetheless.)

                    And, Mark, it was really interesting to see how you reproduced the svy subpop results with
                    Code:
                    . qui regress mpg turn if foreign, rob

                    . di _se[turn] * sqrt(20/22) * sqrt(74/73)
                    .44774109
                    Am I right in assuming that 20 is the degrees of freedom of the simple regression, whereas 22 is the number of PSUs of the svy subpop regression? If yes, can somebody give me a quick clue as to the statistical rationale behind using sqrt(20/22) and sqrt(74/73) to reproduce the svy subpop results?



                    • #11
                      Maybe another interesting difference, for multilevel models, is something Steve Samuels says in this other discussion:

                      Originally posted by Steve Samuels View Post
                      ... after svyset, Stata expects you to include the PSU as a level in the multilevel model and to specify the PSU design weight (1/(selection probability of the PSU)).
                      If [pweight=some weight] can be used in multilevel models without specifying a further level for PSUs, that would be quite a reduction in complexity. But I might just as well be mistaken!



                      • #12
                        Dear all, thank you very much! I have learned a lot from your daily posts.
                        With best wishes,
                        Hassen



                        • #13
                          Hi Everyone,

                          I'm really late to the party on this one, but I had a similar question to what Richard initially asked:

                          Is there any compelling reason to svyset the data and use svy:, as opposed to just adding [pw = whatever] to each estimation command?
                          With respect to the Canadian Community Health Survey, Statistics Canada has a complex weighting system that folds all information about strata, clusters, primary sampling units, etc. into a single person-level master weight <wts_m>. This is the only weighting information provided, and it is meant to "debias" the eventual estimates. When using svy:, there is a slight change in the df relative to regress with weights; however, Stata now assumes that the number of PSUs equals the sample size, which is far from correct.

                          Code:
                          . svy: regress inc sex age-dmar3 
                          (running regress on estimation sample)
                          
                          Survey: Linear regression
                          
                          Number of strata   =         1                 Number of obs     =      22,929
                          Number of PSUs     =    22,929                 Population size   =  25,858,251
                                                                         Design df         =      22,928
                                                                         F(   5,  22924)   =      174.08
                                                                         Prob > F          =      0.0000
                                                                         R-squared         =      0.1086
                          
                          ------------------------------------------------------------------------------
                                       |             Linearized
                                   inc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                          -------------+----------------------------------------------------------------
                                   sex |   .4384666   .0637057     6.88   0.000     .3135992     .563334
                                   age |  -.1105906    .009394   -11.77   0.000    -.1290034   -.0921777
                              minority |  -1.517119   .0874858   -17.34   0.000    -1.688598   -1.345641
                                 dmar2 |  -1.395568   .0983552   -14.19   0.000    -1.588351   -1.202785
                                 dmar3 |  -1.331934   .0839483   -15.87   0.000    -1.496478    -1.16739
                                 _cons |   6.855946   .0868685    78.92   0.000     6.685678    7.026214
                          ------------------------------------------------------------------------------
                          
                          . regress inc sex age-dmar3 [pw=wts_m]
                          (sum of wgt is 25,858,250.63)
                          
                          Linear regression                               Number of obs     =     22,929
                                                                          F(5, 22923)       =     174.07
                                                                          Prob > F          =     0.0000
                                                                          R-squared         =     0.1086
                                                                          Root MSE          =     2.7095
                          
                          ------------------------------------------------------------------------------
                                       |               Robust
                                   inc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                          -------------+----------------------------------------------------------------
                                   sex |   .4384666   .0637126     6.88   0.000     .3135856    .5633476
                                   age |  -.1105906    .009395   -11.77   0.000    -.1290054   -.0921757
                              minority |  -1.517119   .0874953   -17.34   0.000    -1.688616   -1.345623
                                 dmar2 |  -1.395568    .098366   -14.19   0.000    -1.588372   -1.202764
                                 dmar3 |  -1.331934   .0839574   -15.86   0.000    -1.496496   -1.167372
                                 _cons |   6.855946    .086878    78.91   0.000     6.685659    7.026233
                          ------------------------------------------------------------------------------
                          As noted above in this thread, the "best practices" question is complicated when discussing subsample selections, because "if" and "subpop()" produce different standard errors.
                          1. Is svyset necessary, or is specifying only the pw option acceptable with the entire sample?
                          2. Is svyset necessary, or is specifying only the pw option acceptable when looking at subsamples?
                            • Large datasets always have at least some missing data, and those observations are automatically excluded by <regress>. Will their exclusion create an artificial subsample within a model, even though no subsample/if option was specified?
                          Basically, if you use svyset, Stata will make incorrect assumptions about the structure of your data, and if you don't svyset, subsample estimates may be mishandled, so I'm not sure how to proceed.
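
                          For reference, the weight-only declaration that produces the svy output above is presumably something like this (an assumption; the exact svyset call is not shown in the post):

                          Code:
                          . svyset [pweight=wts_m]

                          With no PSU variable specified, this treats each observation as its own PSU, hence the PSU count equal to the sample size.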

                          Thanks everyone!

                          Cheers,

                          David.

