
  • svy: versus [pw = ] when only pweights are available

    Suppose your data set has pweights but no stratification or clustering. Is there any compelling reason to svyset the data and use svy:, as opposed to just adding [pw = whatever] to each estimation command?

    It seems like the latter will let you do a few more things, e.g. lrtest will work with the force option. I am not sure that is a good thing though. I sometimes wonder if [pw=whatever] is too permissive or if svy: is too strict when only pweights are used.

    Also, I know that with svy: you are supposed to use the subpop option rather than an if qualifier when analyzing subgroups -- so I assume that would be a reason for using svy: rather than using [pw = whatever].

    My own inclination is to svyset and then use svy: just because it is less typing. But, are there stronger arguments, either way, for using one or the other?

    I have a feeling this has been asked before (or worse yet, that I have asked it) but I am not finding an answer easily.
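    For concreteness, the two approaches being compared would look something like this (a sketch only; y, x1, x2, and the pweight variable wt are placeholders):

    Code:
    . svyset [pweight=wt]
    . svy: regress y x1 x2

    versus

    Code:
    . regress y x1 x2 [pw=wt]

    With no strata or clusters declared, the point estimates are identical either way, and the standard errors differ only through a small degrees-of-freedom factor (compare the two outputs in #13 below).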
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    Stata Version: 17.0 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam

  • #2
    My understanding is that if the survey design does not involve stratification and clustering, i.e., weights are the only design characteristic, then the subpop option versus if qualifier issue is moot. The issue with subpop is to recognize the full survey design when only a subset of psu's enter the analysis but in this case there are no psu's or strata. See West, B. T., P. A. Berglund, and S. G. Heeringa. 2008. "A closer examination of subpopulation analysis of complex-sample survey data." Stata Journal 8: 520–531.
    Richard T. Campbell
    Emeritus Professor of Biostatistics and Sociology
    University of Illinois at Chicago



    • #3
      Thanks Dick. Steve Samuels and Austin Nichols have been nice enough to explain this svy subpop thing to me probably 10 times and I still don't quite get it. But what you say seems reasonable. If so, it seems more and more like the choice between svy: and [pw = whatever] is just a matter of personal preference. Unless there is some other complication I am overlooking.
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology



      • #4
        Here is a simple, if a little silly, example to show that -subpop()- is different from -if-
        even when there are no strata or clusters.

        First we load the auto dataset and use svyset to identify individual observations
        as the PSUs. Thus we are not weighted, not stratified, and not clustered.

        Code:
        . sysuse auto
        (1978 Automobile Data)
        
        . svyset _n
        
              pweight: <none>
                  VCE: linearized
          Single unit: missing
             Strata 1: <one>
                 SU 1: <observations>
                FPC 1: <zero>
        Here is a survey linear regression using an if clause.

        Code:
        . svy: regress mpg turn if for
        (running regress on estimation sample)
        
        Survey: Linear regression
        
        Number of strata   =         1                  Number of obs      =        22
        Number of PSUs     =        22                  Population size    =        22
                                                        Design df          =        21
                                                        F(   1,     21)    =     36.41
                                                        Prob > F           =    0.0000
                                                        R-squared          =    0.3888
        
        ------------------------------------------------------------------------------
                     |             Linearized
                 mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                turn |  -2.746398   .4551706    -6.03   0.000    -3.692977   -1.799819
               _cons |   122.0202   16.05044     7.60   0.000     88.64146    155.3989
        ------------------------------------------------------------------------------
        Now we use the same subpopulation identifier within the subpop() option.

        Code:
        . svy, subpop(for) : regress mpg turn
        (running regress on estimation sample)
        
        Survey: Linear regression
        
        Number of strata   =         1                  Number of obs      =        74
        Number of PSUs     =        74                  Population size    =        74
                                                        Subpop. no. of obs =        22
                                                        Subpop. size       =        22
                                                        Design df          =        73
                                                        F(   1,     73)    =     37.62
                                                        Prob > F           =    0.0000
                                                        R-squared          =    0.3888
        
        ------------------------------------------------------------------------------
                     |             Linearized
                 mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                turn |  -2.746398   .4477411    -6.13   0.000    -3.638744   -1.854051
               _cons |   122.0202   15.78846     7.73   0.000     90.55382    153.4865
        ------------------------------------------------------------------------------
        Notice that the point estimates are the same, as we expect.

        The standard errors, however, are different, and that difference carries
        through to the value of the test statistic.

        The design degrees of freedom are also different. This is a big deal
        because the degrees of freedom affect the p-values and confidence limits
        over and above the effect of the different standard errors.

        These two scenarios do not yield the same results, so choose your analysis with
        this in mind.



        • #5
          Thanks Jeff. But this is making my head hurt again. The way I would normally do this is

          Code:
          . regress mpg turn if foreign
          
                Source |       SS       df       MS              Number of obs =      22
          -------------+------------------------------           F(  1,    20) =   12.72
                 Model |  356.906864     1  356.906864           Prob > F      =  0.0019
              Residual |  560.956772    20  28.0478386           R-squared     =  0.3888
          -------------+------------------------------           Adj R-squared =  0.3583
                 Total |  917.863636    21  43.7077922           Root MSE      =   5.296
          
          ------------------------------------------------------------------------------
                   mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                  turn |  -2.746398   .7699024    -3.57   0.002    -4.352386   -1.140409
                 _cons |   122.0202   27.28492     4.47   0.000     65.10483    178.9355
          ------------------------------------------------------------------------------
          But that doesn't match up with either svy command you gave, although I come close (but not quite) with

          Code:
          . regress mpg turn if foreign, vce(robust)
          
          Linear regression                                      Number of obs =      22
                                                                 F(  1,    20) =   34.67
                                                                 Prob > F      =  0.0000
                                                                 R-squared     =  0.3888
                                                                 Root MSE      =   5.296
          
          ------------------------------------------------------------------------------
                       |               Robust
                   mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                  turn |  -2.746398   .4664111    -5.89   0.000    -3.719314   -1.773481
                 _cons |   122.0202   16.44681     7.42   0.000     87.71274    156.3276
          ------------------------------------------------------------------------------
          Anyway, I am not sure what to do. If I have svyset the data, then I am still semi-confident that I should use subpop rather than if for subsample selection. But I am not sure what the answer is to my original question about using [pw = whatever] vs svy:
          -------------------------------------------
          Richard Williams, Notre Dame Dept of Sociology



          • #6
            Maybe this helps a bit: below is how to replicate the survey linear regression using suest. I think the reason plain regress, vce(robust) doesn't match is that it automatically applies the standard OLS small-sample degrees-of-freedom adjustment.

            Code:
            . svy: regress mpg turn if foreign
            (running regress on estimation sample)
            
            Survey: Linear regression
            
            Number of strata   =         1                  Number of obs      =        22
            Number of PSUs     =        22                  Population size    =        22
                                                            Design df          =        21
                                                            F(   1,     21)    =     36.41
                                                            Prob > F           =    0.0000
                                                            R-squared          =    0.3888
            
            ------------------------------------------------------------------------------
                         |             Linearized
                     mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                    turn |  -2.746398   .4551706    -6.03   0.000    -3.692977   -1.799819
                   _cons |   122.0202   16.05044     7.60   0.000     88.64146    155.3989
            ------------------------------------------------------------------------------

            Code:
            . qui regress mpg turn if foreign
            
            . suest .
            
            Robust results for .                              Number of obs   =         22
            
            ------------------------------------------------------------------------------
                         |               Robust
                         |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
            mean         |
                    turn |  -2.746398   .4551706    -6.03   0.000    -3.638516    -1.85428
                   _cons |   122.0202   16.05044     7.60   0.000     90.56189    153.4785
            -------------+----------------------------------------------------------------
            lnvar        |
                   _cons |   3.333912   .4099026     8.13   0.000     2.530517    4.137306
            ------------------------------------------------------------------------------



            • #7
              Actually, maybe this is more transparent. In both cases, multiplying the regress, vce(robust) SE by sqrt(20/22) removes the aforementioned small-sample degrees-of-freedom adjustment. The first example then replicates the SE of the survey linear regression; the second replicates the SE obtained with the subpop() option.

              Code:
              . qui svy: regress mpg turn if foreign
              
              . 
              . di _se[turn]
              .45517062
              
              . 
              . qui regress mpg turn if foreign, rob
              
              . di _se[turn] * sqrt(20/22) * sqrt(22/21)
              .45517062

              Code:
              . qui svy, subpop(for) : regress mpg turn
              
              . 
              . di _se[turn]
              .44774109
              
              . 
              . qui regress mpg turn if foreign, rob
              
              . di _se[turn] * sqrt(20/22) * sqrt(74/73)
              .44774109



              • #8
                Thanks Mark. For what I assume are similar reasons, glm will also produce the same results:

                Code:
                . glm mpg turn if foreign, vce(robust)
                
                Iteration 0:   log pseudolikelihood = -66.841263  
                
                Generalized linear models                          No. of obs      =        22
                Optimization     : ML                              Residual df     =        20
                                                                   Scale parameter =  28.04784
                Deviance         =  560.9567723                    (1/df) Deviance =  28.04784
                Pearson          =  560.9567723                    (1/df) Pearson  =  28.04784
                
                Variance function: V(u) = 1                        [Gaussian]
                Link function    : g(u) = u                        [Identity]
                
                                                                   AIC             =  6.258297
                Log pseudolikelihood = -66.84126307                BIC             =  499.1359
                
                ------------------------------------------------------------------------------
                             |               Robust
                         mpg |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                        turn |  -2.746398   .4551706    -6.03   0.000    -3.638516    -1.85428
                       _cons |   122.0202   16.05044     7.60   0.000     90.56189    153.4785
                ------------------------------------------------------------------------------
                
                . test turn
                
                 ( 1)  [mpg]turn = 0
                
                           chi2(  1) =   36.41
                         Prob > chi2 =    0.0000
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology



                • #9
                  To address Richard's original question:

                  If I have svyset the data, then I am still semi-confident that I should use subpop rather than if for subsample selection.
                  But I am not sure what the answer is to my original question about using [pw = whatever] vs svy:
                  In the Remarks section of the manual entry [SVY] svy estimation there is an
                  overview of survey analysis that details what it means when you use the
                  svy prefix. For example, it details the difference between regress
                  and svy: regress. See page 92 of the Survey Data manual in Stata 13.


                  Subpopulation estimation in a nutshell:

                  The subpop() option is there to handle the common occurrence of subpopulation estimation with survey data.
                  The basic principle of subpopulation estimation is that the survey design is fixed, so in order to account for
                  sample-to-sample variation, we must assume that future samples will be taken using the current survey design.
                  Even if the overall sample size is fixed in repeated samples, this method accounts for the fact that the
                  subpopulation sample size is a random quantity.

                  With the if qualifier, we assume that we can ignore the out-of-subpopulation units when estimating the standard
                  errors. This breaks the assumption that the survey design is fixed. A fundamental consequence of using this
                  method is that you are treating the subpopulation sample size as fixed for variance estimation.
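
                  As a syntax note, subpop() accepts an if qualifier as well as an indicator variable, so the subpop() command in #4 could also have been written as (a sketch, equivalent because foreign is a 0/1 indicator):

                  Code:
                  . svy, subpop(if foreign): regress mpg turn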



                  • #10
                    Thanks to everybody for this very interesting discussion! Just to make sure I understand correctly:

                    I thought the basic reason for using subpop was this: "If the data set is subset, meaning that observations not to be included in the subpopulation are deleted from the data set, the standard errors of the estimates cannot be calculated correctly. When the subpopulation option(s) is used, only the cases defined by the subpopulation are used in the calculation of the estimate, but all cases are used in the calculation of the standard errors." (source: https://stats.idre.ucla.edu/stata/fa...data-in-stata/ )
                    I have to admit I'm not a hundred percent sure what "cannot be calculated correctly" amounts to, but I guess the regressions above are good examples.

                    So, how does that tie in with what Jeff writes, i.e., that this method "accounts for the fact that the subpop sample size is now a random entity" and that a "fundamental consequence of using this method [the if clause] is that you are fixing the subpop sample size for variance estimation"? I am just not a hundred percent sure what that means with regard to survey data. (Sorry if this is basic; I would appreciate some help/explanation nonetheless.)

                    And, Mark, it was really interesting to see how you reproduced the svy subpop results with
                    Code:
                    . qui regress mpg turn if foreign, rob

                    . di _se[turn] * sqrt(20/22) * sqrt(74/73)
                    .44774109
                    Am I right in assuming that 20 is the degrees of freedom of the simple regression, whereas 22 is the number of PSUs of the svy subpop regression? If yes, can somebody give me a quick clue as to the statistical rationale behind using sqrt(20/22) and sqrt(74/73) to reproduce the svy subpop results?



                    • #11
                      Maybe another interesting difference, for multilevel models, is something Steve Samuels says in this other discussion:

                      Originally posted by Steve Samuels View Post
                      ... after svyset, Stata expects you to include the PSU as a level in the multilevel model and to specify the PSU design weight (1/(selection probability of the PSU)).
                      If [pweight=some weight] can be used in multilevel models without specifying a further level for PSUs, that would be quite a reduction in complexity. But I might just as well be mistaken!



                      • #12
                        Dear all, thank you very much! I have learned a lot from your daily posts.
                        With best wishes,
                        Hassen



                        • #13
                          Hi Everyone,

                          I'm really late to the party on this one, but I had a similar question to what Richard initially asked:

                          Is there any compelling reason to svyset the data and use svy:, as opposed to just adding [pw = whatever] to each estimation command?
                          With respect to the Canadian Community Health Survey, Statistics Canada has a complex weighting system that folds all information about strata, clusters, primary sampling units, etc. into a single person-level master weight <wts_m>. This is the only weighting information provided, and it is meant to "debias" the eventual estimates. When using svy:, there is a slight change in the df relative to regress with weights; however, Stata now assumes that the number of PSUs equals the sample size, which is far from correct.

                          Code:
                          . svy: regress inc sex age-dmar3 
                          (running regress on estimation sample)
                          
                          Survey: Linear regression
                          
                          Number of strata   =         1                 Number of obs     =      22,929
                          Number of PSUs     =    22,929                 Population size   =  25,858,251
                                                                         Design df         =      22,928
                                                                         F(   5,  22924)   =      174.08
                                                                         Prob > F          =      0.0000
                                                                         R-squared         =      0.1086
                          
                          ------------------------------------------------------------------------------
                                       |             Linearized
                                   inc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                          -------------+----------------------------------------------------------------
                                   sex |   .4384666   .0637057     6.88   0.000     .3135992     .563334
                                   age |  -.1105906    .009394   -11.77   0.000    -.1290034   -.0921777
                              minority |  -1.517119   .0874858   -17.34   0.000    -1.688598   -1.345641
                                 dmar2 |  -1.395568   .0983552   -14.19   0.000    -1.588351   -1.202785
                                 dmar3 |  -1.331934   .0839483   -15.87   0.000    -1.496478    -1.16739
                                 _cons |   6.855946   .0868685    78.92   0.000     6.685678    7.026214
                          ------------------------------------------------------------------------------
                          
                          . regress inc sex age-dmar3 [pw=wts_m]
                          (sum of wgt is 25,858,250.63)
                          
                          Linear regression                               Number of obs     =     22,929
                                                                          F(5, 22923)       =     174.07
                                                                          Prob > F          =     0.0000
                                                                          R-squared         =     0.1086
                                                                          Root MSE          =     2.7095
                          
                          ------------------------------------------------------------------------------
                                       |               Robust
                                   inc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                          -------------+----------------------------------------------------------------
                                   sex |   .4384666   .0637126     6.88   0.000     .3135856    .5633476
                                   age |  -.1105906    .009395   -11.77   0.000    -.1290054   -.0921757
                              minority |  -1.517119   .0874953   -17.34   0.000    -1.688616   -1.345623
                                 dmar2 |  -1.395568    .098366   -14.19   0.000    -1.588372   -1.202764
                                 dmar3 |  -1.331934   .0839574   -15.86   0.000    -1.496496   -1.167372
                                 _cons |   6.855946    .086878    78.91   0.000     6.685659    7.026233
                          ------------------------------------------------------------------------------
                          As noted above in this thread, the "best practices" question is complicated when discussing subsample selections, because "if" and "subpop()" produce different standard errors.
                          1. Is svyset necessary, or is specifying only the pw option acceptable with the entire sample?
                          2. Is svyset necessary, or is specifying only the pw option acceptable when looking at subsamples?
                            • Large datasets always have at least some missing data, and those observations are automatically excluded by <regress>. Will their exclusion create an artificial subsample within a model, even though no subsample/if option was specified?
                          Basically, if you use svyset, Stata will make incorrect assumptions about the structure of your data, and if you don't svyset, subsample estimates may be mishandled, so I'm not sure how to proceed.
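
                          For reference, the weight-only declaration that produces the svy output above is presumably something like this (an assumption; the exact svyset call is not shown in the post):

                          Code:
                          . svyset [pweight=wts_m]

                          With no PSU variable specified, this treats each observation as its own PSU, hence the PSU count equal to the sample size.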

                          Thanks everyone!

                          Cheers,

                          David.

