why does svy reg include observations whose outcome variable is missing?

Min Zhang

Join Date: Mar 2015
Posts: 40

19 Nov 2018, 07:12

Dear Statalists,

I do not understand why a weighted linear regression (svy: reg) includes observations that have missing values in the outcome variable.

In the example below, the sample size used in the model is 48974, which includes 10421 respondents whose outcome variable value is missing.

Code:

 svyset psu [pw=pw_xw], strata(strata) singleunit(scaled)

      pweight: pw_xw
          VCE: linearized
  Single unit: scaled
     Strata 1: strata
         SU 1: psu
        FPC 1: <zero>

. svy: reg v1 v2
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =     1,769                  Number of obs     =     48,974
Number of PSUs     =     7,699                  Population size   = 38,107.278
                                                Design df         =      5,930
                                                F(   1,   5930)   =     161.60
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0043

------------------------------------------------------------------------------
             |             Linearized
          v1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          v2 |  -.0985989   .0077561   -12.71   0.000    -.1138037   -.0833941
       _cons |   3.633923   .0065863   551.74   0.000     3.621011    3.646835
------------------------------------------------------------------------------
Note: 7 strata omitted because they contain no population members.

. count if v1==. & e(sample)==1
  10,421

.

This problem also occurs when I declare only sampling weights in svyset.

Code:

. svyset   [pw=pw_xw]

      pweight: pw_xw
          VCE: linearized
  Single unit: missing
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>

. svy: reg v1 v2
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =         1                  Number of obs     =     49,053
Number of PSUs     =    49,053                  Population size   = 38,107.278
                                                Design df         =     49,052
                                                F(   1,  49052)   =     132.80
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0043

------------------------------------------------------------------------------
             |             Linearized
          v1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          v2 |  -.0985989   .0085559   -11.52   0.000    -.1153686   -.0818292
       _cons |   3.633923   .0057285   634.36   0.000     3.622695    3.645151
------------------------------------------------------------------------------

. count if v1==. & e(sample)==1
  10,498

Earlier posts suggest that differences between using the svy prefix and putting [pw=...] at the end of the command are related to subpopulations, but I do not think a subpopulation issue is involved here.

I am using Stata 15 MP on 64-bit Windows.

Many thanks.

Regards,
Min

Tags: missing data, svy

Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

20 Nov 2018, 10:26

You didn't get a quick answer. You might improve your chances of a useful response by providing a small sample of the data and program so someone can replicate your analysis.
I'm afraid I don't use survey so I can't really help you there.
Richard Williams

Join Date: Apr 2014

Posts: 4987
#3

20 Nov 2018, 13:10

I've tried some simple examples and can't replicate. That leads me to my generic advice of making sure your Stata is up to date. Also maybe try running it on a different machine and see if the problem occurs there.

Also, what happens if you don't use svy at all?

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam

Min Zhang

Join Date: Mar 2015
Posts: 40

21 Nov 2018, 06:06

Many thanks for your replies.

Here is some sample data generated by dataex

Code:

. svyset   [pw=pw_xw] 

      pweight: pw_xw
          VCE: linearized
  Single unit: missing
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>

. qui: svy: reg v1 v2


. dataex v1 v2  if e(sample)==1 & v1==.

* Example generated by -dataex-. To install: ssc install dataex
clear
input double v1 float v2
. 0
. 0
. 0
. 0
. 1
. 1
. 0
. 1
. 1
. 0
. 1
. 0
. 1
. 0
. 1
. 0
. 1
. 0
. 0
. 1
. 0
. 0
. 0
. 1
. 0
. 1
. 1
. 0
. 0
. 1
. 0
. 0
. 1
. 0
. 1
. 1
. 1
. 1
. 0
. 0
. 1
. 0
. 0
. 1
. 0
. 1
. 1
. 1
. 0
. 1
. 1
. 1
. 1
. 0
. 0
. 0
. 0
. 0
. 0
. 1
. 0
. 0
. 0
. 0
. 0
. 0
. 0
. 1
. 0
. 0
. 0
. 0
. 1
. 1
. 1
. 0
. 0
. 0
. 0
. 0
. 0
. 1
. 0
. 1
. 0
. 1
. 0
. 1
. 1
. 1
. 1
. 1
. 1
. 1
. 0
. 0
. 1
. 0
. 0
. 0
end

I cannot replicate the results using Stata built-in data either.

Without svy, the estimation sample excludes the observations with a missing outcome, as it should.

Code:

.  reg v1 v2

      Source |       SS           df       MS      Number of obs   =    38,555
-------------+----------------------------------   F(1, 38553)     =     85.75
       Model |  47.9973611         1  47.9973611   Prob > F        =    0.0000
    Residual |   21579.416    38,553  .559733769   R-squared       =    0.0022
-------------+----------------------------------   Adj R-squared   =    0.0022
       Total |  21627.4134    38,554  .560964189   Root MSE        =    .74815

------------------------------------------------------------------------------
          v1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          v2 |  -.0710717    .007675    -9.26   0.000    -.0861149   -.0560285
       _cons |   3.640312   .0050938   714.65   0.000     3.630327    3.650296
------------------------------------------------------------------------------

It really bothers me.

Version is 15.1.
Current update level is 27 Jun 2018.

Regards,
Min


Min Zhang

Join Date: Mar 2015

Posts: 40
#5

21 Nov 2018, 07:08

Sorry, I forgot to mention that I did try a different machine and got the same results.

Richard Williams

Join Date: Apr 2014
Posts: 4987

22 Nov 2018, 09:46

Min and I communicated off-list about this. For some reason, Min's data set has 11,008 cases with zero weights. That seems weird to me, but it is not illegal. Those 11,008 cases apparently do not have much effect on the results.

Code:

. count if pw_xw == 0
  11,008

. svy: reg v1 v2
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =         1                  Number of obs     =     49,053
Number of PSUs     =    49,053                  Population size   = 38,107.278
                                                Design df         =     49,052
                                                F(   1,  49052)   =     132.80
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0043

------------------------------------------------------------------------------
             |             Linearized
          v1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          v2 |  -.0985989   .0085559   -11.52   0.000    -.1153686   -.0818292
       _cons |   3.633923   .0057285   634.36   0.000     3.622695    3.645151
------------------------------------------------------------------------------

. svy, subpop( if !missing( v1 )): reg v1 v2
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =         1                  Number of obs     =     50,994
Number of PSUs     =    50,994                  Population size   =     39,986
                                                Subpop. no. obs   =     38,045
                                                Subpop. size      = 38,107.278
                                                Design df         =     50,993
                                                F(   1,  50993)   =     132.80
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0043

------------------------------------------------------------------------------
             |             Linearized
          v1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          v2 |  -.0985989   .0085559   -11.52   0.000    -.1153685   -.0818292
       _cons |   3.633923   .0057285   634.36   0.000     3.622695    3.645151
------------------------------------------------------------------------------

For more, see https://www.stata.com/support/faqs/s...-zero-weights/
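
For anyone who wants to experiment with this outside Min's data, here is a minimal sketch on the built-in auto data. The variables pw, y, insample, miss_y, and zero_w are made up for illustration; they are not from Min's file. It sets some weights to zero, makes the outcome missing for both zero-weight and positive-weight cases, and then tabulates which of them end up in e(sample). I have not listed expected output, because the result may depend on the Stata version.

```stata
* Hypothetical reproduction attempt on built-in data (not Min's data)
sysuse auto, clear

generate double pw = 1            // made-up sampling weight
replace pw = 0 in 1/10            // ten observations get a zero weight

generate double y = price         // made-up outcome
replace y = . in 1/5              // missing outcome, zero weight
replace y = . in 11/15            // missing outcome, positive weight

svyset [pw=pw]
svy: regress y mpg

* Which missing-outcome cases were kept in the estimation sample?
generate byte insample = e(sample)
generate byte miss_y   = missing(y)
generate byte zero_w   = (pw == 0)
tabulate miss_y zero_w if insample
```

If only the zero-weight cases with a missing outcome show up in the table, that would be consistent with the guess that missing values are not checked when the weight is 0.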

I'm not sure if what you are seeing is a Stata bug or not. Maybe Stata does not check for missing values when cases have 0 weights. If it is a bug, it may or may not be a harmless one -- maybe it would matter more in a different data set. This might be worth reporting to Stata Tech Support.
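
One way to check whether the extra cases matter in a given data set is to compare svy: regress with plain regress using the same pweights. A zero-weight observation contributes nothing to the weighted sums, so the point estimates should agree even if the reported number of observations differs. The sketch below does this on the built-in auto data; pw, y, and the scalar names are made up for illustration and the setup is an assumption, not Min's design.

```stata
* Hypothetical check on built-in data (not Min's data)
sysuse auto, clear

generate double pw = 1            // made-up sampling weight
replace pw = 0 in 1/10            // zero weights
generate double y = price         // made-up outcome
replace y = . in 1/5              // missing outcome, zero weight
replace y = . in 11/15            // missing outcome, positive weight

* Ordinary regression with probability weights
regress y mpg [pw=pw]
scalar b_pw = _b[mpg]
scalar n_pw = e(N)

* Same model through the svy prefix
svyset [pw=pw]
svy: regress y mpg
scalar b_svy = _b[mpg]
scalar n_svy = e(N)

* Compare slopes and reported sample sizes side by side
display "slope, regress [pw]: " b_pw "   slope, svy: " b_svy
display "N, regress [pw]: " n_pw "   N, svy: " n_svy
```

If the slopes match and only N differs, the extra observations are affecting the reported sample size and not the coefficients. The standard errors are a separate question, since the two commands may count design degrees of freedom differently.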

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
