
  • why does svy reg include observations whose outcome variable is missing?


    Dear Statalists,

    I do not understand why a weighted linear regression, svy: reg, includes observations that have missing values in the outcome variable.

    In the example below, the sample size used in the model is 48974, which includes 10421 respondents whose outcome variable value is missing.

    Code:
     svyset psu [pw=pw_xw], strata(strata) singleunit(scaled)
    
          pweight: pw_xw
              VCE: linearized
      Single unit: scaled
         Strata 1: strata
             SU 1: psu
            FPC 1: <zero>
    
    . svy: reg v1 v2
    (running regress on estimation sample)
    
    Survey: Linear regression
    
    Number of strata   =     1,769                  Number of obs     =     48,974
    Number of PSUs     =     7,699                  Population size   = 38,107.278
                                                    Design df         =      5,930
                                                    F(   1,   5930)   =     161.60
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.0043
    
    ------------------------------------------------------------------------------
                 |             Linearized
              v1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              v2 |  -.0985989   .0077561   -12.71   0.000    -.1138037   -.0833941
           _cons |   3.633923   .0065863   551.74   0.000     3.621011    3.646835
    ------------------------------------------------------------------------------
    Note: 7 strata omitted because they contain no population members.
    
    . count if v1==. & e(sample)==1
      10,421
    
    .
    This problem also occurs when I declare only sampling weights in svyset.

    Code:
    . svyset   [pw=pw_xw]
    
          pweight: pw_xw
              VCE: linearized
      Single unit: missing
         Strata 1: <one>
             SU 1: <observations>
            FPC 1: <zero>
    
    . svy: reg v1 v2
    (running regress on estimation sample)
    
    Survey: Linear regression
    
    Number of strata   =         1                  Number of obs     =     49,053
    Number of PSUs     =    49,053                  Population size   = 38,107.278
                                                    Design df         =     49,052
                                                    F(   1,  49052)   =     132.80
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.0043
    
    ------------------------------------------------------------------------------
                 |             Linearized
              v1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              v2 |  -.0985989   .0085559   -11.52   0.000    -.1153686   -.0818292
           _cons |   3.633923   .0057285   634.36   0.000     3.622695    3.645151
    ------------------------------------------------------------------------------
    
    . count if v1==. & e(sample)==1
      10,498



    Earlier posts suggest that the differences between using the svy prefix and putting [pw=...] at the end of the command are related to subpopulations. But I do not think the subpopulation issue is involved here.

    I am using Stata 15 MP on 64-bit Windows.



    Many thanks.

    Regards,
    Min


  • #2
    You didn't get a quick answer. You might improve your chances of a useful response by providing a small sample of the data and program so someone can replicate your analysis.
    I'm afraid I don't use survey so I can't really help you there.



    • #3
      I've tried some simple examples and can't replicate. That leads me to my generic advice of making sure your Stata is up to date. Also maybe try running it on a different machine and see if the problem occurs there.

      Also, what happens if you don't use svy at all?
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology
      Stata Version: 17.0 MP (2 processor)

      EMAIL: [email protected]
      WWW: https://www3.nd.edu/~rwilliam



      • #4
        Many thanks for your replies.

        Here is some sample data generated by dataex


        Code:
        . svyset   [pw=pw_xw] 
        
              pweight: pw_xw
                  VCE: linearized
          Single unit: missing
             Strata 1: <one>
                 SU 1: <observations>
                FPC 1: <zero>
        
        . qui: svy: reg v1 v2
        
        
        . dataex v1 v2  if e(sample)==1 & v1==.
        
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input double v1 float v2
        . 0
        . 0
        . 0
        . 0
        . 1
        . 1
        . 0
        . 1
        . 1
        . 0
        . 1
        . 0
        . 1
        . 0
        . 1
        . 0
        . 1
        . 0
        . 0
        . 1
        . 0
        . 0
        . 0
        . 1
        . 0
        . 1
        . 1
        . 0
        . 0
        . 1
        . 0
        . 0
        . 1
        . 0
        . 1
        . 1
        . 1
        . 1
        . 0
        . 0
        . 1
        . 0
        . 0
        . 1
        . 0
        . 1
        . 1
        . 1
        . 0
        . 1
        . 1
        . 1
        . 1
        . 0
        . 0
        . 0
        . 0
        . 0
        . 0
        . 1
        . 0
        . 0
        . 0
        . 0
        . 0
        . 0
        . 0
        . 1
        . 0
        . 0
        . 0
        . 0
        . 1
        . 1
        . 1
        . 0
        . 0
        . 0
        . 0
        . 0
        . 0
        . 1
        . 0
        . 1
        . 0
        . 1
        . 0
        . 1
        . 1
        . 1
        . 1
        . 1
        . 1
        . 1
        . 0
        . 0
        . 1
        . 0
        . 0
        . 0
        end

        I cannot reproduce the problem using Stata's built-in data either.


        Without svy, the results do not include observations with a missing outcome, as expected.

        Code:
        .  reg v1 v2
        
              Source |       SS           df       MS      Number of obs   =    38,555
        -------------+----------------------------------   F(1, 38553)     =     85.75
               Model |  47.9973611         1  47.9973611   Prob > F        =    0.0000
            Residual |   21579.416    38,553  .559733769   R-squared       =    0.0022
        -------------+----------------------------------   Adj R-squared   =    0.0022
               Total |  21627.4134    38,554  .560964189   Root MSE        =    .74815
        
        ------------------------------------------------------------------------------
                  v1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                  v2 |  -.0710717    .007675    -9.26   0.000    -.0861149   -.0560285
               _cons |   3.640312   .0050938   714.65   0.000     3.630327    3.650296
        ------------------------------------------------------------------------------

        It really bothers me.

        Version is 15.1.
        Current update level is 27 Jun 2018.


        Regards,
        Min



        • #5
          Sorry, I forgot to mention that I did try a different machine and got the same results.



          • #6
            Min and I communicated off-list about this. For some reason, Min's data set has 11,008 cases with zero weights. That seems weird to me, but it is not illegal. Those 11,008 cases apparently do not have much effect on the results.

            Code:
            . count if pw_xw == 0
              11,008
            
            . svy: reg v1 v2
            (running regress on estimation sample)
            
            Survey: Linear regression
            
            Number of strata   =         1                  Number of obs     =     49,053
            Number of PSUs     =    49,053                  Population size   = 38,107.278
                                                            Design df         =     49,052
                                                            F(   1,  49052)   =     132.80
                                                            Prob > F          =     0.0000
                                                            R-squared         =     0.0043
            
            ------------------------------------------------------------------------------
                         |             Linearized
                      v1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                      v2 |  -.0985989   .0085559   -11.52   0.000    -.1153686   -.0818292
                   _cons |   3.633923   .0057285   634.36   0.000     3.622695    3.645151
            ------------------------------------------------------------------------------
            
            . svy, subpop( if !missing( v1 )): reg v1 v2
            (running regress on estimation sample)
            
            Survey: Linear regression
            
            Number of strata   =         1                  Number of obs     =     50,994
            Number of PSUs     =    50,994                  Population size   =     39,986
                                                            Subpop. no. obs   =     38,045
                                                            Subpop. size      = 38,107.278
                                                            Design df         =     50,993
                                                            F(   1,  50993)   =     132.80
                                                            Prob > F          =     0.0000
                                                            R-squared         =     0.0043
            
            ------------------------------------------------------------------------------
                         |             Linearized
                      v1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                      v2 |  -.0985989   .0085559   -11.52   0.000    -.1153685   -.0818292
                   _cons |   3.633923   .0057285   634.36   0.000     3.622695    3.645151
            ------------------------------------------------------------------------------
            For more, see https://www.stata.com/support/faqs/s...-zero-weights/

            I'm not sure if what you are seeing is a Stata bug or not. Maybe Stata does not check for missing values when cases have 0 weights. If it is a bug, it may or may not be a harmless one -- maybe it would matter more in a different data set. This might be worth reporting to Stata Tech Support.
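
            One way to probe that conjecture is to check whether every missing-outcome case kept in e(sample) has a zero weight. The following is only a minimal sketch, assuming it is run immediately after svy: reg v1 v2 with the svyset shown above. The names in_sample, zero_wt, and miss_v1 are hypothetical helpers introduced for illustration, not variables in Min's data.

            ```stata
            * Sketch only: assumes -svy: reg v1 v2- has just been run,
            * so that e(sample) refers to that estimation sample.
            generate byte in_sample = e(sample)

            * Observations kept in the estimation sample despite a missing outcome
            count if in_sample & missing(v1)

            * Of those, how many carry a positive, nonmissing weight?
            * A count of 0 would be consistent with the zero-weight explanation.
            count if in_sample & missing(v1) & pw_xw > 0 & !missing(pw_xw)

            * Cross-tabulate zero weight against missing outcome within the sample
            generate byte zero_wt = (pw_xw == 0)
            generate byte miss_v1 = missing(v1)
            tabulate zero_wt miss_v1 if in_sample
            ```

            If the second count is zero, the retained missing-outcome cases are exactly zero-weight cases. That would fit the identical coefficients in the two svy runs above, although it would not by itself settle whether the behavior is intended.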
            -------------------------------------------
            Richard Williams, Notre Dame Dept of Sociology
            Stata Version: 17.0 MP (2 processor)

            EMAIL: [email protected]
            WWW: https://www3.nd.edu/~rwilliam

