  • Cluster-robust SEs with survey data

    I'm working with data from a clustered sample in which observations were sampled with a known probability, the inverse of which is to be used as a sampling weight (pweight). There are two ways to obtain the correct point estimates: (i) using reg yvar xvar [pw = pweight], or (ii) using svyset [pw = pweight] and then svy : reg yvar xvar. These return identical point estimates (as they should). However, once one introduces cluster-robust standard errors, the "manual" approach and the svyset approach return slightly different results. By "manual" I mean a command of the form reg yvar xvar [pw = pweight], cluster(clustervar), as opposed to svyset clustervar [pw = pweight] followed by svy : reg yvar xvar. Here is a little code example to illustrate this with some numbers:
    Code:
    sysuse auto
    set seed 92122
    *a variable containing random integers from 1 thru 4 designating fake clusters
    gen mycluster = ceil(4*uniform())
    *random probability weights as the inverse of some random sampling probability
    gen mypw = 1/uniform()
    
    *run the "manual" regression
    reg price mpg weight [pw = mypw], cluster(mycluster)
    
    *using svy design
    svyset mycluster [pw = mypw]
    svy : reg price mpg weight
    The standard errors are very close to one another but not identical (for mpg, 72.48 versus 71.48; for weight, 0.969 versus 0.956). Stata labels the ones from the svyset regression "Linearized", so I suppose that's where the difference comes from, potentially a Taylor expansion? Could somebody point me towards the precise (mathematical) difference? Are there patterns, i.e. is one always larger than the other?
    I'm using Stata 13. I've posted this question before in the Cross Validated community but have not received an answer http://stats.stackexchange.com/quest...-survey-design.

  • #2
    A standard error for regress with a cluster() option will always be larger than that from svy: regress, with the ratio of squared standard errors equal to \(\dfrac{n}{n-k}\), where \(k\) is the number of predictors, including the intercept, and \(n\) is the sample size.

    Unfortunately, your example is not realistic, because you've created a different cluster for each observation. It doesn't really matter for your question. Here's a modification to show what is going on.
    Code:
    sysuse auto, clear
    set seed 92122
    gen mkr = substr(make,1,2)
    codebook mkr // 23 clusters
    *run the "manual" regression
    reg price mpg weight [pw = gear], cluster(mkr)
    scalar v1 = _se[mpg]^2
    scalar k = e(rank) // number of predictors + 1
    scalar n = e(N)
    
    *using svy design
    svyset mkr [pw = gear]
    svy : reg price mpg weight
    scalar v2 = _se[mpg]^2
    di " v1 = " v1   "    v2 = " v2
    di " n =  "n   "    k = " k
    di " v1/v2 = "v1/v2
    di " (n-1)/(n-k) = "(n-1)/(n-k)
    The results of the display statements:
    Code:
    v1 = 5364.5846 v2 = 5217.6097
    n = 74 k = 3
    v1/v2 = 1.028169
    (n-1)/(n-k) = 1.028169
    The values in the last two lines are identical. See page 469 of the Stata 14 Manual entry for **_robust**, which acknowledges the difference.
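
    As a sketch of where the factor comes from (my reading of the documented finite-sample multipliers, not a quotation from the manual): write \(m\) for the number of clusters, \(u_g\) for the weighted score contribution of cluster \(g\), and \(D\) for the usual "bread" matrix. The symbols \(m\), \(u_g\) and \(D\) are notation introduced here for illustration; a single stratum and no finite population correction are assumed.

    ```latex
    \[
    \widehat{V}_{\text{cluster}}
      = \frac{n-1}{n-k}\cdot\frac{m}{m-1}\; D\Bigl(\sum_{g=1}^{m} u_g^{\top} u_g\Bigr) D,
    \qquad
    \widehat{V}_{\text{svy}}
      = \frac{m}{m-1}\; D\Bigl(\sum_{g=1}^{m} u_g^{\top} u_g\Bigr) D
    \]
    \[
    \frac{\widehat{V}_{\text{cluster}}}{\widehat{V}_{\text{svy}}}
      = \frac{n-1}{n-k}
      = \frac{74-1}{74-3}
      = \frac{73}{71}
      \approx 1.028169
    \]
    ```

    Under these assumptions the two estimators share the same sandwich and differ only in the degrees-of-freedom multiplier, which is consistent with the ratio displayed above.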
    Last edited by Steve Samuels; 20 Jan 2016, 17:46.
    Steve Samuels
    Statistical Consulting
    [email protected]

    Stata 14.2


    • #3
      Corrections:
      1) In the first paragraph of my answer above, \(\dfrac{n}{n-k}\) should be \(\dfrac{n-1}{n-k}\)

      2) Also, the Manual entry refers to a multiplier \(\dfrac{n}{n-k}\) to make (what look like) the same two calculations match. I can't account for the discrepancy.
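
      One possible reading, offered as an assumption rather than a confirmed account: the \(\dfrac{n}{n-k}\) multiplier may describe the unclustered vce(robust) case, whereas the example above is clustered. A minimal sketch to check the unclustered case, treating each observation as its own PSU via svyset _n, is below. It only prints the two quantities for comparison; I have not asserted what they will be.

      ```stata
      sysuse auto, clear

      * unclustered robust variance from regress
      regress price mpg weight [pw = gear_ratio], vce(robust)
      scalar v1 = _se[mpg]^2
      scalar k  = e(rank)   // number of coefficients, including the intercept
      scalar n  = e(N)

      * survey design with each observation as its own PSU
      svyset _n [pw = gear_ratio]
      svy: regress price mpg weight
      scalar v2 = _se[mpg]^2

      * compare the variance ratio with the candidate multipliers
      display "v1/v2       = " v1/v2
      display "(n-1)/(n-k) = " (n-1)/(n-k)
      display "n/(n-k)     = " n/(n-k)
      ```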
      Last edited by Steve Samuels; 20 Jan 2016, 18:49.
      Steve Samuels
      Statistical Consulting
      [email protected]

      Stata 14.2


      • #4
        Hi all,

        I have a question about the robust option and the svy prefix: why don't we need to specify robust when we use the survey data commands? (Stata does not allow the robust option with the svy prefix.) If we do not use the robust option, do the survey estimators still account for the heteroskedasticity problems that arise in OLS regression?

        Thank you.


        • #5
          All of the methods for variance estimation with svy commands are already robust; hence no further robust option is necessary.
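
          As an illustration, here is a minimal sketch that fits the same model with two of the svy variance methods, the default linearized estimator and the delete-one-PSU jackknife. The cluster definition follows the example in #2; mkr_id is a helper variable introduced here only for illustration, and gear_ratio is again used as a stand-in weight, not a real sampling weight.

          ```stata
          sysuse auto, clear
          gen mkr = substr(make, 1, 2)
          egen mkr_id = group(mkr)      // numeric cluster id (helper variable for this sketch)

          svyset mkr_id [pw = gear_ratio]

          * default: linearized (Taylor-series) variance
          svy: regress price mpg weight

          * replication-based alternative: delete-one-PSU jackknife
          svy jackknife: regress price mpg weight
          ```

          Neither call takes a robust option; the standard errors differ only because the variance method differs.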
          Last edited by Steve Samuels; 22 May 2018, 16:46.
          Steve Samuels
          Statistical Consulting
          [email protected]

          Stata 14.2
