Cluster-robust SEs with survey data

Michael Kaiser

Join Date: Jan 2016

Posts: 1
#1


20 Jan 2016, 09:45

I'm working with data from a clustered sample in which observations were sampled with a known probability, the inverse of which is to be used as a sampling weight (pweight). There are two ways to obtain the correct point estimates: (i) reg yvar xvar [pw = pweight], or (ii) svyset [pw = pweight] followed by svy : reg yvar xvar. These return identical point estimates (as they should). However, once cluster-robust standard errors are introduced, the "manual" approach and the svyset approach return slightly different results. By "manual" I mean a command of the form reg yvar xvar [pw = pweight], cluster(clustervar), as opposed to svyset clustervar [pw = pweight] followed by svy : reg yvar xvar. Here is a small code example to illustrate this with some numbers:

Code:

sysuse auto
set seed 92122
*a variable containing random integers from 1 thru 4 designating fake clusters
gen mycluster = ceil(4*uniform())
*random probability weights as the inverse of some random sampling probability
gen mypw = 1/uniform()
*run the "manual" regression
reg price mpg weight [pw = mypw], cluster(mycluster)
*using svy design
svyset mycluster [pw = mypw]
svy : reg price mpg weight

The standard errors are very close to one another but not identical (72.48 and 71.48 for mpg; 0.969 and 0.956 for weight). Stata calls the ones from the svyset regression "Linearized", so I suppose that's where the difference comes from, potentially a Taylor expansion? Could somebody point me towards the precise (mathematical) difference? Are there patterns, i.e. is one always larger than the other?
I'm using Stata 13. I've posted this question before in the Cross Validated community but have not received an answer: http://stats.stackexchange.com/quest...-survey-design.
Tags: cluster-robust, sampling, standard error, survey
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#2

20 Jan 2016, 16:54

A standard error from regress with a cluster() option will always be larger than that from svy: regress, with the ratio of squared standard errors equal to \(\dfrac{n}{n-k}\), where \(k\) is the number of predictors, including the intercept, and \(n\) is the sample size.

Unfortunately, your example is not realistic, because you've created a different cluster for each observation. It doesn't really matter for your question. Here's a modification to show what is going on.

Code:

sysuse auto, clear
set seed 92122
gen mkr = substr(make,1,2)
codebook mkr // 23 clusters
*run the "manual" regression
reg price mpg weight [pw = gear], cluster(mkr)
scalar v1 = _se[mpg]^2
scalar k = e(rank) // number of predictors + 1
scalar n = e(N)
*using svy design
svyset mkr [pw = gear]
svy : reg price mpg weight
scalar v2 = _se[mpg]^2
di " v1 = " v1 " v2 = " v2
di " n = "n " k = " k
di " v1/v2 = "v1/v2
di " (n-1)/(n-k) = "(n-1)/(n-k)

The results of the display statements:

Code:

v1 = 5364.5846 v2 = 5217.6097
n = 74 k = 3
v1/v2 = 1.028169
(n-1)/(n-k) = 1.028169

The values in the last two lines are identical. See page 469 of the Stata 14 Manual entry for **_robust**, which acknowledges the difference.
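
For the single-stratum design above, the two variance estimators can be sketched as follows. This is my understanding of the finite-sample multipliers, not a quotation from the manual. The symbols \(M\) (number of clusters), \(D\) (the inverse of the weighted cross-product matrix of the predictors), and \(u_j\) (the weighted score total for cluster \(j\), as a row vector) are notation introduced here for illustration. The cluster score totals sum to zero at the weighted least-squares solution, so centering them, as the survey estimator does, leaves the middle term unchanged.

```latex
\begin{align*}
\widehat{V}_{\text{cluster}} &= \frac{n-1}{n-k}\cdot\frac{M}{M-1}\; D \Bigl(\sum_{j=1}^{M} u_j^{\top} u_j\Bigr) D \\
\widehat{V}_{\text{svy}} &= \frac{M}{M-1}\; D \Bigl(\sum_{j=1}^{M} u_j^{\top} u_j\Bigr) D \\
\frac{\widehat{V}_{\text{cluster}}}{\widehat{V}_{\text{svy}}} &= \frac{n-1}{n-k} = \frac{73}{71} \approx 1.028169
\end{align*}
```

The standard errors reported in post #1 fit the same pattern: \((72.48/71.48)^2\) is about 1.028, with \(n = 74\) and \(k = 3\) there as well.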

Last edited by Steve Samuels; 20 Jan 2016, 17:46.

Steve Samuels
Statistical Consulting

Stata 14.2
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#3

20 Jan 2016, 17:58

Corrections:
1) In the first paragraph of my answer above, \(\dfrac{n}{n-k}\) should be \(\dfrac{n-1}{n-k}\)

2) Also, the Manual entry refers to a multiplier \(\dfrac{n}{n-k}\) to make (what look like) the same two calculations match. I can't account for the discrepancy.
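
One possible reconciliation, offered as an assumption rather than as a reading of the manual page: if the manual's comparison concerns the non-clustered robust case, then regress with vce(robust) would scale by \(n/(n-k)\), while the survey estimator, treating each observation as its own sampling unit, would scale by \(n/(n-1)\). Under that assumption the ratio of the two variances is the same as in the clustered case:

```latex
\[
\frac{n/(n-k)}{n/(n-1)} = \frac{n-1}{n-k}
\]
```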

Last edited by Steve Samuels; 20 Jan 2016, 18:49.

Dung Le

Join Date: May 2018

Posts: 120
#4

21 May 2018, 09:37

Hi all,

I have a question about the robust option and the svy prefix: why don't we need to specify robust when we use svy commands? (Stata does not allow the robust option with the svy prefix.) If we do not use the robust option, does the survey estimator still account for heteroskedasticity in the OLS regression?

Thank you.
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#5

22 May 2018, 16:37

All of the methods for variance estimation with svy commands are already robust; hence no further robust option is necessary.
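
A minimal sketch of this point, assuming the auto data and the gear_ratio weights used earlier in the thread. Declaring each observation as its own sampling unit with svyset _n is my choice for illustration, not something from the posts above. If the finite-sample multipliers are as I understand them, the two variances should differ only by the factor (n-1)/(n-k), which would indicate that the svy variance is the same robust estimator up to scaling.

```stata
sysuse auto, clear

* heteroskedasticity-robust regression with probability weights
regress price mpg weight [pw = gear_ratio], vce(robust)
scalar v1 = _se[mpg]^2
scalar n  = e(N)
scalar k  = e(rank)   // number of predictors, including the intercept

* same weights declared through svyset; _n makes each observation its own PSU
svyset _n [pw = gear_ratio]
svy : regress price mpg weight
scalar v2 = _se[mpg]^2

* compare the variance ratio with the hypothesized multiplier
display "v1/v2       = " v1/v2
display "(n-1)/(n-k) = " (n-1)/(n-k)
```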

Last edited by Steve Samuels; 22 May 2018, 16:46.
