Small # Clusters Calculating SEs with Survey Data

Jamie Daw

Join Date: May 2018

Posts: 7
#1

18 May 2018, 12:24

I am running a difference-in-differences analysis to estimate the association between a state-level policy and a set of outcomes using individual-level survey data. To calculate standard errors, I want to be able to account for survey weights, the state-level implementation of the policy (clustering by state), and the fact that I have a small # of clusters (15 states total). I've tried a couple of approaches and am having trouble combining commands in STATA to do what I think is necessary.

Note: The analysis is on a subgroup of the total survey sample. The variable 'sample' is an indicator for whether an individual is included in the study sample.

1) SEs calculated using svy (according to instructions provided by the CDC):
svyset _n [pweight=wt], strata(sud_nest) fpc(totcnt)
global did postXtreat i.state i.yy
global covariates age race etc.

svy, subpop(sample): reg outcome $did $covariates
2) SEs calculated using reg, cluster with weights:

reg outcome $did $covariates if sample==1 [pw=wt], cluster(state)

- This gives larger SEs than option #1 as expected with clustering
- However, one concern I have is using an "if" statement which is not recommended for subsetting survey data (and can result in incorrect/often inflated errors)
3) Wild cluster bootstrap SEs (to address small # cluster problem) ***can't implement this****
clustse reg outcome $did $covariates if sample==1 [pw=wt], cluster(state) method(wild) reps(500)
- Multiple errors occur: no "if" allowed, no weights allowed, factor-variable operators not allowed
I'd appreciate any advice on this problem more generally and STATA commands that would help me in this situation. One road I haven't gone down yet is randomization inference - would also appreciate if there are any helpful packages out there to implement this in a svy context.
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#2

18 May 2018, 19:59

Welcome to Statalist, Jamie!

That is a very strange svyset. It implies that the sampling design was a simple random sample of the population in each stratum. Would you please provide a link to the survey design and to the CDC instructions? I'll leave comments on your model to others.

And before you post further, please read FAQ12 on how to write good questions. In particular, put commands, results, and data listings between CODE delimiters, described in the FAQ. By the way, correct spelling is "Stata" not "STATA". See the last FAQ and this recent thread.

Steve Samuels
Statistical Consulting
Stata 14.2

Jamie Daw

Join Date: May 2018
Posts: 7

21 May 2018, 09:49

Hi Steve,

Thanks for your comment and sorry for not following good forum practice! Below, I provide an update with (what I hope is) proper formatting and results.

Regarding the svyset:
The CDC instructions for setting up the survey in Stata are here.
More information on the survey design is available here.

Code:

svyset _n [pweight=wtanal], strata(sud_nest) fpc(totcnt) 
global covariates i.mat_age_coll i.race i.hisp_coll i.marr_coll i.ed_coll c.unemploy i.mm_dob c.inc_a
global did postXtreat i.state_fips i.yy

*1)SEs calculated using svy (according to instructions provided by the CDC):

svy, subpop(sample): reg outcome $did $covariates
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =       450                  Number of obs     =    125,450
Number of PSUs     =   125,450                  Population size   =  5,162,528
                                                Subpop. no. obs   =    125,450
                                                Subpop. size      =  5,162,528
                                                Design df         =    125,000
                                                F(  52, 124949)   =     355.86
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2042

-----------------------------------------------------------------------------------
                  |             Linearized
          outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
       postXtreat |    .005467    .008434     0.65   0.517    -.0110636    .0219975
......(not showing all results for brevity)

*2)SEs calculated using reg, cluster with weights:

reg outcome $did $covariates if sample==1 [pw=wtanal], cluster(state_fips)
(sum of wgt is 5,162,527.54472)

Linear regression                               Number of obs     =    125,450
                                                F(13, 14)         =          .
                                                Prob > F          =          .
                                                R-squared         =     0.2042
                                                Root MSE          =     .36017

                                 (Std. Err. adjusted for 15 clusters in state_fips)
-----------------------------------------------------------------------------------
                  |               Robust
          outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
       postXtreat |    .005467    .021519     0.25   0.803    -.0406866    .0516206
......(not showing all results for brevity)
 
 *This gives larger SEs than option #1 as expected with clustering
 *However, one concern I have is using an "if" statement which is not recommended for subsetting survey data (and can result in incorrect/often inflated errors)

*3) Wild cluster bootstrap SEs (to address small # cluster problem) ***can't implement this****
         
clustse reg outcome $did $covariates if sample==1 [pw=wtanal], cluster(state_fips) method(wild) reps(500)
factor-variable operators not allowed
r(101);


Steve Samuels

Join Date: Mar 2014

Posts: 1786
#4

22 May 2018, 15:22

Thanks for the design information. I understand the svyset statement now: CDC takes a systematic sample of birth certificates in each state from lists provided by the state. For the purpose of estimating standard errors, these are treated as simple random samples.

Choice of analysis: don't use a svy analysis. States are the units for the policy application, and so you must cluster by state. I'm not expert enough in these analyses to comment on which command is best. Cameron and Miller's 2015 article "A Practitioner's Guide to Cluster-Robust Inference" states that clustering on the 50 states is common.

I will point out that your DID model is not correctly formulated. You have only the postXtreat interaction term: that is one term plus the constant to estimate the means for the four intervention-year combinations. You need to add the main effects also. Use factor variable notation:

Code:

regress outcome post##treat // (+ covariates)
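A minimal sketch of how that specification might be combined with the weights and state clustering used earlier in the thread. It assumes 0/1 indicators named post and treat exist in the data; those names come from the line above and do not appear in Jamie's posted variable list.

```stata
* Sketch only: post and treat are assumed to be 0/1 indicators in the data.
* The ## operator expands to both main effects plus their interaction.
regress outcome i.post##i.treat $covariates if sample==1 [pw=wtanal], vce(cluster state_fips)

* The difference-in-differences estimate is the interaction coefficient.
lincom 1.post#1.treat
```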

Because "State" is the unit of policy application, state-level covariates might be important. For those, like region, that are constant over the study period, you might consider interactions with the post & treatment variables. This would permit conclusions such as "The intervention was more effective in eastern states" or "..in states with well-funded prenatal programs prior to the intervention". State-level factors that change over time might be included as main effects.

James Buszkiewicz

Join Date: May 2018

Posts: 1
#5

22 May 2018, 18:38

I actually had the same question about using the svyset information from the National Health Interview Survey (the variance estimation guide can be found here):

Code:

svyset [pweight=wtfa], strata(strat_p) psu(psu_p)

Like Jamie, I am examining a policy effect using a difference-in-difference-in-differences model with a binomial outcome, with adjustment for individual- and state-level covariates, state and year fixed effects, and state linear time trends. I cannot post exact code, but my model is something like:

Code:

svy, subpop(sample): logit outcome c.exposure##i.treatment i.state i.year i.state#c.year state_covars individual_covars

I would also like to take into account the serial correlation within states over time by clustering on state id, but Stata does not allow this. Mr. Samuels, you recommend not accounting for the survey sampling and design weights and instead suggest clustering by state using robust standard errors, is that correct? Is this commonly done? I see a number of articles that use this same method and clustering, but it appears that it is incompatible with using the "svyset" and "svy" commands... so I imagine they must be taking an approach similar to the one you suggest... is that correct?

(this is my first post to Statalist so I apologize if I did not follow proper forum etiquette)

Many thanks in advance,
James
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#6

22 May 2018, 19:38

Welcome to Statalist, James! Thanks for putting your code between code delimiters.

I'm far from expert in the econometric literature of cluster robust estimates, but yes, it is done, according to the 2015 Cameron and Miller paper that I referenced. On page 18:

Applied economists routinely use data from complex surveys, controlling for clustering by using a cluster-robust variance matrix estimate. At the minimum one should cluster at the level of the primary sampling unit, though often there is reason to cluster at a broader level, such as clustering on state if regressors and errors are correlated within state.

I think that a policy applied at the state level is reason to cluster on state. Note that two-way clustering on state and year is also possible. Adding state fixed effects to your svy: logit model adjusts for state differences, but standard errors are still driven by within-state, between-PSU variation, rather than by variation between-states. I look at it this way: the analysis which clusters by state is akin to a randomized cluster design, with individuals sampled within clusters. In that design, inference is based on between-cluster variation, not on variation within-clusters induced by the sampling design.
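As a rough illustration of clustering on state outside the svy framework, James's model could be fit with the sampling weight and a cluster-robust variance. This is a sketch, not a recommendation: state_covars and individual_covars are the placeholders from his post, and restricting with "if sample==1" is an assumption about how the analysis sample would be selected.

```stata
* Sketch only: state_covars and individual_covars are placeholders from the post above.
* pweights keep the weighted point estimates; vce(cluster state) bases the
* standard errors on between-state variation.
logit outcome c.exposure##i.treatment i.state i.year i.state#c.year ///
    state_covars individual_covars ///
    if sample==1 [pw=wtfa], vce(cluster state)
```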

I've designed and analyzed many survey samples, but have no experience with cluster-robust sampling outside of the survey context. I think I've reached or, more likely, exceeded the limits of my understanding. I'll stop here, but welcome correction by any of the econometricians on the List.

Last edited by Steve Samuels; 22 May 2018, 19:56.

Jamie Daw

Join Date: May 2018

Posts: 7
#7

01 Jun 2018, 13:14

James, I am planning to go ahead and cluster by state ID (option #2 in my code posted above).

Steve, thanks for your helpful comments on this. It sounds like you also don't have concerns about using an "if" statement to define the subsample rather than the svy, subpop() command?

I only have a subset of 15 states in my sample so I am still worried about the small # clusters. I have yet to be able to figure out how to implement a wild cluster bootstrap (which is recommended with small # clusters) with survey weights. Will post again if I figure it out.
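One possible route, offered as an assumption rather than a tested solution: the community-contributed boottest package (available from SSC) runs a wild cluster bootstrap as a postestimation command after regress, so the "if" qualifier, weights, and factor variables are handled by regress itself. Whether it honors pweights in this exact setup should be confirmed in its help file before relying on the results.

```stata
* Assumption: the community-contributed boottest package is installed.
* ssc install boottest

set seed 12345

regress outcome $did $covariates if sample==1 [pw=wtanal], vce(cluster state_fips)

* Wild cluster bootstrap test of H0: coefficient on postXtreat = 0.
* Webb six-point weights are often suggested when clusters are few.
boottest postXtreat, reps(999) weighttype(webb) nograph
```

The output would be a bootstrap p-value and confidence set for postXtreat, to set beside the conventional cluster-robust t test with 14 degrees of freedom.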
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#8

01 Jun 2018, 15:11

The subpop() option applies to analysis of a subgroup of units in a random sample. But I don't think that's your situation. (How did you choose the states?)

I do recommend that you and James try both approaches (states as strata, states as clusters) to see what difference it makes in the standard errors and test results--the coefficients should be identical.
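A sketch of that comparison using the commands and variable names Jamie posted above; only the stored-estimate names (svy_design, cl_state) are made up for illustration.

```stata
* Design-based analysis (states as strata), as in option 1 above
svyset _n [pweight=wtanal], strata(sud_nest) fpc(totcnt)
svy, subpop(sample): regress outcome $did $covariates
estimates store svy_design

* Cluster-robust analysis (states as clusters), as in option 2 above
regress outcome $did $covariates if sample==1 [pw=wtanal], vce(cluster state_fips)
estimates store cl_state

* Side-by-side coefficients and standard errors for the policy term
estimates table svy_design cl_state, keep(postXtreat) b(%9.6f) se(%9.6f)
```

In the output Jamie posted, both approaches give a coefficient of .005467, with a standard error of .008434 under svy and .021519 when clustering on state.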

If you want to quantify variation between states, compared to between-person variation, you could designate state as the highest level in a svy: mixed analysis (give them a nominal pweight of 1), then have psu_p and person as the lower levels. See the section on Survey data in the manual entry. I don't know how inference would differ from the cluster robust analysis.
