Proper survey weight specification

Casey Durand

Join Date: Oct 2014

Posts: 7
#1


09 Oct 2014, 11:21

Hi folks,

I am trying to determine the best way to incorporate survey weights into a logistic regression model. I had posted a similar question on Cross Validated a few weeks ago (http://go.uth.edu/mlmw). I'd like to restate my earlier question and see if anyone here has thoughts on the matter.

I am analyzing data from a household travel survey. There are three levels here: household, person and trip. After a household is selected, all persons in the household are asked to complete a travel diary, in which they record all trips taken over a 24-hour period. Survey weights are provided at all three levels. My dependent variable is at the trip level, while my independent variables come from the household, person and trip levels.

There are several issues that I am having a hard time with. The first is that the trip weight is in fact just the person weight (which itself is a product of the household weight and a person-level raking weight), multiplied by a "trip correction factor", which is basically a numerical value to correct for potential under- or over-reporting of trips. My concern is that if I just ran a -logit- model with the final trip weight, I would not be accounting for the clustering of multiple trips within each person. With the way the trip weight is constructed, is this true? Since my dependent variable is travel mode choice (e.g. walk, bike, drive) of each trip, you would expect it to be highly correlated across trips within person, and failure to account for this would be problematic. I have also looked into doing a weighted, multilevel model, with the thought that I could account for the clustering of trips that way, but discovered that Stata does not allow weights in -melogit- like it does with -mixed-. I have also tried doing an unweighted three-level model using -melogit-, but the model never converged (which would seem to indicate that a weighted version would not converge either, even if it were possible).

So given the above, is there a way to incorporate both the weights and the clustering of trips within person?

Thanks,

Casey
Tags: None
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#2

09 Oct 2014, 18:52

You are right that merely weighting the data will not account for clustering. However, for survey data, the relevant clustering will not be by family, but rather by "primary sampling units" (PSUs), the highest-level sampled units in the survey design. It is variation between PSUs which determines standard errors. Accordingly, unless you are specifically interested in modeling family/person random effects, a multi-level analysis is unnecessary. Just svyset your data with the design information and do svy: logit. svyset also takes a strata() specification, which can reduce standard errors. All the information about PSU and strata should be in the study documents.
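
A minimal sketch of that workflow, assuming hypothetical variable names (psuid, stratum, tripwgt, drive, hhsize, age, triplength) that would have to be replaced with whatever the study files actually supply:

Code:

* declare the design: PSU, sampling weight and strata (all names hypothetical)
svyset psuid [pweight = tripwgt], strata(stratum)

* design-based logistic regression; drive is a hypothetical 0/1 trip-level outcome
svy: logit drive hhsize age triplength

With the design declared this way, the standard errors reflect variation between PSUs, so trips from the same person or family are not treated as independent.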

I don't believe that Stata has a way of doing a three-level logistic model which accounts for the survey design, including the sampling weights. The HLM program is often mentioned as an option, but I'm unfamiliar with it. I also don't know much about choice models or how they can be adapted to a survey design, so I hope that others will chime in.

Last edited by Steve Samuels; 09 Oct 2014, 19:20.

Steve Samuels
Statistical Consulting

Stata 14.2
Comment
Casey Durand

Join Date: Oct 2014

Posts: 7
#3

10 Oct 2014, 11:59

Thanks Steve. This raises another issue for me. I am used to working with data sets that supply very clear pweights, PSUs and sometimes strata. However, this particular data set only provides final sampling weights at the household, person, and trip levels. In 500 pages of study documentation, there is no mention at all of PSUs. My guess is that it would be at the household level, but I can't say for sure. I thought I must be missing something or they were labeled oddly, but then I went ahead and -svyset- the data using only the pweight. Looking at basic descriptives, like number of vehicles in a home, or mode splits (i.e. percent of all trips taken by car, bus, walking, biking, etc), I am able to exactly replicate the weighted values provided in the study's summary report. This holds for data at all three levels. This is all to say that I don't think I am missing/overlooking any relevant design variables. Are you familiar with how design features you might expect to have to explicitly specify, like PSUs and strata, are incorporated in the final pweight when that is the only thing provided? And then given this structure, and the fact that my dependent variable is at the trip (lowest) level, while my predictors come from all three levels, which of the three pweights would I specify?
Comment
Isaac Maddow-Zimet

Join Date: Apr 2014

Posts: 70
#4

10 Oct 2014, 12:52

Hey Casey,

My understanding is that incorrectly specifying the PSUs (or really any aspect of the survey design except the weight variable) won't affect the point estimates -- only the standard errors. So the fact that you are matching descriptive statistics in the published report doesn't mean anything except that you are specifying the weight correctly (unless you are able to match their standard errors or confidence intervals as well).
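
One way to see this is a small sketch using Stata's nhanes2 example data rather than the travel survey. The variable names psu, strata, finalwgt and bpsystol are those of that example dataset as I recall them, not anything from Casey's data:

Code:

webuse nhanes2, clear

* weights only: no PSU or strata declared
svyset [pweight = finalwgt]
svy: mean bpsystol

* full design: same weights, plus PSU and strata
svyset psu [pweight = finalwgt], strata(strata)
svy: mean bpsystol

The estimated mean should be identical in the two runs, since it depends only on the weights; only the standard error and confidence interval should change.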

I will leave it to more informed commenters as to what you should be setting as your PSU or strata - though from what you say, it does seem like the PSU is the household.

Hope this helps!
Isaac
Comment
Casey Durand

Join Date: Oct 2014

Posts: 7
#5

10 Oct 2014, 14:40

Thanks Isaac. I am able to match the confidence intervals using only the final weights.
Comment
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#6

12 Oct 2014, 13:35

PSU and stratum identification are not part of the computation of probability weights. It's possible that households were sampled from address-list frames, as was done in the 2010-2012 California Household Travel Survey (http://www.dot.ca.gov/hq/tsip/FinalReport.pdf). In that case, sampling is single-stage but multi-frame. That survey did have geographical/ethnic sampling strata, but omission of strata would not affect standard errors much, especially if the stratum information appears as covariates for regression equations.

If sampling was from address-based lists, then family is the proper PSU. So, if you use

Code:

svyset family [pw = ]

and substitute the appropriate weight for your analysis (trip weight for analysis of trips as unit), you can use any command that accepts a svy: prefix. You can find these by typing

Code:

help survey estimation

You can estimate parameters for random effects in a weighted multi-level model with the contributed package gllamm (Generalised linear latent and mixed models).
The manual can be found at http://www.gllamm.org/docum.html, and findit in Stata and Google will lead you to many additional references. gllamm allows you to specify a cluster variable, which in your case, should be "family". I believe, but have not checked, that this can be the same as the highest level in a multi-level model.
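
A rough sketch of what such a gllamm call might look like, under several assumptions: the variable names (drive, hhsize, age, triplength, person, family) are hypothetical, and the level-specific weights wt1, wt2 and wt3 would have to be constructed from the pieces of the final weight, since pweight() takes a stub and expects one weight variable per level.

Code:

* ssc install gllamm   // contributed package, installed once

* hypothetical level-specific weights matching the pweight(wt) stub:
*   wt1 = trip weight conditional on person (the trip correction factor)
*   wt2 = person weight conditional on household (the raking weight)
*   wt3 = household weight
gllamm drive hhsize age triplength, i(person family) ///
    family(binomial) link(logit) pweight(wt) cluster(family) adapt

Here i(person family) lists the grouping variables from the lowest to the highest level above trips. Estimates from weighted multilevel models can be sensitive to how the lower-level weights are scaled, so this is a starting point to check against the gllamm manual rather than a finished specification.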

Steve Samuels
Statistical Consulting

Stata 14.2
Comment
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#7

14 Oct 2014, 18:04

Casey Durand wrote: "I am able to match the confidence intervals using only the final weights."

In that case, the published report is incorrect for any plausible sampling design. What did the report say about the design?

Steve Samuels
Statistical Consulting

Stata 14.2
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29971
#8

14 Oct 2014, 18:41

I have seen public-use survey data sets where only the weights were provided, and the strata and psu's withheld for "confidentiality/privacy" reasons. While I am skeptical that releasing strata and psu, numerically coded, would really materially breach anybody's privacy, withholding them is not an uncommon practice in some agencies. As Steve Samuels notes, it really means that correct inferential analyses cannot be done, so I'm not sure what the point of releasing such data sets is, but they do it anyway.
Comment
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#9

14 Oct 2014, 19:44

A poster to Statalist last year encountered this problem with data distributed by the Gallup organization (http://www.stata.com/statalist/archi.../msg00166.html). So, apparently, did employees of the Gallup organization, who published bounds of error that ignored the survey design (http://www.gallup.com/poll/161675/re...donesians.aspx).

Last edited by Steve Samuels; 14 Oct 2014, 20:02.

Steve Samuels
Statistical Consulting

Stata 14.2
Comment
Casey Durand

Join Date: Oct 2014

Posts: 7
#10

16 Oct 2014, 09:53

Steve, you correctly guessed the data set I am using. It is the 2010-2012 California Household Travel Survey, so the sampling is from a list of addresses with 30 strata across the state. It is interesting how little difference the specification of the PSU (assuming it is in fact household) and/or strata makes to the standard errors. For example (sampno is the household identifier):

Code:

svyset sampno [pweight= exphhwgt]

produces exactly the same standard error as

Code:

svyset [pweight= exphhwgt]

when issuing the following estimation command

Code:

svy, subpop(if hhsiz==2): mean htrips

Finally, specifying the following

Code:

svyset sampno [pweight= exphhwgt], strata(strata)

yields differences only at around the fourth decimal place when using the same estimation command. This result appears to generalize beyond this specific example.
Comment
Casey Durand

Join Date: Oct 2014

Posts: 7
#11

16 Oct 2014, 10:47

Actually, you can probably disregard the above example. While it is all true for household level variables, the addition of the PSU (households) does change the standard errors when looking at individual-level variables (using the person-level weight of course), which I think would be consistent with what you would expect.
Comment
