Poisson Regression with Robust Variance in National Survey Data

Diana Rodriguez

Join Date: Jun 2014

Posts: 2
#1


24 Jun 2014, 09:12

This post concerns estimating relative risk with "glm" for common outcomes in cohort studies, as described in the UCLA Statistical Consulting Group FAQ linked below. According to that FAQ, you can use Poisson regression with robust error variance to obtain the relative risk in datasets where the outcome of interest is NOT rare. Poisson regression estimates the prevalence ratios of interest directly, whereas the odds ratio can overestimate the risk ratio when the outcome of interest is common (Behrens et al., 2004).
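To make the overestimation concrete, here is a small arithmetic sketch. The risks are hypothetical, chosen so that the risk ratio is exactly 2 in both scenarios; only the baseline risk changes.

```stata
* Hypothetical risks: the exposed risk is twice the baseline risk in both scenarios
foreach p0 in 0.01 0.30 {
    local p1 = 2 * `p0'
    local rr = `p1' / `p0'
    local or = (`p1' / (1 - `p1')) / (`p0' / (1 - `p0'))
    display "baseline risk = " %4.2f `p0' "   RR = " %4.2f `rr' "   OR = " %4.2f `or'
}
```

With the rare outcome the odds ratio stays close to the risk ratio; with the common outcome it moves well away from it.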

I am currently working on a new project that uses data from the National Health Interview Survey to examine the relationship between diabetes and certain risk factors. To account for the complex survey sampling and sample weights, I declare the design with the "svyset" command and then run my estimation commands under the "svy" prefix with the "subpop()" option. Since my outcome, diabetes, is not rare in my sample population, I would like to use Poisson regression with robust error variance to obtain the relative risk.

Unfortunately, some standard options are not allowed with the "svy" prefix, with "vce(robust)" being one of them. I cannot set "vce" to "(robust)" because "svy" is already using the variance estimation and sampling weights identified by "svyset". I was wondering if anyone knew of a way to obtain the relative risk when using national survey data and sample weights.

Should I use regular logistic regression to obtain the odds ratios? Should I continue with Poisson regression without setting the "vce" to "(robust)" and use this to obtain the relative risk? Or should I use another completely different approach? I have been looking through the literature and most studies simply use regular logistic regression. Any advice would be greatly appreciated.

Stata FAQ: How can I estimate relative risk using glm for common outcomes in cohort studies?

http://www.ats.ucla.edu
Richard Williams

Join Date: Apr 2014

Posts: 5008
#2

24 Jun 2014, 09:43

I think "robust" is inherent in svy, at least when using the default options. p. 6 of the svy reference manual says

Stata’s suite of survey data commands is governed by the svy prefix command; see [SVY] svy and
[SVY] svy estimation. svy runs the supplied estimation command while accounting for the survey
design characteristics in the point estimates and variance estimation method. The available
variance estimation methods are balanced repeated replication (BRR), the bootstrap, the jackknife,
successive difference replication, and first-order Taylor linearization. By default, svy computes
standard errors by using the linearized variance estimator— so called because it is based on a
first-order Taylor series linear approximation (Wolter 2007). In the nonsurvey context, we refer to
this variance estimator as the robust variance estimator, otherwise known in Stata as the
Huber/White/sandwich estimator; see robust.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2174
#3

24 Jun 2014, 16:23

Richard is correct. The variance matrix estimator with the svy prefix is always of the sandwich form, which accounts for the survey sampling and is automatically robust to violations of the distributional assumption (Poisson in this case).
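A minimal sketch of what this looks like in practice, using simulated data as a stand-in for a real survey. All variable names (stratum, psu, pw, adult, obese, diabetes) and the design itself are assumptions made for illustration, not the NHIS layout.

```stata
* Simulated stand-in for a stratified, clustered survey (hypothetical names)
clear
set seed 2014
set obs 2000
generate stratum = ceil(_n/200)               // 10 strata
generate psu     = ceil(_n/50)                // 4 PSUs within each stratum
generate double pw = 100 + 400*runiform()     // sampling weights
generate byte adult = runiform() < 0.80       // subpopulation indicator
generate byte obese = runiform() < 0.35
generate byte diabetes = runiform() < cond(obese, 0.30, 0.15)

* Declare the design once, then let svy supply the linearized (sandwich) variance
svyset psu [pweight = pw], strata(stratum)
svy, subpop(adult): poisson diabetes i.obese, irr
scalar b_svy = _b[1.obese]

* For comparison only: weighted Poisson with cluster-robust SEs on the same cases
poisson diabetes i.obese if adult [pweight = pw], vce(cluster psu) irr
scalar b_pw = _b[1.obese]

display "RR via svy = " %6.4f exp(b_svy) "   RR via pweight = " %6.4f exp(b_pw)
```

The two fits should give the same point estimate; the column labelled IRR is read as a risk (prevalence) ratio here because the outcome is 0/1. The standard errors can differ because only svy uses the stratification and treats the subpopulation properly, so the svy version is the one to report.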
Andrew Lover

Join Date: Apr 2014

Posts: 182
#4

26 Aug 2014, 08:45

Hi Diana,

It sounds like you have your analysis well under control, but just FYI there's some debate about the 'best' alternatives to logistic depending on how common your outcome is:

http://www.biomedcentral.com/1471-2288/8/9
http://www.biomedcentral.com/1471-2288/3/21/

http://www.stata.com/statalist/archi.../msg00755.html

cheers

__________________________________________________ __
Assistant Professor, Department of Biostatistics and Epidemiology
School of Public Health and Health Sciences
University of Massachusetts- Amherst
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#5

26 Aug 2014, 11:04

Sorry to come late to the party; I see the original question has already been answered. Indulge me while I digress on the rationale for the question.

I doubt I will ever understand why it is so widely held that risk ratios are the "right" way to measure strength of association, and odds ratios are somehow a kludge, acceptable only when the baseline risk is rare enough that the OR and RR are approximately equal. (And I understand even less why my epidemiologist colleagues are among the most vehement advocates of this view.)

While the meaning of a risk ratio is perhaps more intuitively clear at first glance, once accustomed to working with odds ratios, they are easy to grasp. Moreover, odds ratios are in some ways a more natural metric. Whereas risk ratios are necessarily bounded from above by the reciprocal of the baseline risk, odds ratios always range between 0 and infinity. Furthermore odds ratios arise naturally in many situations such as predicting group membership from an observed variable that has (different) normal distributions in two groups.
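A quick arithmetic illustration of the bound, assuming a hypothetical baseline risk of 0.40: however large the odds ratio gets, the risk ratio cannot exceed 1/0.40.

```stata
* Hypothetical baseline risk; the odds ratios are arbitrary illustrative values
scalar p0 = 0.40
scalar rr_max = 1/p0
foreach or in 2 10 100 {
    scalar odds1 = `or' * p0/(1 - p0)     // odds in the exposed group
    scalar p1 = odds1/(1 + odds1)         // back to a risk
    display "OR = " %3.0f `or' "   exposed risk = " %5.3f p1 "   RR = " %5.3f p1/p0
}
display "upper bound on RR = " %4.2f rr_max
```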

When all is said and done, odds ratios and risk ratios are two, somewhat related, ways of quantifying associations between variables. Each has its own strengths and its own drawbacks. And I do not see any reason why either should be considered the ideal, and the other a cheap substitute we must sometimes settle for.
Andrew Lover

Join Date: Apr 2014

Posts: 182
#6

26 Aug 2014, 17:45

OK, I'll bite- I hope Diana doesn't mind us hijacking the thread for pedagogical banter!

I think there are two main reasons that ORs are looked upon with suspicion; one is practical, and the other statistical.

1.) A major consideration in public health and epi is the need to convey research results to non-specialists- the medical community, general public, and policy makers. Statalist posters and bookies aside, odds and especially odds ratios have very little intuitive relationship to the things people really care about- the risk or probability of an event occurring.

2.) Stratification is the bread-and-butter of epidemiology, but ORs can display the highly unfortunate property of non-collapsibility across strata: the overall observed OR is not the weighted average of the stratum-specific ratios, so ORs in all strata can be > the overall OR. As addressing confounding is perhaps THE major issue in designing epi studies, knowingly introducing a metric with similar features seems like asking for trouble.
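A small numeric sketch of non-collapsibility, with made-up risks. Exposure prevalence is assumed to be 50% in both strata and the strata are assumed to be the same size, so the stratifying variable is not a confounder.

```stata
* Hypothetical risks (unexposed -> exposed): stratum 1 0.20 -> 0.50, stratum 2 0.50 -> 0.80
scalar or1 = (0.50/0.50) / (0.20/0.80)
scalar or2 = (0.80/0.20) / (0.50/0.50)
scalar rr1 = 0.50/0.20
scalar rr2 = 0.80/0.50

* Equal-sized strata, so the marginal risks are simple averages
scalar r0 = (0.20 + 0.50)/2
scalar r1 = (0.50 + 0.80)/2
scalar or_m = (r1/(1 - r1)) / (r0/(1 - r0))
scalar rr_m = r1/r0

display "stratum ORs = " %5.3f or1 ", " %5.3f or2 "   marginal OR = " %5.3f or_m
display "stratum RRs = " %5.3f rr1 ", " %5.3f rr2 "   marginal RR = " %5.3f rr_m
```

Both stratum-specific ORs are equal, yet the marginal OR is smaller than either, with no confounding involved. The marginal RR, by contrast, lies between the two stratum-specific RRs.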

Sander Greenland has a slew of articles on this topic, FWIW.

__________________________________________________ __
Assistant Professor, Department of Biostatistics and Epidemiology
School of Public Health and Health Sciences
University of Massachusetts- Amherst
Nazzarena

Join Date: Aug 2014

Posts: 60
#7

22 Sep 2014, 16:30

I apologize in advance, I am also somewhat off topic, but the post prompted me to ask questions re: glm(poisson) vs poisson commands.
The command glm in this example is used on long-form cohort data as opposed to count or count-time data.
Would one be able to use poisson proper this way? Can one think about this data as composed of very thin "strata" with a count of 1? Would one have to convert to count data? Covariates then refer to "count strata", not to the individual subject. While this may be (at least in theory) trivial with categorical covariates, what does one do with the continuous ones?
Also, what about time at risk? This example refers to RR, but it concerns specifically prevalence ratios: could this procedure be used to estimate IRR? Where is the equivalent of the exposure option that accounts for person-time?
Thanks!

N
Andrew Lover

Join Date: Apr 2014

Posts: 182
#8

22 Sep 2014, 22:17

Your confusion is understandable- the literature is a bit of a mess in this area. The discussion above relates to a fairly specialized model where -poisson- or -glm fam(poisson)- with robust errors has been adapted to analyze a binary outcome without consideration of time-at-risk; these are the same general models as -binreg- with different parameterization. The original author calls these 'Modified Poisson' models (see doi: 10.1093/aje/kwh090), others use "robust Poisson" (as in the BMC links above), and these can provide either risk ratios or prevalence ratios.
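As a rough sketch of the equivalence, here are both commands on simulated binary data (variable names and risks are hypothetical). With a single binary covariate, the exponentiated coefficient should reproduce the crude risk ratio from the 2x2 table.

```stata
* Simulated cohort with a common binary outcome (hypothetical names and risks)
clear
set seed 12345
set obs 1000
generate byte exposed = runiform() < 0.5
generate byte y = runiform() < cond(exposed, 0.45, 0.30)

* "Modified Poisson": Poisson on a 0/1 outcome with robust standard errors
poisson y i.exposed, vce(robust) irr
scalar b_pois = _b[1.exposed]

* The same model through -glm-
glm y i.exposed, family(poisson) link(log) vce(robust) eform
scalar b_glm = _b[1.exposed]

* Crude risk ratio computed directly from the group risks
summarize y if exposed, meanonly
scalar risk1 = r(mean)
summarize y if !exposed, meanonly
scalar rr_crude = risk1 / r(mean)

display "RR poisson = " %6.4f exp(b_pois) "   RR glm = " %6.4f exp(b_glm) "   crude RR = " %6.4f rr_crude
```

Stata labels the exponentiated coefficient "IRR" either way; with a binary outcome and no exposure term it is read as a risk or prevalence ratio.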

A further wrinkle is using Poisson models to analyze survival data that have been split at failure times (see -stsplit-), collapsed, and analyzed as Poisson models (excellent tutorial on Paul Dickman's website, http://www.pauldickman.com/survival/labs.pdf); these are asymptotically equivalent to Cox PH models, and so explicitly consider time-at-risk.

All that said, if you have counts or rates (counts/time) from a Poisson (or -nbreg-) model then you've got IRRs, and all of the above is likely not relevant to your analysis.

__________________________________________________ __
Assistant Professor, Department of Biostatistics and Epidemiology
School of Public Health and Health Sciences
University of Massachusetts- Amherst
Nazzarena

Join Date: Aug 2014

Posts: 60
#9

23 Sep 2014, 11:17

Thank you very much for your swift and courteous answer. I surmise it is indeed not, for absent an option to specify exposure you cannot use glm(poisson) to estimate IRR. My other, perhaps even less pertinent, issue is that collapsing individual data into count-time strata is 1) not trivial with a lot of covariates, i.e., very thin strata with maybe zero counts, and 2) impossible with continuous covariates unless you take some sort of summary measure for each stratum, but then you misestimate the effect of the covariate?

For example, suppose there actually is time at risk in the example above (a pseudorandom number of years, let's say between 0.7 and 3.8):

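* load the UCLA example data and add a pseudorandom time at risk (0.7 to 3.8 years)
use http://www.ats.ucla.edu/stat/stata/faq/eyestudy, clear
set seed 1093
generate double time = (3.8-0.7)*runiform() + 0.7
summarize time
* per stratum: total person-time, number of events (sum of the 0/1 outcome), mean latitude
collapse (sum) time lenses (mean) meanlat=latitude, by(gender carrot)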

Would one then fit -poisson lenses carrot gender meanlat, exposure(time) irr- or some such?
Could one fit a glm(poisson) instead, and what exposure/offset options are available?

thanks and pardon list etiquette breaches.
Nazzarena

Join Date: Aug 2014

Posts: 60
#10

23 Sep 2014, 14:25

My apologies, glm can indeed take the very same "exposure" option; I found a snippet of code that uses it. So if
poisson lenses carrot gender meanlat, exposure(time) vce(robust) irr
were correct, then
glm lenses ib1.carrot ib2.gender meanlat, exposure(time) fam(poisson) link(log) nolog vce(robust) eform
would be correct also. Except either or both are wrong, and it's not (just) because of the continuous covariate.

Moreover, the depvar in this usage of glm-fam(poisson) needs to be a count, whereas in the UCLA "modified poisson" example it is a binary outcome.
Sorry, I understand glm is most flexible, I just can't wrap my head around how the function can work with two radically different arguments.

Thanks again
Andrew Lover

Join Date: Apr 2014

Posts: 182
#11

23 Sep 2014, 18:46

Two parts to that- one is that -glm- is by definition a general framework for models that can handle a wide variety of link functions. Secondly, a 0/1 outcome is "Poisson-like" (or perhaps a very truncated Poisson), so the -robust- option deals with the deviation from an ideal Poisson process.
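To see what the -robust- option is doing, here is a sketch on simulated binary data (names and risks are hypothetical). The point estimate does not change; only the standard error does, because a 0/1 outcome has less variance than a Poisson variable with the same mean.

```stata
* Simulated binary outcome (hypothetical names and risks)
clear
set seed 2718
set obs 1000
generate byte exposed = runiform() < 0.5
generate byte y = runiform() < cond(exposed, 0.45, 0.30)

* Model-based SEs assume variance = mean, which overstates the spread of a 0/1 outcome
poisson y i.exposed
scalar b_model  = _b[1.exposed]
scalar se_model = _se[1.exposed]

* Sandwich SEs drop that assumption
poisson y i.exposed, vce(robust)
scalar b_robust  = _b[1.exposed]
scalar se_robust = _se[1.exposed]

display "log RR = " %6.4f b_robust "   model SE = " %6.4f se_model "   robust SE = " %6.4f se_robust
```

In this setup the robust standard error should come out smaller than the model-based one, so skipping -robust- gives intervals that are too wide rather than too narrow.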

Also, good to read the FAQ and change your username to your real name, as is the norm here on Statalist.

Last edited by Andrew Lover; 23 Sep 2014, 19:00.

__________________________________________________ __
Assistant Professor, Department of Biostatistics and Epidemiology
School of Public Health and Health Sciences
University of Massachusetts- Amherst