svy versus [pw=...] for regression analysis

Lina Massou

Join Date: Jul 2015

Posts: 41
#1

18 Nov 2015, 05:50

Hello everybody,
I have data from household budgets and the only available weight is a pweight. I run regression analysis and I see that running this command

regress HEALTHRT i.MB02 i.AGEGROUP2 [pweight = HA10] if HA02==2008

has different results compared to these

svyset [pw=HA10]
regress HEALTHRT i.MB02 i.AGEGROUP2 if HA02==2008

in the coefficients. The p-values also differ, not in significance but in their values.
Which method is more appropriate?
I saw a previous post but they talked about subpopulation and I didn't get it.

Many thanks,
Lina
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#2

18 Nov 2015, 06:07

Neither is correct if you have data from a multistage sample, so please describe the sampling design in detail. Was a different sample taken in each year? Quote from the study documentation or link to it, and we'll then construct the correct svyset statement and advise about subpop().
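As a rough illustration of the subpop() point, here is a minimal sketch on the auto dataset that ships with Stata. Using rep78 as a probability weight and foreign as the subpopulation indicator are assumptions made only for this example; they say nothing about the actual survey design. With the poster's data, the indicator would presumably be a 0/1 variable marking the year of interest.

```stata
* Illustration only: rep78 plays the role of a sampling weight and
* foreign (0/1) the role of the subpopulation; both are stand-ins.
sysuse auto, clear
drop if missing(rep78)
svyset [pweight = rep78]

* Restricting with -if- drops the other observations before estimation
svy: regress price weight mpg if foreign == 1
matrix b_if = e(b)

* subpop() keeps every observation in the design for variance estimation
svy, subpop(foreign): regress price weight mpg
matrix b_sub = e(b)

* The point estimates agree; any difference is in the standard errors
display mreldif(b_if, b_sub)
```

Whether the distinction matters in practice depends on the design, for example on whether each year was sampled separately.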

In addition, we can't see the difference you are talking about. Please follow the directions of FAQ 12, which, in part, are to 1) show not only commands but all the results from those commands, and 2) put commands and results between CODE delimiters, which are explained there. As it is, your second statement had no svy: prefix and ignored the (possibly incorrect) svyset statement.

Last edited by Steve Samuels; 18 Nov 2015, 06:38.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Lina Massou

Join Date: Jul 2015

Posts: 41
#3

18 Nov 2015, 06:55

Thanks for this. Well, my dataset has the expenditures on health and goods since 2008.

I have a variable (HA10) that gives the sample weight and it is the only information that I have about the sampling design.
HEALTHRT is the expenditure on health care, MB02 the gender of the household head, AGEGROUP2 the age group to which he or she belongs, MARRITAL the marital status, education the education level, tworegions a dummy variable for 2 regions that I created from the dataset, ME01_AGGREGATED the SES, Household_size the household size, QUI_RT_CONSUM the quintile of consumption, and HA02 the year of the survey. I want to analyze the dataset year by year, from 2008 to 2014.
Typing

Code:

svyset [pw=HA10]
regress HEALTHRT i.MB02 i.AGEGROUP2 i.MARRITAL i.education i.tworegions i.ME01_AGGREGATED i.Household_size i.QUI_RT_CONSUM if HA02==2013

I get the results shown in pic1.
Whereas typing

Code:

regress HEALTHRT i.MB02 i.AGEGROUP2 i.MARRITAL i.education i.tworegions i.ME01_AGGREGATED i.Household_size i.QUI_RT_CONSUM [pweight = HA10] if HA02==2013

I get the results shown in pic2.

Could you help me please?

2 Photos
Richard Williams

Join Date: Apr 2014

Posts: 4994
#4

18 Nov 2015, 07:14

Just svysetting the data isn't enough. You have to use the svy: prefix on commands, e.g.

Code:

svy: reg y x1 x2

I think that will resolve the discrepancies. If not, post your output again. It is better to use code tags when doing so. See pt 12 of the FAQ.
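To make the point concrete, here is a minimal sketch on the auto dataset that ships with Stata. Treating rep78 as a probability weight is an assumption made purely for illustration. It compares the three variants discussed in this thread: pweights given directly to regress, plain regress after svyset, and svy: regress.

```stata
* Illustration only: rep78 is a stand-in for a sampling weight.
sysuse auto, clear
drop if missing(rep78)

* 1) pweights given directly to regress
regress price weight mpg [pweight = rep78]
matrix b_pw = e(b)

* 2) svyset, then regress WITHOUT the svy: prefix:
*    the svyset settings are ignored and the fit is unweighted
svyset [pweight = rep78]
regress price weight mpg
matrix b_plain = e(b)

* 3) svyset, then regress WITH the svy: prefix
svy: regress price weight mpg
matrix b_svy = e(b)

* Relative differences in the coefficient vectors
display "pweight vs svy:   " mreldif(b_pw, b_svy)
display "pweight vs plain: " mreldif(b_pw, b_plain)
```

The coefficients from 1) and 3) should agree. Standard errors can still differ slightly between the two, since the commands do not use the same finite-sample adjustment or degrees of freedom.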

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#5

18 Nov 2015, 08:31

Thanks for using CODE delimiters, but FAQ 12 asked that you put results, not just commands, between CODE delimiters. It also asks that you not post photos. They block the screen and, in your case, are too fuzzy to be readable.

Do you want to quote standard errors, confidence intervals, or p-values? Do you plan on comparing different years? Different countries? If the answer is "yes" to any of these questions, knowing something of the design is crucial.

Your naming of the HA10 and HA02 variables was enough to identify your survey as an EU Household Budget Survey (HBS). I found some overall description in these two documents (I didn't look for more).

2008:

http://ec.europa.eu/eurostat/documen...2-db5d330f27aa

2010:

http://ec.europa.eu/eurostat/documen...4-757d4342015f

These indicate that the design varied from country to country, but you should be able to pick out _your_ country (if you are studying a single country) and get a description that way. See Annex 2 (p. 31+) of the 2008 document and Appendix 2 (p. 52+) of the 2010 document. Also of some importance is the "effective" sample size in Table 2, page 8 of the 2010 document. So report, verbatim, what those sections say about your country.

Last edited by Steve Samuels; 18 Nov 2015, 08:42.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Lina Massou

Join Date: Jul 2015

Posts: 41
#6

18 Nov 2015, 08:55

Dear friends,

First of all, please accept my apologies for my mistake with the attached photos, but I didn't know how to add the results between CODE delimiters. Do you mean copy-paste? I'm really sorry for this.
Next, Mr Williams was right: the results are the same when using svy: before the regression. As for the HBS, thank you for the material, but I couldn't find how to use it in my regression analysis.

Many thanks for your support and your immediate response.

Best,

Lina
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17711
#7

18 Nov 2015, 09:18

Lina:
as far as code delimiters are concerned, here is an excerpt from FAQ #12 (which I recommend you read):

Stata code (i.e. the exact commands issued) is very much easier to read if presented as such. Click on the “Toggle Advanced Editor” button (an underlined A) in the area above where you enter text for posts and, in the menu that appears, click on the # button to insert [CODE] and [/CODE] mark-up. Write your code between, paying particular attention to linebreaks and indentation. Or just insert those mark-ups manually before, or indeed after, you insert your code.

Kind regards,
Carlo
(Stata 19.0)
Lina Massou

Join Date: Jul 2015

Posts: 41
#8

18 Nov 2015, 10:45

thank you!
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#9

18 Nov 2015, 11:14

I asked you to share the documentation so that I might advise you on what to do. Without the information I requested, I can only say that standard errors, confidence intervals, and p-values from svy: regress may all be invalid.

Last edited by Steve Samuels; 18 Nov 2015, 11:30.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Andrew Lover

Join Date: Apr 2014

Posts: 182
#10

19 Nov 2015, 01:45

The best practical discussion I've found on correctly specifying -svyset- is here:

http://siteresources.worldbank.org/I...quityFINAL.pdf

However, there are cases where the survey documentation is quite sparse, making it very challenging to completely specify the sampling.

__________________________________________________ __
Assistant Professor, Department of Biostatistics and Epidemiology
School of Public Health and Health Sciences
University of Massachusetts- Amherst
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#11

19 Nov 2015, 06:08

In the absence of PSU and Stratum identifiers, I would use the 2010 DEFFs to modify e(V). It's only a rough fix for standard errors, but probably better than doing nothing.

Code:

cap program drop _all
program define udeff, eclass
    args deff
    matrix b = e(b)
    matrix V = `deff'*e(V)
    ereturn post b V
end

sysuse auto, clear
svyset _n [pw = rep78]
svy: reg price weight mpg
di _se[weight]
test weight

/* Apply DEFF = 2 */
udeff 2
di _se[weight]
test weight
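Multiplying e(V) by a DEFF is the same as multiplying each standard error by sqrt(DEFF). The sketch below does that arithmetic by hand for a single coefficient. DEFF = 2 is a made-up value, rep78 is only a stand-in weight, and referring the adjusted t statistic to the design degrees of freedom in e(df_r) is an assumption of this sketch rather than part of the program above.

```stata
* Illustration only: DEFF = 2 is hypothetical; rep78 is a stand-in weight.
sysuse auto, clear
svyset _n [pweight = rep78]
svy: regress price weight mpg

scalar deff   = 2
scalar se_raw = _se[weight]
scalar se_adj = se_raw * sqrt(deff)      // variance scaled by DEFF
scalar t_adj  = _b[weight] / se_adj      // t statistic with the inflated SE
scalar p_adj  = 2 * ttail(e(df_r), abs(t_adj))

display "unadjusted SE = " se_raw
display "adjusted SE   = " se_adj
display "adjusted p    = " p_adj
```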

Last edited by Steve Samuels; 19 Nov 2015, 06:21.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Lina Massou

Join Date: Jul 2015

Posts: 41
#12

19 Nov 2015, 09:39

Thank you for your help.
As for the documentation of the weights, here is the guidance from the authority that carried out the survey:

"For the estimation of the survey characteristics, the characteristics of each individual, as well as those of each household, were multiplied by a reducing coefficient created as the product of these 3 factors:
1. the inverse of the probability of selecting the individual (this is equal to the inverse of the probability of selecting the specific household)
2. the inverse of the household response rate in the stratum
3. a correction coefficient."

Is this helpful? I find it even more complicated.

What happens with the heteroskedasticity if I use svy or pweights in my regression?
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#13

20 Nov 2015, 18:30

It is complicated, but it is not what I asked for. I asked for a description of the sampling design in your country and linked to specific places where you might find it. (There might also be one in the document you quoted.) The regressions you've done so far assume that households were selected at random from a list. That's not necessarily true, as many studies select geographic areas first, then households within those areas. This is called multistage sampling. The larger areas are "clusters" of households, known as "primary sampling units" or PSUs, and it is variation between PSUs that is the primary determinant of standard errors. If your survey had multiple stages, but PSUs are not identified in your data, then you cannot write down a correct svyset statement and will have standard errors that are much too small. That is why it is important that you report the sampling design pertinent to your country, especially the "Design Effect" (DEFF) on p. 52 of the 2010 document. You can look up the definition of the DEFF in the Stata survey manual.
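The effect of ignoring PSUs can be sketched with simulated data. Everything below is hypothetical: the 40 PSUs, the 25 households per PSU, the variable names (psu, x, y, w), and the model. It is not a description of any real survey. The sketch fits the same regression twice, once as if households were sampled independently and once declaring the PSUs in svyset.

```stata
* Simulated two-stage sample; all names and numbers are made up.
clear
set seed 12345
set obs 40                          // 40 hypothetical PSUs
generate psu  = _n
generate u    = rnormal()           // PSU-level error shared within a PSU
generate xpsu = rnormal()           // PSU-level part of the covariate
expand 25                           // 25 households per PSU
generate x = xpsu + rnormal()
generate y = 1 + 0.5*x + u + rnormal()
generate w = 1                      // equal weights, for simplicity

* (a) Treat households as if sampled independently
svyset [pweight = w]
svy: regress y x
matrix b_srs  = e(b)
scalar se_srs = _se[x]

* (b) Declare the PSUs
svyset psu [pweight = w]
svy: regress y x
matrix b_clu  = e(b)
scalar se_clu = _se[x]

* Same coefficients; compare the standard errors and their squared ratio
display "SE ignoring PSUs  = " se_srs
display "SE declaring PSUs = " se_clu
display "variance ratio    = " (se_clu/se_srs)^2
```

The squared ratio of the two standard errors plays the role of a design effect for that coefficient in this artificial setting.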

To answer your question about heteroskedasticity: survey regression works with heteroskedastic data. The standard error computation does not assume a constant residual SD.

So I'm now curious: what is the country, or what are the countries, you are studying?

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2