Svyset for pooled cross section data

Daniel Rodríguez Guio

Join Date: Sep 2014

Posts: 8
#1

26 Sep 2014, 15:56

Hi all!

I’m trying to specify the characteristics of my data set with the command svyset, but I’m not completely sure I’m doing it correctly, and I would really appreciate it if you could guide me a bit. I’m working with an independent monthly household survey. The survey collects data on individuals, and for each date (year and month) I have an individual expansion factor: one person may represent 100 other people, another may represent 150, and so on.

I have data from 2008 to 2013. I am trying to use them as a pooled cross-section data set, and I want to find the effect on job-market variables of a law extending the weeks of maternity leave, which was implemented in July 2011. I use only women, with a treatment group (women 20-30 years old) and a control group (women 40-50).

I’ve been using this code:
egen monthXyear = group(month year)
svyset monthXyear [pw=fxp], strata(year)
Where fxp is the variable containing the expansion factor for each observation.

And my estimation is something like this:
xi: svy : reg DEPENDENT_VAR law2011##fertile CONTROLS i.month i.year
Where law2011 is a dummy that takes 1 after the law was implemented and fertile is a dummy that takes 1 for the treatment group.

I was wondering if the specification reflects what my data set is, and if the regression is well written to get the effect and solve the problem with the standard errors. I was also wondering if it is better to use svy bootstrap for my type of data.

Thanks for your help.
Stephen Jenkins

Join Date: Apr 2014

Posts: 1433
#2

27 Sep 2014, 03:56

I assume you are using Stata 13 (see Forum FAQ). Whatever else you do, remove the xi: prefix. It is redundant. Modern factor variable notation makes it so. In addition, please explain in greater detail why you think you should be using a svy set-up. (And please confirm that you have repeated cross-section surveys, not a true panel. The word "independent" is not conventional.) From what you write, it appears that you have a variation on what economists would call a differences-in-differences design, and the issue that can come up is: how should I account for potential clustering? In your case, this would be by year, I think. If I've interpreted you correctly, an appropriate cluster option would be what you'd use, not svy. Or year fixed effects.
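A minimal sketch of what that advice could look like in practice, assuming the variable names from post #1 (law2011, fertile, month, year, fxp). Here depvar and age are hypothetical placeholders for the outcome and a control, and clustering on year is only the suggestion made above, not a settled choice:

```stata
* Sketch only: depvar and age are placeholder names, not variables from the survey.
* Factor-variable notation (i. and ##) replaces the old xi: prefix.
regress depvar i.law2011##i.fertile age i.month i.year [pw = fxp], vce(cluster year)

* The difference-in-differences estimate is the interaction coefficient
lincom 1.law2011#1.fertile
```

With only six years there would be very few clusters, so standard errors from this set-up should be read with caution.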
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#3

27 Sep 2014, 13:22

Your post indicates some basic misunderstandings of survey concepts like primary sampling unit and strata. Like Stephen, I doubt that the surveys were "independent"; this would mean that a new survey of the entire population was drawn each month. Of one thing I am quite sure: year is not the primary sampling unit (PSU), nor is the monthXyear variable. Moreover, the choice of a control group of women 40-50 for an intervention group of women 20-30 is wrong on substantive grounds: the fertility and pregnancy patterns in older women are very different from those of younger ones. Women of the same age should be compared before and after the intervention, the difference-in-differences analysis that Stephen refers to. More basically, I urge you to study a survey-sampling text (I like Sharon Lohr, 2009, Sampling: Design & Analysis) and the manual entry for svyset. A survey of this type is likely to have study documents that describe the survey design and that give guidance; these documents will in many instances be online. When you repost, describe the survey as those documents do and provide a link, if available.

Multi-year and panel surveys present difficulties for even experienced analysts. I suggest you learn how to correctly set-up and analyze the data from one survey year before proceeding further.

Steve Samuels
Statistical Consulting

Stata 14.2
Daniel Rodríguez Guio

Join Date: Sep 2014

Posts: 8
#4

06 Oct 2014, 07:26

Thank you both for your replies and, please, excuse me for the delay of my post.

About why I think I should use svyset: I have a pooled cross-section data set that comes from a survey providing monthly information about 13 cities in Colombia. I have the exact same survey for every month from January 2008 to September 2013, so I have 13 PSUs over 57 months. I think that if I do not specify this in Stata, I would have about 741 PSUs, which would give me smaller standard errors and p-values.

On the other hand, thank you, Mr. Samuels, for all your observations. I had clearly misunderstood what a PSU is. I have now read the methodology and the design of the survey. The PSU they take is a city or metropolitan area; the SSU is a block that contains about 12 segments (each segment consists of 10 households); and the TSU is a segment in which all households, and all people within each household, are surveyed.

Mr. Samuels, I have seen some of your other posts about this issue, and, based on them, I wonder whether this command could be right.

Code:

svyset city [pw=fxp1], strata (date)

Where fxp1 is the expansion factor divided by the 57 months.
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#5

06 Oct 2014, 18:33

Well, we are a little closer, but you still haven't given enough detail.

I have some questions about block selection; please answer exactly. I'll have further questions that depend on your answers to the current ones.

1. Were the same 13 cities/metropolitan areas surveyed in all survey years?

Taking just one survey year and one city/metropolitan area (you choose, but tell us their identities):

2. How many blocks were there in the city?

3. How many were selected for the sample in that year?

4. What was the method of selection?

5. How many were studied in each month?

6. If the answers to 3 & 5 are different, describe exactly how the blocks were allocated to months.

7. Was a different sample of blocks selected from that city in other years?

8. Was the same sample of blocks used in more than one year?

9. Could individual blocks appear in other years?

10. Was the process of block selection similar in other years and other cities?

If you can, attach a copy of the study document that describes the design (see the symbol like a paper with a paper clip to the left of the "A" in the header for your post), or provide a link (hit the A and use the symbol that looks like a chain link).

Daniel Rodríguez Guio

Join Date: Sep 2014

Posts: 8
#6

07 Oct 2014, 13:18

Yes, the 13 cities were surveyed in all survey years.

According to the data provider, they take a fixed set of blocks, chosen by systematic sampling, which are to be sampled each year. From these fixed blocks, by a random sampling method, they survey just one segment, which means about 10 households. That process is the same in every city in every year.

For example, if we take Bogotá, they say the city has about 40,000 blocks, of which 230 are sampled, again systematically selected. So we could say that the selected blocks are indeed all studied, and what changes is the segment surveyed each month, which is selected randomly.

As the blocks are fixed, I could say that a different sample of blocks was not selected for any city in each year, and that those blocks were used in more than one year.

Regarding the document, it is in Spanish, but here it is: https://www.dane.gov.co/files/invest...gia_GEIH13.pdf
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#7

07 Oct 2014, 17:12

Your description of the sampling doesn't correspond to what the study document says (in a Google translation). It is true that 13 major cities, with their metropolitan areas, have been studied continuously since 2000 (page 6, pp. 16-17). However, by 2006 the study had been enlarged to include 11 other cities and a sample of rural areas (pp. 16-17). Is your study limited to the 13 original major cities, or does it include observations from other localities? If you are including others, please be specific about which they are, e.g. every locality, or just the 11 cities.

Last edited by Steve Samuels; 07 Oct 2014, 17:50.

Daniel Rodríguez Guio

Join Date: Sep 2014

Posts: 8
#8

08 Oct 2014, 08:51

Yes, I have the data for the 13 major cities (they do not include observations from other localities) from 2008 to 2013, and my study attempts to analyze what happens in them. In that sense, I am concerned neither with the changes to the survey in 2006 nor with the survey of rural areas. I have described exactly the data set I have and am working with.

So I was wondering how, given the data set I have, I should specify the survey’s characteristics in order to obtain robust and reliable estimates.
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#9

09 Oct 2014, 07:06

Thanks for that information. Limiting yourself to the 13 cities certainly makes sense for your study question and it greatly simplifies the svyset.

In the context of the whole survey, the 13 cities are self-representing PSUs. But for your study, they become sampling strata. Thus the designation of sampling stages shifts down.

The first units to be selected by random numbers are what the document calls USMs, secondary sampling units, as you said. These are not blocks, but groups of adjacent blocks big enough to be divided into 12 segments (TSUs or tertiary units) of 10 adjacent households. These USMs become the Stage 1 or Primary Sampling Units for your study. The TSUs are formally Stage 2 units and are assigned randomly to months (whether to >1 month per year is not clear from the document).

Your study goals are to test a hypothesis about the new law and to estimate its effect, with adjustment for age. This is an "analytic" or "causal" study, as opposed to a "descriptive" one. Therefore finite population corrections should be omitted. (See: http://www.stata.com/statalist/archi.../msg00075.html .) There is no benefit to specifying further sampling stages, because without an fpc, Stata will ignore later-stage information in the computation of standard errors.

The document is not clear about the persistence of the USMs and segments in different years, but the following svyset statement will cover all needs. Let usm_id identify the USMs.

Code:

egen new_strat = group(city year)
svyset usm_id [pw = fxp], strata(new_strat)

This svyset statement will permit analysis of single years and of one or more individual months, which is necessary because the law took effect in the middle of a calendar year. I don't see a reason to rescale the sampling weights as you did earlier (dividing by 57). Rescaling will not affect any estimates of effect, including, for example, differences in means or slopes. Rescaling could even be incorrect for a monthly analysis: as only about 1/12 of respondents are interviewed each month, their weights are naturally scaled.
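The point about rescaling can be checked on any data set. The following is an illustrative sketch on the auto data shipped with Stata, not on the GEIH survey; treating displacement as a stand-in for an expansion factor is purely an assumption made for the demonstration:

```stata
* Illustration on Stata's shipped auto data, not the GEIH survey.
* displacement is only a stand-in for an expansion factor (an assumption).
sysuse auto, clear
generate double w   = displacement
generate double w57 = w / 57

regress mpg foreign [pw = w]
matrix b1 = e(b)
matrix V1 = e(V)

regress mpg foreign [pw = w57]
matrix b2 = e(b)
matrix V2 = e(V)

* Both relative differences should be zero up to rounding error
display mreldif(b1, b2)
display mreldif(V1, V2)
```

Estimated totals, by contrast, do scale with the weights, so rescaling matters if population totals are ever reported.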

Note that it would be an error to include month in the strata() option. That would imply that a new sample of the first stage units (USMs) was selected each month.
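As a rough sketch of how estimation might proceed once the svyset statement given earlier in this post has been run: depvar is a hypothetical placeholder for the outcome, law2011 and fertile are the dummies from post #1, and the choice of 2011 for the subpopulation is only an example.

```stata
* Assumes the svyset statement above has already been run.
* depvar is a placeholder name for the outcome variable.

* Inspect the declared design: strata and number of PSUs per stratum
svydescribe

* Pooled difference-in-differences regression with design-based standard errors
svy: regress depvar i.law2011##i.fertile i.month i.year

* Single-year analysis: use subpop() rather than dropping observations,
* so that the full design is retained for variance estimation
svy, subpop(if year == 2011): regress depvar i.fertile i.month
```

If svydescribe reports strata with a single USM, the singleunit() option of svyset is one way to handle them.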

Last edited by Steve Samuels; 09 Oct 2014, 07:56.

Daniel Rodríguez Guio

Join Date: Sep 2014

Posts: 8
#10

09 Oct 2014, 10:03

Thank you for all your help!