Svyset specification for pooled cross section data

Daniel Rodríguez Guio

Join Date: Sep 2014
Posts: 8

29 Sep 2014, 15:52

Hi all!

I’m working with a pooled cross-section data set that comes from a survey repeated monthly. I’m trying to declare the design of my data set with the svyset command, but I’m not completely sure I’m doing it correctly, and I’d really appreciate it if you could guide me a bit.

The survey collects data from individuals, and each one has an individual expansion factor: say, one person represents a hundred other people, another might represent 150, and so on.

My data set covers 2008 to 2013. For example, for one year I have something like this:

Obs.	Year	Month	Expansion factor (fxp)	X variables
1	2008	1	152	.
2	2008	1	68	.
3	2008	1	205	.
4	2008	2	120	.
5	2008	2	208	.
6	2008	2	89	.
7	2008	3	97	.
8	2008	3	134	.
…	2008	…	…	…
n-1	2008	12	35	.
n	2008	12	168	.

Here the X variables are the variables that describe each observation (such as sex, age, and city, among others).

I have exactly the same survey for every month from January 2008 to September 2013. I’m attempting to analyze the data as pooled cross sections, and I used the following Statalist post as a guide:

http://www.stata.com/statalist/archi.../msg00521.html

As the post says, because the samples are drawn independently, I specify the year/wave as super-strata. My commands are as follows:

egen monthXyear = group(month year)

svyset monthXyear [pw=fxp], strata(year)

Where fxp represents the expansion factor of each observation.

I was wondering whether this specification reflects the structure of my data set, and whether it is necessary to specify a particular variance estimation method (jackknife, bootstrap, etc.).

Thanks for your help.

Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#2

30 Sep 2014, 02:51

This is (virtually) a duplicate post of http://www.statalist.org/forums/foru...s-section-data . Please continue discussion within that thread. You have not responded to the comments by Steve Samuels and me.
Daniel Rodríguez Guio

Join Date: Sep 2014

Posts: 8
#3

30 Sep 2014, 08:19

Actually it is, and I truly appreciate your replies to both of my posts, Professor Jenkins. I realized my first post was not clear at all, which is why I made a new post: to focus the discussion on my data set and its specification rather than on the regression.

I am setting out the main issue with my data set in the first post so that you can give me your advice.

I would like to thank you for your interest and your patience.
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#4

30 Sep 2014, 09:10

It's unclear why you want to svyset the data sets. Tell us. To follow up on Steve Samuels's point: do you have any survey design information for each of the separate surveys (information about clustering and stratification)? If so, then he was suggesting that you use it. Maybe you don't have the information. Just because you are pooling a number of data sets doesn't necessarily mean that you should be using svyset. Depending on your analysis, you may want to take account of the year/month structure of the data set, but that's a different point. (That's what my previous remarks were about.)

PS please don't copy/paste sample data in the way you did. It's more concise and more legible to post such fragments between CODE delimiters (hit A followed by #): see the Forum FAQ.
Daniel Rodríguez Guio

Join Date: Sep 2014

Posts: 8
#5

30 Sep 2014, 14:19

The main reason I think I should use svyset is that I need to establish that the data for one year are completely independent of the data for any other year, even though the survey is exactly the same each year. I don’t think I should control for time (year) with a fixed effect, because I don’t necessarily have the same respondents, even within a given year; remember that the data are monthly. If I did use a year fixed effect, I would be inflating my sample, since I’d be treating my set of five years as a whole, and even more so once the expansion factor is applied. In that regard, I am especially concerned about the standard errors in my estimation.

Judging from the methodology of the data provider, the data appear to be clustered by the respondent’s city. So I think I could cluster the data by city and stratify them by the monthXyear variable, but I still don’t know whether that is right.

However, if you think it is not necessary to svyset the data, what do you suggest could be the right way to analyze the data?
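
For concreteness, here is a minimal sketch of what clustering by city and stratifying by monthXyear would look like as a declaration. It runs on invented toy data, and it assumes that city really is the primary sampling unit and that a numeric variable named city exists; neither is confirmed by the survey documentation quoted in this thread.

```stata
* Toy data, invented for illustration only (not from the survey).
clear
input year month city fxp
2008 1 1 152
2008 1 1  68
2008 1 2 205
2008 1 2 120
2008 2 1 208
2008 2 1  89
2008 2 2  97
2008 2 2 134
end

* One identifier per survey month (sorted by year, then month).
egen monthXyear = group(year month)

* Assumption: cities are the clusters (PSUs); each monthly sample is a stratum.
svyset city [pw=fxp], strata(monthXyear)

* Report how many strata and PSUs Stata sees under this declaration.
svydescribe
```

The svydescribe output is a quick way to check whether the declared strata and clusters match what the survey documentation describes.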
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#6

30 Sep 2014, 15:09

Sorry, but I find it difficult to understand all this from your explanation. Especially remarks such as those about "increasing sample size". And it remains unclear precisely what design information you have for each of the monthly cross-section surveys (key words in the documentation would be PSU and strata, or the equivalent in the language of the survey). Expansion factors are simply a form of "raking" weights -- if I understand correctly -- which 'gross up' the sample numbers to population totals. They are therefore a form of weight, though different in nature from design weights arising from differential probabilities of selection. In sum, I still think you need to answer Steve Samuels's questions and mine directly.

"what do you suggest could be the right way to analyze the data?"

I've already made some suggestions (referring e.g. to clustering) but my remarks were contingent on the sort of analysis that you wanted to do. Recall my reference to differences-in-differences.
Sorry, but I doubt if I can add more to what I've said so far -- which was inevitably speculative, given the information you provide (or don't provide).
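
To illustrate the 'gross up' idea, here is a small sketch using only the eight 2008 rows shown in the first post. Summing fxp within a month gives the number of people those sample rows stand for. The names represented and respondents are made up for this example, and the month 3 figure covers only the two rows shown, since the rest of the table is elided.

```stata
* The eight 2008 rows shown in the first post (months 1-3 only).
clear
input year month fxp
2008 1 152
2008 1  68
2008 1 205
2008 2 120
2008 2 208
2008 2  89
2008 3  97
2008 3 134
end

* Sum of expansion factors = people represented; count = sample rows.
collapse (sum) represented = fxp (count) respondents = fxp, by(year month)
list, noobs
```

This shows only what the weights add up to. It says nothing about how they were constructed, or about the clustering and stratification of the design.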
muhammad akhtaruzzaman

Join Date: Dec 2018

Posts: 14
#7

27 Feb 2019, 19:59

Hi Stata users,
I have pooled cross-section data: 26,000 randomly sampled SMEs observed at different points in time over 10 years. The SMEs were surveyed every year about the status of their R&D investment and about their sales, productivity, or profitability that year. The responses were along these lines: 10,000 SMEs said "yes", 10,000 said "no", and 6,000 said "don't know". I want to examine whether R&D investment has a strong relationship with productivity or sales. What type of regression model would I need for such data, and how should I convert the Yes, No, and Don't know responses for use in a regression model? I look forward to hearing your feedback. Thanks.
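
On the coding part of the question only, here is one possible sketch, on invented toy data with hypothetical variable names (rd_answer, sales, rd). It keeps "don't know" as its own category, which is one option among several and not a recommendation. The regress line only shows how a coded categorical variable enters a model; which model suits the data is a separate question.

```stata
* Toy data, invented for illustration; variable names are hypothetical.
clear
input str10 rd_answer sales
"yes"        120
"no"          80
"don't know"  95
"yes"        140
"no"          70
end

* Map the text answers to labelled integer codes.
label define rd_lbl 1 "no" 2 "yes" 3 "don't know"
encode rd_answer, generate(rd) label(rd_lbl)

* Factor-variable notation creates indicators, with "no" (code 1) as the base.
regress sales i.rd
```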