Survey data- Pool Cross Sectional data with weights- What to do with weights?

Lars Pete

Join Date: Nov 2020
Posts: 118


29 Nov 2020, 01:11

Dear all,

I have pooled cross-sectional survey data (weakly balanced; 9 different years of wealth and income data for individuals belonging to 4 races, 2 genders, and three categories of education). I have been referring to the Stata documentation on this; there is a whole PDF manual about svy.
What I saw was that when we use svy, we don't need to do any calibration of the variables using the weights given in the data. For instance, we don't need to calculate a weighted mean by hand (like (X1W1+X2W2+...)/(W1+W2+...)). We just specify the survey design using svyset and put "svy:" before any estimation command. The data look somewhat like this: each year has multiple individuals belonging to each race category, and probably there are different individuals in each year, not the same individuals sampled every year. I think this is sampling with replacement (since there is a new set of individuals in each year? or not? how to know?), but I am not sure.

weight	Year	Race	Gender	Age	Income
1678.4	1990	Black	M	55
.2	1990	White	M	25
6546	1990	Black	M	44
151.55	1990	White	F	56
564.55	1991	White	F	60
54.66	1991	White	M	30
1483.08	1991	Black	M	29
452.6	1991	Black	F	48
111.56	1992	White	M	65

My questions are -
1. How to know if our data is sampling with replacement or sampling without replacement?
2. How to know if it is a one stage design or a two stage design? (This is necessary to know for when we specify survey design)
2. Which approach is better for plotting trends in income? Using svyset, doing regressions and marginsplotting OR without svyset, by calculating weighted means and doing line plot OR collapse (p50) income [pw=weight], by(year race)?
3. If we use svyset, how to know the type of weights? aweight, pweight...etc?
4. svyset PSU [pweight=pw], strata(strata)
Is it alright if I use year as strata?
5. What is the primary sampling unit here? I think each individual?

Any help would be greatly appreciated.
Thank you.

Last edited by Lars Pete; 29 Nov 2020, 01:55.


Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#2

29 Nov 2020, 11:31

Questions 1, 2 (the first one), 3, and 5 must be answered by referring to the documentation that accompanies the survey data. If you were simply handed a data set with no documentation, there is no way to answer these crucial questions: you cannot tell by examining the data itself. In that case you will have to contact the source of the data and request the documentation of how the survey was designed.

With regard to question 4, it is possible that year is a stratum in the survey design, but it would be extremely unusual. With very high probability the answer is no. Identifying the proper stratum variable is like the questions I referred to in my first paragraph: you have to get that information from the survey documentation.

Regarding the second question 2, if you are only interested in the mean values, and not in knowing how precise these estimates are, then all three of these approaches should get you the same result. If you need estimates of precision/uncertainty, then only the -svyset- -svy: regress- -margins- approach will do. Frankly, though, the -svyset- -svy:- approach is also the easiest to do, so I would go with that anyway.
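
For illustration, here is a minimal sketch of the -svyset- -svy: regress- -margins- route on simulated data. The variable names psu_id and stratum_id, the design itself (5 strata with 8 PSUs each), and all the simulated values are assumptions made up for this example; the real design variables can only come from the survey documentation.

```stata
* Sketch only, on simulated data. psu_id and stratum_id are hypothetical
* names; the actual design variables must come from the survey documentation.
clear
set seed 12345
set obs 800
generate psu_id     = ceil(_n/20)                  // 40 PSUs of 20 people each
generate stratum_id = ceil(psu_id/8)               // 5 strata, 8 PSUs per stratum
generate year       = 1990 + mod(_n, 4)            // 1990 through 1993
generate race       = 1 + mod(floor((_n-1)/4), 2)  // two race groups, coded 1 and 2
generate weight     = 50 + 500*runiform()          // made-up sampling weights
generate income     = 20000 + 1000*(year-1990) + 5000*race + 8000*rnormal()

* Declare the design once, then prefix estimation commands with svy:
svyset psu_id [pweight=weight], strata(stratum_id)
svy: regress income i.year#i.race

* Design-based mean income for each year-by-race cell, with confidence intervals
margins year#race
marginsplot
```

The point estimates from -margins- here are weighted cell means; what the survey design adds is the standard errors and confidence intervals around them.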
Lars Pete

Join Date: Nov 2020

Posts: 118
#3

29 Nov 2020, 16:41

Hi Clyde,

Thank you for your reply and sorry for the jumble up in numbering the questions. I will try to be very specific here.

Suppose there is no documentation available, and suppose we are trying to plot the trends in median income. Would the svy approach and margins-plotting still be better than collapse (p50) income [pw=weight], by(year race) and margins-plotting? I am getting different results.
_____________________________________

Interestingly, when I do collapse (p50) income [pw=weight], by(year race), Stata gives an error and asks me to specify the weights as aweights instead of pweights.
So I do collapse (p50) income [aw=weight], by(year race), which works!

Then I do:

regress wealth year#race
marginsplot

_____________________________________

In another case,

When I do specify svyset, there is no error with pweights and I am able to do svyset id [pweight=weight], strata(strata) where I am specifying id as the Primary Sampling Unit.

I first do:
bysort year race: egen med_income = median(income)

Then I do:

svy: regress med_income year#race, baselevels
margins year#race
marginsplot

_____________________________________

These two approaches are giving me different results. Are they both correct? Which one is better?
____________________________________

Another question which is less important than the above question but if you could answer. STATA does give me the graphs using marginsplot.
However, if I simply do:
twoway OR graph twoway line wealth year if race==1 || ///
line wealth year if race==2 || ///
line wealth year if race==3 || ///
line wealth year if race==4 ///

STATA doesn't give any error. However, the graph isn't showing up! The graph window doesn't open but it does with margins-plot. I don't know what's going on here. Perhaps you could answer.

Thank you.

Best,
Lars.

Last edited by Lars Pete; 29 Nov 2020, 16:43.
Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#4

29 Nov 2020, 18:53

In #1 you spoke about means, but you have switched to medians. That is a very different kettle of fish in this context.

I do not know why -collapse (p50) income [pw=weight], by(year race)- is not working for you. There is nothing wrong with it. Perhaps you are using an older version of Stata? In some contexts aweights and pweights are interchangeable, but not always and I would not trust it. If your version of Stata does not allow -pweights- with -collapse (p50)- I would not trust a substitute calculated with -aweights-. It might be OK, but perhaps not. I don't know.

Then I do:

regress wealth year#race
marginsplot

This makes no sense to me. Following your -collapse- command you have one observation for each combination of year and race. So your "regression" is basically just copying that one observation from each group. So skip the -regress- and -marginsplot- here and just directly graph the data in whatever way you want. For example, -graph twoway connect wealth year, sort by(race)- or something like that.
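
A minimal sketch of that collapse-then-graph sequence, on simulated data, might look like the following. The data values, the four race codes, and the use of wealth as the outcome are assumptions for illustration only.

```stata
* Sketch only, on simulated data: weighted medians by year and race,
* graphed directly with no regression step in between.
clear
set seed 2020
set obs 1200
generate year   = 1990 + mod(_n, 3)               // 1990 through 1992
generate race   = 1 + mod(floor((_n-1)/3), 4)     // four groups, coded 1 to 4
generate weight = 10 + 90*runiform()              // made-up sampling weights
generate wealth = exp(10 + 0.1*(year-1990) + 0.2*race + rnormal())

* One observation per year-race cell, holding the weighted median of wealth
collapse (p50) wealth [pw=weight], by(year race)

* One panel per race; sort orders the points by year before connecting them
twoway connected wealth year, sort by(race)
```

After -collapse- the data set contains only year, race, and the median, so the graph command plots the collapsed values as they are.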

bysort year race: egen med_income = median(income)

No, you can't do that. -svyset- just records the design information in the data set and makes it accessible to other commands. But other commands do not use that data unless they are told to by the -svy:- prefix. When you are working with data that has been -svyset-, you must use the -svy:- prefix with any command to do calculations. So the medians being calculated there are unweighted medians and they are just plain wrong. Unfortunately, you can't fix that by using the -svy:- prefix here because -egen- does not support the -svy:- prefix. For descriptive statistics with survey data using the -svy:- command you are pretty much limited to proportions and means, not medians or other rank statistics.
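
To see the difference concretely, the sketch below computes the unweighted -egen- median and the weighted -collapse- median side by side on simulated data. The weights are deliberately constructed to be correlated with income (an assumption made only so that the two medians visibly differ); med_unw and med_wt are hypothetical variable names.

```stata
* Sketch only, on simulated data: unweighted egen median versus
* pweighted collapse median within each year-race cell.
clear
set seed 2020
set obs 1000
generate year   = 1990 + mod(_n, 3)
generate race   = 1 + mod(floor((_n-1)/3), 2)
generate income = exp(10 + rnormal())
generate weight = 1 + income/10000        // assumed: weight rises with income

* egen ignores the weights entirely, whether or not the data are svyset
bysort year race: egen med_unw = median(income)

* collapse applies the weights; (first) carries the egen result along
preserve
collapse (p50) med_wt = income (first) med_unw [pw=weight], by(year race)
list, sepby(year)
restore
```

The two columns in the listing can then be compared cell by cell; -preserve- and -restore- leave the individual-level data in memory afterwards.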

And again, even if that approach with -egen- worked, the follow-up with -regress- and -marginsplot- makes no sense.

However, if I simply do:
twoway OR graph twoway line wealth year if race==1 || ///
line wealth year if race==2 || ///
line wealth year if race==3 || ///
line wealth year if race==4 ///

Yes, this is what you should be doing if you want all four lines overlaid in a single panel. Well, with one correction: eliminate the /// at the end of the final line of the command so that your command does not appear to Stata to run on into whatever is on the next line.

I do not know why the graph window is not opening when you try to do that. Maybe it's because of the /// in the last line. See if removing that fixes the problem. Otherwise, I really don't know. In combination with your inability to get -collapse (p50)- to accept -pweights- it makes me wonder if there is something wrong with your Stata installation.
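
For reference, a sketch of the corrected overlay command, run on a small made-up data set with one observation per year and race, is below. The data values and legend labels are assumptions for illustration. Note that /// line continuation works only in a do-file (or ado-file), not when typed into the Command window.

```stata
* Sketch only, on made-up data: four lines overlaid in a single panel.
* Run this from a do-file, since /// is not recognized interactively.
clear
set obs 36
generate year   = 1990 + mod(_n-1, 9)        // 9 years per race group
generate race   = 1 + floor((_n-1)/9)        // four groups, coded 1 to 4
generate wealth = 100 + 10*race + 5*(year-1990)
sort race year

twoway line wealth year if race==1 || ///
       line wealth year if race==2 || ///
       line wealth year if race==3 || ///
       line wealth year if race==4,   ///
       legend(order(1 "Race 1" 2 "Race 2" 3 "Race 3" 4 "Race 4"))
```

The last line of the command carries no ///, so Stata does not treat whatever follows as part of the graph command.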
Lars Pete

Join Date: Nov 2020

Posts: 118
#5

29 Nov 2020, 20:26

Hi Clyde,

Thank you for the detailed reply again. I have realized that I should be more careful in wording my sentences and should be more specific, which I will try to be.

1. My version of Stata is Stata/IC 16.1. I have now updated it, and pweights with collapse do work! And magically, I am getting line plots now with the same commands that I was using before. It's like Stata listened to our interaction and corrected itself!

Interestingly, after collapse regress and margins-plot give the same result as twoway line plot. But as you said, I will do two way line plot as regression doesn't make sense here. Also, just as you said aweights and pweights are interchangeable in STATA, they are giving same values of medians after collapse. But I will use pweights nevertheless.

2. [No, you can't do that. -svyset- just records the design information in the data set and makes it accessible to other commands. But other commands do not use that data unless they are told to by the -svy:- prefix.]

Yes, you are right. I ran bysort year race: egen med_income = median(income) and generated the new variables I needed before specifying the survey design with svyset. After doing bysort, I did:

svyset id [pweight=weight], strata(strata)
svy: regress med_income year#race, baselevels
margins year#race
marginsplot

But then the values of median will completely be different than those calculated by weighting and will not be what they're supposed to be. egen after specifying svyset wouldn't make sense as we have to specify svy after specifying svyset and svy is not compatible with egen.

Just need to know your final opinion, should I conclude that svyset is just not a good option here since I won't get the correct weighted values of median anyways (except with collapse) and that collapse and twoline plot is the only way?

Thank you again.

Best,
Lars.
Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#6

29 Nov 2020, 20:56

Just need to know your final opinion, should I conclude that svyset is just not a good option here since I won't get the correct weighted values of median anyways (except with collapse) and that collapse and twoline plot is the only way?

Well, if all you are going to do with the data is plot these graphs of the medians, then I would agree with that. But if you have other analyses in mind, then -svyset-ing the data and using -svy:- estimation commands is almost surely the way to go for those purposes.

Let me just re-emphasize that while you can get unbiased estimates of medians using only weighting, you cannot calculate valid standard errors, confidence intervals, test statistics, and p-values without also incorporating the information about stratification and sampling units (i.e. precisely the information needed to use -svyset-). If documentation of these design features is not available, you are left with no ability to do analytic work with this data: you can only provide simple descriptive statistics.
Lars Pete

Join Date: Nov 2020

Posts: 118
#7

29 Nov 2020, 21:09

This finally makes it all clear.
Thank you very much.

Best,
Lars.