Sampling N individuals with weights

Luis Afonso

Join Date: Apr 2017

Posts: 2
#1

Sampling N individuals with weights

28 Apr 2017, 13:38

I am using a national household survey. I need to sample this database, by group, so that each sampled group, once expanded by the sample weights, represents the number of people I want.
To make it clearer, suppose that I have two groups, by gender (male and female). There are 400,000 people in the survey, who represent 200,000,000 people in the population.
I need to sample (and keep) X men and Y women in the database such that the X men represent 20,000,000 people when expanded by the sample weights, and the Y women represent 15,000,000. That is, before sampling I don't know the values of X and Y; I only know that the weighted total must equal 20,000,000 for men and 15,000,000 for women.
I tried the command gsample, but it didn't work, because I could only sample N individuals without using the weights.
I also searched older posts for the same question, but I couldn't find anything similar.
Can anybody help me?
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#2

28 Apr 2017, 14:41

I'm not sure I really understand what you want to do. But I take it that each observation in your data set includes a variable identifying the group (gender, a string variable coded "Male" and "Female"), and has some weight associated with it in a variable I'll call wt. You want a random sample of each group such that the total of the weights in one group is 20,000,000 and 15,000,000 in the other. Is that right? If so, you can get pretty close to that with:

Code:

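set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
// two random keys, so that ties in the first are also broken at random
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()
sort gender shuffle1 shuffle2
// cumulative weight within gender, in the shuffled order; stored as double
// because a float cannot hold totals of this size exactly
by gender: gen double running_sum_wt = sum(wt)
keep if running_sum_wt <= cond(gender == "Male", 20000000, 15000000)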

Note: Depending on the actual values of the weights, it may not be possible to get the total of the weights to be exactly 20,000,000 and 15,000,000. This keeps observations in random order until the next one would push the total over the threshold, so each total falls short of its target by less than one observation's weight.
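To see how close the kept totals come to the targets, the same steps can be run on simulated data and followed by a check. This is only a sketch: the data set is made up (400,000 observations with hypothetical weights averaging about 500, so that they expand to roughly 200,000,000), and the variable names gender and wt are the ones assumed above.

```stata
clear
set seed 1234
set obs 400000

// hypothetical data: a string gender variable and weights averaging about 500
gen str6 gender = cond(runiform() < 0.5, "Male", "Female")
gen double wt = 250 + 500*runiform()

// same sampling steps as above
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()
sort gender shuffle1 shuffle2
by gender: gen double running_sum_wt = sum(wt)
keep if running_sum_wt <= cond(gender == "Male", 20000000, 15000000)

// check: weighted total and number of observations kept in each group
tabstat wt, by(gender) statistics(sum n) format(%15.0fc)
```

The sum column of the table is the expanded total for each gender, and the N column gives the number of men and women actually kept, i.e. the X and Y that were unknown before sampling.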

That said, this strikes me as an odd thing to do. Perhaps what you really need is post-stratification weighting? See -help svyset-.
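For comparison, a post-stratification setup could look roughly like the sketch below. It keeps every observation and adjusts the weights instead of dropping anyone. The data are simulated, and the names sex, poptotal and one are hypothetical, introduced only for this illustration; poststrata() and postweight() are the svyset options for the poststratum identifier and the poststratum population sizes.

```stata
clear
set seed 1234
set obs 1000

// hypothetical data, using the same variable names as above
gen str6 gender = cond(runiform() < 0.5, "Male", "Female")
gen double wt = 250 + 500*runiform()

// numeric poststratum identifier and the target population size of each group
encode gender, gen(sex)
gen double poptotal = cond(gender == "Male", 20000000, 15000000)

// declare the design with post-stratification on gender
svyset [pweight = wt], poststrata(sex) postweight(poptotal)

// estimated number of people represented in each group
gen byte one = 1
svy: total one, over(sex)
```

The last command is one way to check whether the adjusted weights reproduce the targets. Whether this is appropriate depends on what the sample is for; it does not produce a subsample of the data.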
Luis Afonso

Join Date: Apr 2017

Posts: 2
#3

28 Apr 2017, 17:24

Thank you for your quick answer.
The solution you provided was exactly what I needed.
Actually, the svyset command doesn't fit in this case.

Thanks!