Weighting subsample to full sample- using STATA-logistic

Dan Su

Join Date: Mar 2017

Posts: 29
#1

08 Mar 2017, 07:42

Hi experts,

My full sample has ~6000 participants and the subsample has ~350. The ~350 participants were chosen from the full sample to undergo an additional medical test. In the full sample, the association between the primary predictor X and the outcome Y is significant (p<0.05) across different adjustment models. In the subsample, however, the association between X and Y is not significant. We think this is because the subsample has different age/race/etc. distributions from the full sample. For example, over 70% of the full sample is Black, compared with almost half of the subsample. So I thought we should "up"-weight the subsample to make it resemble the full sample, and we expected the association between X and Y to then be significant in the subsample.

1. After creating the weights, we included `[pweight=wt]` in the regression models, but the association between X and Y in the subsample is still not significant, even after upweighting. Can you suggest why it is still not significant?

2. Are the weights created correctly (see code below)? I use pweights instead of fweights because we don't know how the subsample was selected. Is `gen wt = ( obspr / _b[obspr] ) * e(N)` correct, or is there anything wrong with my code?

`full.dta` is the full-sample dataset; its `sub` variable indicates whether a participant is in the subsample.
`sub.dta` is the subsample dataset. My Stata code is below. Thanks!

use "1-data\full.dta", clear
keep if sub==1 // sub is the indicator of subsample
save "1-data\sub.dta", replace

use "1-data\full.dta", clear
* predict the probability of being selected into the subsample using logistic regression
logistic sub a b c d
* only covariates a b c d are used as predictors, because the primary predictor X and other covariates
* have too many missing values according to the missing-data patterns from "misstable tree"

predict obspr , p
quietly total obspr // get sum of the probs
gen wt = ( obspr / _b[obspr] ) * e(N)
*_b[obspr] is the sum of obspr, e(N) is subsample size, I think wt=(p/sum of p)*N

*codebook wt if sub // check coverage of the weighting probs (nearly all)
Tags: logistic, sample, stata, weighting, weights
Nick Cox

Join Date: Mar 2014

Posts: 35754
#2

08 Mar 2017, 09:13

Cross-posted from http://stackoverflow.com/questions/4...tic-regression

I did suggest there that you post here, but you are also asked to tell us about cross-posting. http://www.statalist.org/forums/help#crossposting
Dan Su

Join Date: Mar 2017

Posts: 29
#3

08 Mar 2017, 11:41

Thanks for the reminder, Nick!
Richard Williams

Join Date: Apr 2014

Posts: 5011
#4

08 Mar 2017, 13:17

The pweights may get the relative weights (and hence the point estimates) right, but the point remains that the subsample only has 350 cases while the full sample has 6,000 cases. So, a much smaller sample size alone could account for the lack of statistical significance. How do the actual estimates compare between full and subsample? If more or less similar, this could reinforce the idea that differences in sample size are what is critical here, rather than differences in the relationship between X and Y.

Further, you say you don't know how the subsample was selected. So the relationship between X and Y may be different for it than it is for the full sample. Who knows, maybe these cases were selected precisely because X and Y did not seem to be related for them.

I haven't checked your weight coding. But I suspect the above points may be the most critical ones.
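
One way to put the two sets of estimates side by side, as a rough sketch only: `Y`, `X`, and `a b c d` stand in for the actual variable names, and the stored-estimate names `full` and `subsample` are arbitrary labels chosen for illustration.

```stata
use "1-data\full.dta", clear

* full sample
logistic Y X a b c d
estimates store full

* subsample only, unweighted
logistic Y X a b c d if sub == 1
estimates store subsample

* coefficients (log odds) and standard errors side by side
estimates table full subsample, b(%9.3f) se(%9.3f)
```

Similar coefficients with much larger standard errors in the subsample column would point to sample size rather than to a different relationship between X and Y. With 350 cases instead of 6,000, standard errors roughly four times larger would not be surprising, all else being equal.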

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://academicweb.nd.edu/~rwilliam/
Dan Su

Join Date: Mar 2017

Posts: 29
#5

08 Mar 2017, 13:40

Thanks for your reply, Richard. I agree that with such a small sample we cannot expect the same significant results as in the full sample. And yes, the estimates are similar between the full sample and the subsample; only the significance changed. Could you also look at my weight coding? I'm curious whether I coded it correctly. Thanks again!
Richard Williams

Join Date: Apr 2014

Posts: 5011
#6

08 Mar 2017, 13:55

Well, you should know from your full sample what the % black should be, as well as the values of whatever other variables you think you need to adjust for. So, I would do something like

svy: mean black

or maybe something like

svy: tabulate race

or maybe

svy: tabulate race gender

If it doesn't look right, you can go over your code more carefully. And even if your code did look right, you should do something like this to confirm that there isn't some problem you overlooked.
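
A minimal sketch of such a check, under assumptions not confirmed anywhere in this thread: `black` is a 0/1 indicator in `full.dta`, and the weight is taken as the inverse of the predicted selection probability (the usual definition of a probability weight) rather than the formula in #1. The name `ipw` is made up for illustration. Note that `svy:` commands require `svyset` to be run first.

```stata
use "1-data\full.dta", clear

* benchmark: unweighted proportion in the full sample
mean black

* selection model from #1: probability of being in the subsample
logistic sub a b c d
predict double obspr, pr

* assumption: weight = 1 / Pr(selected); ipw is a hypothetical name
generate double ipw = 1/obspr if sub == 1

* keep the subsample, declare the weights, and compare with the benchmark
keep if sub == 1
svyset [pweight = ipw]
svy: mean black
```

If the weighted mean from the last command is close to the full-sample mean, the weights are doing their job for that variable. Two things in the code from #1 may be worth a second look: a weight proportional to `obspr` gives more weight to cases that were more likely to be selected, and after `total` the value of `e(N)` is the number of observations used by `total`, which there is run on the full dataset.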
