Comparing subpopulations using survey data

Shawna Bayerman

Join Date: Jun 2023

Posts: 9
#1

21 Jul 2023, 13:36

Hi all,

I am using NHANES data to compare the use of a dietary aid between two racial-ethnic groups. This is my first time working with survey data. I am trying to figure out the best way to code for my inclusion criteria so that I am still including the weights of other populations in my analysis without actually including those groups.

I have already survey set my data.

In the following code I create a variable that flags Mexican Americans and non-Hispanic Whites (my groups of interest) who are not pregnant and are aged 18 or older.

gen inanalysis=0;
replace inanalysis=1 if ridreth1== 1 | ridreth1== 3;
replace inanalysis=0 if ridageyr<=17;
replace inanalysis=0 if ridexprg==1;

I then want to run a logistic regression of my dichotomous outcome (do you use this dietary aid: yes or no) on race-ethnicity. The problem I am having is that my two groups of interest are both coded as 1, while all other populations are coded as 0.

For example:
svy: logit cbq611 inanalysis;

I then tried to separate my two subpopulations, both aged 18 or older and not pregnant, as below.

gen mex=0;
replace mex=1 if ridreth1== 1;
replace mex=0 if ridageyr<=17;
replace mex=0 if ridexprg==1;
tab mex;

#delimit;
gen nhw=0;
replace nhw=1 if ridreth1== 3;
replace nhw=0 if ridageyr<=17;
replace nhw=0 if ridexprg==1;
tab nhw;

The issue with this is that when I put the Mexican American indicator (mex) into a logistic regression (as below), non-Hispanic Whites and the other racial-ethnic groups are all coded as zero, so I'm not actually comparing the two groups.

svy: logit cbq611 mex;

How can I code for my inclusion criteria (Mexican American or non-Hispanic White, not pregnant, and aged 18 or older) and then compare the two groups in a logistic regression without dropping all of the other racial-ethnic groups and their survey weights?

Thank you all so much.

Last edited by Shawna Bayerman; 21 Jul 2023, 14:05.
Tags: logit, surveydata, syntax

Andrew Musau

Join Date: Oct 2014
Posts: 10260

22 Jul 2023, 08:55

Assuming there is no intersection between the two groups (mex and nhw), you can create a three-level categorical variable. The following is not tested and may contain typos and errors.

Code:

assert !(mex & nhw)
gen ethnicity2= cond(mex, 1, cond(nhw, 2, 3))
label define eth2 1 "Mexican American" 2 "non-Hispanic White" 3 "Other"
label values ethnicity2 eth2
svy: logit cbq611 ib3.ethnicity2
*WALD TEST FOR EQUALITY OF COEFFICIENTS
test 1.ethnicity2=2.ethnicity2
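
The same pairwise contrast can also be reported as an estimate with a standard error, rather than only as a test. The lines below are a minimal sketch of that idea using Stata's example dataset nhanes2d, which ships already svyset; highbp and region are stand-ins for the poster's outcome and ethnicity variable, not part of the original question.

```stata
* Illustration on Stata's example data, not the poster's NHANES extract
webuse nhanes2d, clear

* Fit on the full sample; NE (level 1) is the base category
svy: logit highbp ib1.region

* Difference in log odds between MW (2) and S (3), with a design-based SE
lincom 2.region - 3.region

* The same contrast expressed as an odds ratio
lincom 2.region - 3.region, or
```

With the three-level variable above, the analogous contrast would be 1.ethnicity2 - 2.ethnicity2, assuming ethnicity2 was created as shown.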


Shawna Bayerman

Join Date: Jun 2023

Posts: 9
#3

01 Aug 2023, 12:37

Thank you so much! Can I just clarify for my understanding: when you use the syntax -ib3.ethnicity2-, is this setting the group "Other" as the reference group? I think what I am trying to do is set either Mex or NHW as the reference group and compare one to the other, without factoring in the actual values of the "Other" group but still keeping their weights.

Andrew Musau

Join Date: Oct 2014
Posts: 10260

01 Aug 2023, 13:25

Originally posted by Shawna Bayerman

Can I just clarify for my understanding: when you use the syntax -ib3.ethnicity2-, is this setting the group "other" as the reference group?

Yes.

I think what I am trying to do is set either Mex or NHW as a reference group and compare one to the other while not factoring in actual values of the "other" group but still keep their weights

I am not clear on what you mean by the highlighted part. You can take a subpopulation consisting of Mexican Americans and non-Hispanic Whites and run the estimation. The -svy- settings are still in effect, so you are taking into account the survey weights of this subpopulation. Assume below that my subpopulation is regions 1 and 2 (NE and MW).

Code:

webuse nhanes2d,clear
tab region
gen subpop= inlist(region, 1,2)
svyset
svy, subpop(subpop): logit highbp ib1.region

Res.:

Code:

. tab region

1=NE, 2=MW, |
   3=S, 4=W |      Freq.     Percent        Cum.
------------+-----------------------------------
         NE |      2,096       20.25       20.25
         MW |      2,774       26.80       47.05
          S |      2,853       27.56       74.61
          W |      2,628       25.39      100.00
------------+-----------------------------------
      Total |     10,351      100.00

. 
. gen subpop= inlist(region, 1,2)

. 
. svyset

Sampling weights: finalwgt
             VCE: linearized
     Single unit: missing
        Strata 1: strata
 Sampling unit 1: psu
           FPC 1: <zero>

. 
. svy, subpop(subpop): logit highbp ib1.region
(running logit on estimation sample)

Survey: Logistic regression

Number of strata = 15                             Number of obs   =      4,870
Number of PSUs   = 30                             Population size = 53,401,690
                                                  Subpop. no. obs =      4,870
                                                  Subpop. size    = 53,401,690
                                                  Design df       =         15
                                                  F(1, 15)        =       1.11
                                                  Prob > F        =     0.3079

------------------------------------------------------------------------------
             |             Linearized
      highbp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      region |
         MW  |  -.2070647   .1961878    -1.06   0.308     -.625229    .2110997
       _cons |  -.4229191   .1370355    -3.09   0.008    -.7150033   -.1308349
------------------------------------------------------------------------------
Note: 16 strata omitted because they contain no subpopulation members.
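
Mapped onto the variables in the original question, the same pattern might look like the sketch below. It is untested and rests on several assumptions: the NHANES extract is in memory and already svyset, ridreth1 codes Mexican American as 1 and non-Hispanic White as 3 (as in the question's code), ridexprg==1 marks pregnancy, and the commands are run from a do-file with the default line delimiter. The name inanalysis is simply the indicator name used in the question.

```stata
* Untested sketch: assumes the NHANES data are loaded and already -svyset-

* Subpopulation flag: Mexican American (1) or non-Hispanic White (3),
* aged 18 or older, and not recorded as pregnant.
* Missing ages are excluded explicitly because Stata treats missing as
* larger than any number; missing ridexprg stays in, as in the question.
capture drop inanalysis
generate byte inanalysis = inlist(ridreth1, 1, 3)  ///
    & ridageyr >= 18 & !missing(ridageyr)          ///
    & ridexprg != 1

* No observations are dropped, so the full design is available for
* variance estimation; only subpopulation members enter the coefficients.
* ib3. makes non-Hispanic White the base category.
svy, subpop(inanalysis): logit cbq611 ib3.ridreth1
```

Under these assumptions, the coefficient reported for level 1 of ridreth1 should be the Mexican American versus non-Hispanic White comparison, with the other groups playing no part in it.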


Shawna Bayerman

Join Date: Jun 2023

Posts: 9
#6

03 Aug 2023, 14:06

Thank you for that example. Your example is easier for me to understand because you have two groups and are then applying the survey settings. Maybe I am struggling to understand conceptually. I have a categorical variable that includes 5 racial-ethnic groups, and I am only interested in comparing MA and NHW. With survey data I am not able to drop the other 3 groups because this would affect the survey weighting. So I want to treat the three other groups as if they are a zero, i.e., not include them in my sampling frame, yet still have their weights factored in. My concern is whether assigning the "other" category a number, whether it's 0 or 3, still includes them in the coefficient estimation of a logistic regression. I want the coefficient to represent only the comparison between MA and NHW, not a comparison of MA and NHW to the "other" category.

I think this is where my question about your first code stemmed from. ib3 treats the "other" category as a base - is that the same as treating it as a reference category?

svy: logit cbq611 ib3.ethnicity2
Thank you so much for taking the time to respond to my questions, I truly appreciate your input.

Shawna Bayerman

Join Date: Jun 2023

Posts: 9
#7

03 Aug 2023, 14:33

To further clarify, when I run the code you generously wrote out above, I get an estimate for both Mexican Americans and non-Hispanic Whites, which leads me to believe that I am comparing both of these groups to a third rather than just to one another.

. assert !(mex & nhw);

. gen ethnicity2= cond(mex, 1, cond(nhw, 2, 3));

. label define eth2 1 "Mexican American" 2 "non-Hispanic White" 3 "Other";

. label values ethnicity2 eth2;

. svy: logit cbq611 ib3.ethnicity2;
(running logit on estimation sample)

Survey: Logistic regression

Number of strata =  54                            Number of obs   =      4,652
Number of PSUs   = 109                            Population size = 93,684,969
                                                  Design df       =         55
                                                  F(2, 54)        =       5.58
                                                  Prob > F        =     0.0063

-------------------------------------------------------------------------------------
                    |             Linearized
             cbq611 | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
--------------------+----------------------------------------------------------------
         ethnicity2 |
  Mexican American  |   .2860224   .1223861     2.34   0.023     .0407551    .5312897
non-Hispanic White  |  -.1458525   .0904848    -1.61   0.113    -.3271881    .0354832
                    |
              _cons |  -.5786696   .0714312    -8.10   0.000    -.7218209   -.4355183
-------------------------------------------------------------------------------------

Andrew Musau

Join Date: Oct 2014
Posts: 10260

03 Aug 2023, 14:40

Originally posted by Shawna Bayerman

I want to just have the coefficient represent the comparison between MA and NHW. Not compare MA and NHW to the "other" category.

I think this is where my question about your first code stemmed from. ib3 treats the "other" category as a base - is that the same as treating it as a reference category?

Not exactly. The first choice is whether you should estimate using the full sample or some subsample drawn from the full sample. My understanding is that you want the former. You can set the reference category (also called the base category) to whatever you like, but the coefficients represent pairwise comparisons with this reference group. If both groups to be compared are in the regression, you can use the -test- command to compare them. The result will be the same as setting one of the groups as the reference group.

Code:

webuse nhanes2d,clear
svyset
svy:logit highbp ib1.region
test 2.region=3.region
svy: logit highbp ib2.region

Res.:

Code:

. svy:logit highbp ib1.region
(running logit on estimation sample)

Survey: Logistic regression

Number of strata = 31                            Number of obs   =      10,351
Number of PSUs   = 62                            Population size = 117,157,513
                                                 Design df       =          31
                                                 F(3, 29)        =        0.36
                                                 Prob > F        =      0.7848

------------------------------------------------------------------------------
             |             Linearized
      highbp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      region |
         MW  |  -.2070647   .1961878    -1.06   0.299    -.6071922    .1930629
          S  |  -.1113247   .1764428    -0.63   0.533    -.4711822    .2485327
          W  |  -.1258911    .173949    -0.72   0.475    -.4806625    .2288803
             |
       _cons |  -.4229191   .1370355    -3.09   0.004    -.7024048   -.1434334
------------------------------------------------------------------------------

.
. test 2.region=3.region

Adjusted Wald test

 ( 1)  [highbp]2.region - [highbp]3.region = 0

       F(  1,    31) =    0.29
            Prob > F =    0.5967

.
.
.
. svy: logit highbp ib2.region
(running logit on estimation sample)

Survey: Logistic regression

Number of strata = 31                            Number of obs   =      10,351
Number of PSUs   = 62                            Population size = 117,157,513
                                                 Design df       =          31
                                                 F(3, 29)        =        0.36
                                                 Prob > F        =      0.7848

------------------------------------------------------------------------------
             |             Linearized
      highbp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      region |
         NE  |   .2070647   .1961878     1.06   0.299    -.1930629    .6071922
          S  |   .0957399   .1790649     0.53   0.597    -.2694654    .4609453
          W  |   .0811735   .1766082     0.46   0.649    -.2790213    .4413683
             |
       _cons |  -.6299838   .1403956    -4.49   0.000    -.9163224   -.3436451
------------------------------------------------------------------------------

Here, test reports an F statistic whereas the regression table reports a t statistic. With a single restriction, \(F = t^2\), so \(|t| = \sqrt{F}\): \(\sqrt{0.29} \approx 0.54\), which agrees up to rounding with the t of 0.53 for S in the second regression (the p-values, 0.5967 and 0.597, also match). The main point is that such comparisons are pairwise and do not involve the other groups, even though the regression used the full sample to estimate the coefficients.
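
For anyone who wants to confirm the equivalence without relying on rounded output, a short sketch along the following lines should work. It reuses the nhanes2d example from above; the scalar names F_wald and t_S are made up for this illustration.

```stata
* Check that the Wald F from -test- equals the squared t from re-basing
webuse nhanes2d, clear

* NE as base: test MW (2) against S (3) and keep the F statistic
svy: logit highbp ib1.region
test 2.region = 3.region
scalar F_wald = r(F)

* MW as base: the S coefficient is now the MW-versus-S contrast
svy: logit highbp ib2.region
scalar t_S = _b[3.region] / _se[3.region]

* The two numbers should agree apart from floating-point noise
display "t^2 = " t_S^2 "    F = " F_wald
```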

Last edited by Andrew Musau; 03 Aug 2023, 14:53.
