
  • Comparing subpopulations using survey data

    Hi all,

    I am using NHANES data to compare the use of a dietary aid between two racial-ethnic groups. This is my first time working with survey data. I am trying to figure out the best way to code for my inclusion criteria so that I am still including the weights of other populations in my analysis without actually including those groups.

    I have already survey set my data.

    In the following code I create a variable that flags Mexican Americans and non-Hispanic Whites (my groups of interest) who are not pregnant and are aged 18 or older.

    gen inanalysis=0;
    replace inanalysis=1 if ridreth1== 1 | ridreth1== 3;
    replace inanalysis=0 if ridageyr<=17;
    replace inanalysis=0 if ridexprg==1;

    I then want to run a logistic regression of my dichotomous outcome (do you use this dietary aid: yes or no) on race-ethnicity. The problem I am having is that my groups of interest are both coded as 1 while all other populations are coded as 0.

    For example:
    svy: logit cbq611 inanalysis;

    I then tried to separate my two subpopulations, both aged 18 or older and not pregnant, as below.

    gen mex=0;
    replace mex=1 if ridreth1== 1;
    replace mex=0 if ridageyr<=17;
    replace mex=0 if ridexprg==1;
    tab mex;


    #delimit;
    gen nhw=0;
    replace nhw=1 if ridreth1== 3;
    replace nhw=0 if ridageyr<=17;
    replace nhw=0 if ridexprg==1;
    tab nhw;

    The issue with this is that when I put Mexican Americans (mex) into a logistic regression (as below), non-Hispanic Whites as well as the other racial-ethnic groups are coded as zero, so I'm not actually comparing the two groups.

    svy: logit cbq611 mex;


    How can I code for my inclusion criteria (Mexican American or non-Hispanic White, not pregnant, and aged 18 or older) and subsequently compare the two groups in a logistic regression without dropping all of the other racial-ethnic groups and their survey weights?

    Thank you all so much.
    Last edited by Shawna Bayerman; 21 Jul 2023, 14:05.

  • #2
    Assuming there is no intersection between the two groups (mex and nhw), you can create a 3 level categorical variable. The following is not tested and may contain typos and errors.

    Code:
    assert !(mex & nhw)
    gen ethnicity2= cond(mex, 1, cond(nhw, 2, 3))
    label define eth2 1 "Mexican American" 2 "non-Hispanic White" 3 "Other"
    label values ethnicity2 eth2
    svy: logit cbq611 ib3.ethnicity2
    *WALD TEST FOR EQUALITY OF COEFFICIENTS
    test 1.ethnicity2=2.ethnicity2
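
    As a possible complement to the test above (an untested sketch, not part of the original reply): -test- reports only an F statistic, whereas -lincom- run after the same model should also report the estimated Mexican American versus non-Hispanic White difference with its standard error and confidence interval. The level numbers 1 and 2 are those assigned to ethnicity2 above.

    Code:
    svy: logit cbq611 ib3.ethnicity2
    * difference in log odds: Mexican American (1) minus non-Hispanic White (2)
    lincom 1.ethnicity2 - 2.ethnicity2
    * the same contrast expressed as an odds ratio
    lincom 1.ethnicity2 - 2.ethnicity2, or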



    • #3
      Thank you so much! Can I just clarify for my understanding: when you use the syntax -ib3.ethnicity2-, is this setting the group "other" as the reference group? I think what I am trying to do is set either Mex or NHW as a reference group and compare one to the other while not factoring in actual values of the "other" group but still keeping their weights.





        • #5
          Originally posted by Shawna Bayerman
          Can I just clarify for my understanding. When you use the syntax -ib3.ethnicity2- is this setting the group "other" as the reference group?
          Yes.

          I think what I am trying to do is set either Mex or NHW as a reference group and compare one to the other while not factoring in actual values of the "other" group but still keep their weights
          I am not sure what you mean by the highlighted part. You can take a sub-population consisting of Mexican Americans and non-Hispanic Whites and run the estimation on it. Because the sub-population is specified through the subpop() option rather than by dropping observations, the -svy- settings remain in effect and the survey weights and design are still taken into account. Assume below that my sub-population is region 1 and region 2 (NE and MW).

          Code:
          webuse nhanes2d,clear
          tab region
          gen subpop= inlist(region, 1,2)
          svyset
          svy, subpop(subpop): logit highbp ib1.region
          Res.:

          Code:
          . tab region
          
          1=NE, 2=MW, |
             3=S, 4=W |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                   NE |      2,096       20.25       20.25
                   MW |      2,774       26.80       47.05
                    S |      2,853       27.56       74.61
                    W |      2,628       25.39      100.00
          ------------+-----------------------------------
                Total |     10,351      100.00
          
          . 
          . gen subpop= inlist(region, 1,2)
          
          . 
          . svyset
          
          Sampling weights: finalwgt
                       VCE: linearized
               Single unit: missing
                  Strata 1: strata
           Sampling unit 1: psu
                     FPC 1: <zero>
          
          . 
          . svy, subpop(subpop): logit highbp ib1.region
          (running logit on estimation sample)
          
          Survey: Logistic regression
          
          Number of strata = 15                             Number of obs   =      4,870
          Number of PSUs   = 30                             Population size = 53,401,690
                                                            Subpop. no. obs =      4,870
                                                            Subpop. size    = 53,401,690
                                                            Design df       =         15
                                                            F(1, 15)        =       1.11
                                                            Prob > F        =     0.3079
          
          ------------------------------------------------------------------------------
                       |             Linearized
                highbp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
                region |
                   MW  |  -.2070647   .1961878    -1.06   0.308     -.625229    .2110997
                 _cons |  -.4229191   .1370355    -3.09   0.008    -.7150033   -.1308349
          ------------------------------------------------------------------------------
          Note: 16 strata omitted because they contain no subpopulation members.
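
          Applied to the variables from the original question, the same approach might look like the sketch below. It is untested and written without #delimit ;. The names insub and mexam are hypothetical, introduced only for illustration, and the ridreth1 codes 1 (Mexican American) and 3 (non-Hispanic White) are taken from post #1. As in the original replace statements, a missing value of ridageyr or ridexprg does not exclude an observation.

          Code:
          * 1 = Mexican American or non-Hispanic White, aged 18 or older, not pregnant
          gen byte insub = inlist(ridreth1, 1, 3) & ridageyr >= 18 & ridexprg != 1
          * 1 = Mexican American, 0 otherwise; inside the subpopulation, 0 means non-Hispanic White
          gen byte mexam = (ridreth1 == 1)
          * no observations are dropped, so the full design is available for the standard errors
          svy, subpop(insub): logit cbq611 mexam

          Under these assumptions, the single coefficient on mexam would be the Mexican American versus non-Hispanic White log odds ratio among non-pregnant adults.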



          • #6
            Thank you for that example. Your example is easier for me to understand because you have two groups and are then applying the survey settings. Maybe I am struggling to understand conceptually. I have a categorical variable that includes 5 racial-ethnic groups. I am only interested in comparing MA and NHW. With survey data I am not able to drop the other 3 groups because this would affect the survey weighting. So I want to treat the three other groups as if they were zero, i.e. not include them in my sampling frame, yet still have their weights factored in. My concern is this: if I assign the "other" category a number, whether it's zero or 3, are those groups still included in the coefficient estimation of the logistic regression? I want the coefficient to represent just the comparison between MA and NHW, not a comparison of MA and NHW to the "other" category.

            I think this is where my question about your first code stemmed from. ib3 treats the "other" category as a base - is that the same as treating it as a reference category?

            svy: logit cbq611 ib3.ethnicity2
            Thank you so much for taking the time to respond to my questions, I truly appreciate your input.



            • #7
              To further clarify, when I run the code you generously wrote out above, I get an estimate for both Mexican Americans and Non-Hispanic Whites, which leads me to believe that I am comparing both of these groups to a third rather than comparing them just to one another.



              . assert !(mex & nhw);

              . gen ethnicity2= cond(mex, 1, cond(nhw, 2, 3));

              . label define eth2 1 "Mexican American" 2 "non-Hispanic White" 3 "Other";

              . label values ethnicity2 eth2;

              . svy: logit cbq611 ib3.ethnicity2;
              (running logit on estimation sample)

              Survey: Logistic regression

              Number of strata =  54                           Number of obs   =      4,652
              Number of PSUs   = 109                           Population size = 93,684,969
                                                               Design df       =         55
                                                               F(2, 54)        =       5.58
                                                               Prob > F        =     0.0063

              -------------------------------------------------------------------------------------
                                  |             Linearized
                           cbq611 | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
              --------------------+----------------------------------------------------------------
                       ethnicity2 |
                 Mexican American |   .2860224   .1223861     2.34   0.023     .0407551    .5312897
               non-Hispanic White |  -.1458525   .0904848    -1.61   0.113    -.3271881    .0354832
                                  |
                            _cons |  -.5786696   .0714312    -8.10   0.000    -.7218209   -.4355183
              -------------------------------------------------------------------------------------



              • #8
                Originally posted by Shawna Bayerman
                I want to just have the coefficient represent the comparison between MA and NHW. Not compare MA and NHW to the "other" category.

                I think this is where my question about your first code stemmed from. ib3 treats the "other" category as a base - is that the same as treating it as a reference category?

                Not exactly. The first choice is whether to estimate using the full sample or some subsample drawn from the full sample. My understanding is that you want the former. You can set the reference category (also referred to as the base category) to whatever you like, but the coefficients then represent pairwise comparisons with this reference group. If both groups to be compared are in the regression, you can use the test command. The result will be the same as setting one of the groups as the reference group.

                Code:
                webuse nhanes2d,clear
                svyset
                svy:logit highbp ib1.region
                test 2.region=3.region
                svy: logit highbp ib2.region
                Res.:

                Code:
                . svy:logit highbp ib1.region
                (running logit on estimation sample)
                
                Survey: Logistic regression
                
                Number of strata = 31                            Number of obs   =      10,351
                Number of PSUs   = 62                            Population size = 117,157,513
                                                                 Design df       =          31
                                                                 F(3, 29)        =        0.36
                                                                 Prob > F        =      0.7848
                
                ------------------------------------------------------------------------------
                             |             Linearized
                      highbp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                -------------+----------------------------------------------------------------
                      region |
                         MW  |  -.2070647   .1961878    -1.06   0.299    -.6071922    .1930629
                          S  |  -.1113247   .1764428    -0.63   0.533    -.4711822    .2485327
                          W  |  -.1258911    .173949    -0.72   0.475    -.4806625    .2288803
                             |
                       _cons |  -.4229191   .1370355    -3.09   0.004    -.7024048   -.1434334
                ------------------------------------------------------------------------------
                
                .
                . test 2.region=3.region
                
                Adjusted Wald test
                
                 ( 1)  [highbp]2.region - [highbp]3.region = 0
                
                       F(  1,    31) =    0.29
                            Prob > F =    0.5967
                
                .
                .
                .
                . svy: logit highbp ib2.region
                (running logit on estimation sample)
                
                Survey: Logistic regression
                
                Number of strata = 31                            Number of obs   =      10,351
                Number of PSUs   = 62                            Population size = 117,157,513
                                                                 Design df       =          31
                                                                 F(3, 29)        =        0.36
                                                                 Prob > F        =      0.7848
                
                ------------------------------------------------------------------------------
                             |             Linearized
                      highbp | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                -------------+----------------------------------------------------------------
                      region |
                         NE  |   .2070647   .1961878     1.06   0.299    -.1930629    .6071922
                          S  |   .0957399   .1790649     0.53   0.597    -.2694654    .4609453
                          W  |   .0811735   .1766082     0.46   0.649    -.2790213    .4413683
                             |
                       _cons |  -.6299838   .1403956    -4.49   0.000    -.9163224   -.3436451
                ------------------------------------------------------------------------------
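                Here, test reports an F statistic whereas the regression table reports a t-statistic. For a single restriction, \(|t|=\sqrt{F}\): \(\sqrt{0.29} \approx 0.54\), which agrees up to rounding with the t-statistic of 0.53 for S in the second regression (where MW is the reference group), and the p-values match (0.5967 and 0.597). The main point is that such comparisons are pairwise and do not involve the other groups, even though the regression uses the full sample to estimate the coefficients.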
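
                For the model in the original question, changing the reference group would mean making non-Hispanic White the base level of ethnicity2 (level 2 in the coding from #2). The following is an untested sketch. The coefficient on Mexican American is then the direct comparison of the two groups; going by the output in #7, it should equal .2860224 - (-.1458525) = .4318749 on the log-odds scale, because only the parameterization of the model changes.

                Code:
                * non-Hispanic White (level 2) as the reference group
                svy: logit cbq611 ib2.ethnicity2
                * the same model reported as odds ratios
                svy: logit cbq611 ib2.ethnicity2, or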
                Last edited by Andrew Musau; 03 Aug 2023, 14:53.
