  • How to include all categories of a categorical co-variate in a logistical regression

    Hi All,

    I am fairly new to Stata, so I apologize if this question comes off as silly. I am currently trying to run an analysis using the 2012 NHIS Adult Alternative Medicine File survey. In particular, I am trying to run a logistic regression using sex as one of my covariates (independent variables). However, every time I enter the following command:

    svy: logistic eczema i.acuseyr i.sex i.racenew i.educ age incfam07on

    only female sex comes up in my regression, despite there being many responses from males as well. I have noticed that "i.sex" does this in every single regression I run. I have attached a sample of a simple regression below (it also includes the two-way table):

    [Attached screenshot: Screen Shot 2021-08-24 at 11.05.28 AM.png]


    Any ideas on how I can include males in my regression as well?

    Thank you!



  • #2
    Doing this requires two changes:
    Code:
    logistic eczema ibn.sex, nocons
    The ibn prefix tells Stata not to use a base level, but then you need the "nocons" option to avoid the "dummy variable trap"; see
    Code:
    help fvvarlist
    Of course, maybe what you really want is to show males as the base level but keep the constant; in that case, add the baselevels option to your original command (the svy: logistic command in your post).
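
    A minimal sketch of both variants, assuming Stata's lbw example dataset with its smoke variable standing in for sex (neither comes from the original post):
    Code:
    * load an example dataset (an assumption; this is not the NHIS file)
    webuse lbw, clear
    * no base level and no constant: one row of odds per level of smoke
    logistic low ibn.smoke, noconstant
    * usual parameterization, with the base level shown as its own row
    logistic low i.smoke, baselevels
    The first table reports the odds of the outcome within each group rather than odds ratios; the second reports an odds ratio against the base level.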

    By the way, please read the FAQ on the best way to post results.



    • #3
      That is as it should be. Males are the reference category. An effect is a comparison. In your case you found that the odds of eczema for females are 1.2 times the odds for males. Once you know that, you cannot get a separate estimate for the effect of being male: if the odds for females are 1.2 times those for males, then the odds for males are 1/1.2 = 0.83 times those for females. Since this is a completely deterministic relationship, there is nothing left to estimate, so no statistics program can estimate it, nor would it be desirable.

      For presentation purposes you might want to show predicted probabilities for males and females. That is possible, since predicted probabilities are not comparisons. You would use margins for that.
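
      As a sketch of that idea, assuming the lbw example dataset with smoke standing in for sex (not the original data):
      Code:
      webuse lbw, clear
      logistic low i.smoke
      * predicted probability of the outcome at each level of smoke
      margins smoke
      * switching the base level to 1 should report the reciprocal odds ratio
      logistic low ib1.smoke
      The margins table has one row per group because predicted probabilities are not comparisons, while the refitted model only flips the direction of the same comparison.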
      ---------------------------------
      Maarten L. Buis
      University of Konstanz
      Department of history and sociology
      box 40
      78457 Konstanz
      Germany
      http://www.maartenbuis.nl
      ---------------------------------



      • #4
        It's not clear what you mean by "include males in my regression". They are included in the estimation (have a look at the number of observations).

        The estimated odds ratio labelled "females" is the estimated odds of the outcome for females divided by the estimated odds of the outcome for males.

        Do the following calculation (the numbers are taken from the 2 by 2 table) and you will get 1.209, the same as the estimated OR from the logistic model.

        Code:
        (20309*2374)/(1706*23372)
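        The same arithmetic can be done inside Stata with display; the %5.3f format is only an illustrative choice:
        Code:
        * cross-product ratio from the 2 by 2 table, shown to three decimals
        display %5.3f (20309*2374)/(1706*23372)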
        If you issue the command

        Code:
        set showbaselevels on
        then Stata will include a row in the table of parameter estimates for males (the reference level).

        I don't believe this is Stata specific. In my experience, almost all software gives the same table of parameter estimates.



        • #5
          Rich's answer and mine seem to contradict one another, but they do not. Rich's suggestion to remove the constant and add the males is similar to my suggestion to show predicted probabilities. Neither results in effects, since effects imply a comparison. Rich's solution gives you the predicted odds for men and women.

          Also see this Stata tip: http://maartenbuis.nl/publications/ref_cat.html
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------



          • #6
            Hi All!

            Thank you for your help. To clarify what I mean: for all my other categorical variables, I am given a predicted odds ratio for "every category", that is, every possible answer choice for the question asked by the survey, except for males under sex. I have tried your method, Rich, but I get very different numbers if I run it that way. I think I'm just a little confused about whether I should run it with base levels or not, or whether there's a way to include the category of males in the table shown below.

            Please let me know if you need more info!

            . svy: logistic eczema i.acuseyr i.sex i.racenew i.educ age incfam07on
            (running logistic on estimation sample)

            Survey: Logistic regression

            Number of strata = 300 Number of obs = 47,761
            Number of PSUs = 600 Population size = 137,181,562
            Design df = 300
            F( 35, 266) = 11.91
            Prob > F = 0.0000

            ---------------------------------------------------------------------------------------------------------
            | Linearized
            eczema | Odds Ratio Std. Err. t P>|t| [95% Conf. Interval]
            ----------------------------------------+----------------------------------------------------------------
            acuseyr |
            No | 1.542349 .1650304 4.05 0.000 1.249498 1.903836
            Yes | 1.691821 1.296713 0.69 0.493 .3743689 7.645555
            |
            sex |
            Female | 1.233509 .0440868 5.87 0.000 1.149731 1.323391
            |
            racenew |
            Black/African American | 1.197505 .0579165 3.73 0.000 1.088786 1.317079
            American Indian/Alaskan Native | 1.336219 .2217366 1.75 0.082 .9639491 1.852255
            Asian | .939724 .073838 -0.79 0.429 .8050947 1.096866
            Race Group Not Releasable | 1.308207 .5860116 0.60 0.549 .5417963 3.158762
            Multiple Race | 1.690656 .173228 5.12 0.000 1.38193 2.068351
            |
            educ |
            Never attended/kindergarten only | .9190259 .0895282 -0.87 0.387 .7587013 1.113229
            Grade 1 | .9648247 .1321644 -0.26 0.794 .736845 1.263341
            Grade 2 | .8257307 .1205163 -1.31 0.191 .6195857 1.100463
            Grade 3 | .7082657 .1029089 -2.37 0.018 .5321305 .9427017
            Grade 4 | .8649503 .1163815 -1.08 0.282 .6637364 1.127163
            Grade 5 | .7017271 .0879838 -2.83 0.005 .5482905 .8981022
            Grade 6 | .5763156 .071513 -4.44 0.000 .4514502 .7357172
            Grade 7 | .7173123 .0980963 -2.43 0.016 .5480629 .9388284
            Grade 8 | .5182062 .0615773 -5.53 0.000 .4101536 .6547246
            Grade 9 | .5335597 .0644586 -5.20 0.000 .4206627 .6767557
            Grade 10 | .4682343 .0586438 -6.06 0.000 .3659511 .5991055
            Grade 11 | .4924161 .0618692 -5.64 0.000 .3845479 .630542
            12th grade, no diploma | .3893954 .0664601 -5.53 0.000 .2783066 .5448265
            High school graduate | .3956555 .0366853 -10.00 0.000 .3296657 .4748545
            GED or equivalent | .5451142 .0777292 -4.26 0.000 .4117379 .7216957
            Some college, no degree | .5598879 .0501424 -6.48 0.000 .4694187 .6677928
            AA degree: technical/vocational/occu.. | .5756934 .0632236 -5.03 0.000 .4638016 .7145789
            AA degree: academic program | .6506617 .0846485 -3.30 0.001 .5036962 .8405079
            Bachelor's degree (BA,AB,BS,BBA) | .5391694 .0483946 -6.88 0.000 .4518704 .6433341
            Master's degree (MA,MS,Med,MBA) | .6040899 .0625656 -4.87 0.000 .4927034 .7406579
            Professional (MD,DDS,DVM,JD) | .7371211 .1650444 -1.36 0.174 .4744379 1.145245
            Doctoral degree (PhD, EdD) | .5032981 .127927 -2.70 0.007 .3052058 .8299613
            Unknown--refused | .133554 .1366766 -1.97 0.050 .0178248 1.000665
            Unknown--not ascertained | .2503008 .2548058 -1.36 0.175 .0337622 1.855642
            Unknown--don't know | .6205987 .2317481 -1.28 0.202 .2976199 1.294076
            |
            age | .9974352 .00113 -2.27 0.024 .9952139 .9996614
            incfam07on | .9948032 .0014612 -3.55 0.000 .9919317 .9976829
            _cons | .1634589 .0103776 -28.53 0.000 .1442611 .1852114
            ---------------------------------------------------------------------------------------------------------
            Note: _cons estimates baseline odds.



            • #7
              I suspect that each of your other variables has a value that should have been a missing value and that acts as the reference category. So your problem is actually the reverse: your other variables are problematic, and gender is the only correct variable...
              ---------------------------------
              Maarten L. Buis
              University of Konstanz
              Department of history and sociology
              box 40
              78457 Konstanz
              Germany
              http://www.maartenbuis.nl
              ---------------------------------



              • #8
                Maarten, do you know how I should go about fixing this problem? Can you clarify what you mean by "missing value"?



                • #9
                  First, please post within CODE blocks (read the FAQ if you don't know what I mean), as your results are very hard to read.

                  Second, take one of your other categorical variables and -tabulate- it using the "missing" option (see "help tabulate oneway" if you don't understand). To the extent that I can read what you posted, it appears that you have included several categories that are really missing values (e.g., the three "Unknown" lines at the bottom of what appear to be the education levels).
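
                  A minimal sketch of that check, assuming the auto example dataset; the variable rep_coded and the code 9 for "unknown" are hypothetical and only mimic a survey variable whose missing values were stored as an ordinary category:
                  Code:
                  sysuse auto, clear
                  * rep78 has genuine missing values, which the missing option lists
                  tabulate rep78, missing
                  * hypothetical: store those missing values as an ordinary code 9
                  generate rep_coded = cond(missing(rep78), 9, rep78)
                  tabulate rep_coded, missing
                  * turn code 9 back into a system missing value
                  mvdecode rep_coded, mv(9)
                  tabulate rep_coded, missing
                  After mvdecode, observations with the former code 9 would be dropped from an estimation sample instead of forming their own category.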



                  • #10
                    Maarten is suggesting that there is some coding in your data that includes a value that you are not aware of, and that value is becoming the base level, and that is why all the values - that you know of - are being shown.

                    For example, if some missing values in your data had been recoded to zero, zero would likely be chosen as the base value, and all the codes for non-missing values would appear in the results.

                    I will admit I'm a bit uncertain about this because I see that for educ your coding includes several "unknown" values that often would be treated as missing values, with those observations omitted from the analysis.

                    Your assertion that all your other categorical variables are fully represented does not appear to be correct. For example, your racenew variable includes coefficient estimates for the following categories:
                    Black/African American
                    American Indian/Alaskan Native
                    Asian
                    Race Group Not Releasable
                    Multiple Race
                    However, the documentation for at least one version of NHIS contains the following codes for racerpi2
                    01 White only
                    02 Black/African American only
                    03 AIAN only
                    04 Asian only
                    05 Race group not releasable (See file layout)
                    06 Multiple race
                    and the similarity of your category names to these suggests that "White only" was the base category for racenew.
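
                    By default Stata uses the smallest value of a factor variable as the base level, so comparing the labels with the underlying codes shows which category that will be. A sketch, assuming the lbw example dataset in place of the NHIS file:
                    Code:
                    webuse lbw, clear
                    * categories by label, then by underlying numeric code
                    tabulate race
                    tabulate race, nolabel
                    * the lowest code appears as the (base) row
                    logistic low i.race, baselevels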

                    I'll admit to further uncertainty here: using svy: may have some effect, but I do not use svy: enough to know what side effects it may have.

                    I'd suggest you try
                    Code:
                    tab acuseyr, missing
                    svy: tab acuseyr, missing
                    svy: logistic eczema i.acuseyr i.sex i.racenew i.educ age incfam07on
                    svy: tab acuseyr if e(sample), missing
                    tab acuseyr if e(sample), missing
                    and that might give us more to work with.

                    To assure maximum readability of results that you post, please copy them from the Results window into a code block in the Forum editor using code delimiters [CODE] and [/CODE], as explained in section 12 of the Statalist FAQ linked to at the top of the page. For example, the following:

                    [CODE]
                    . sysuse auto, clear
                    (1978 Automobile Data)

                    . describe make price

                    storage display value
                    variable name type format label variable label
                    -----------------------------------------------------------------
                    make str18 %-18s Make and Model
                    price int %8.0gc Price
                    [/CODE]

                    will be presented in the post as the following:
                    Code:
                    . sysuse auto, clear
                    (1978 Automobile Data)
                    
                    . describe make price
                    
                                  storage   display    value
                    variable name   type    format     label      variable label
                    -----------------------------------------------------------------
                    make            str18   %-18s                 Make and Model
                    price           int     %8.0gc                Price
                    which greatly improves readability.



                    • #11
                      As an alternative to tab you can also use Ben Jann's fre. The output from fre gives a bit more useful detail: both the values and the value labels and, by default, the missing values. I use fre all the time for exactly this purpose. To get fre, type ssc install fre in Stata.
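
                      A short sketch, assuming an internet connection for the one-time install and the auto example dataset rather than the NHIS file:
                      Code:
                      * one-time installation from SSC
                      ssc install fre
                      sysuse auto, clear
                      * values, value labels, and missing values in one table
                      fre rep78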
                      ---------------------------------
                      Maarten L. Buis
                      University of Konstanz
                      Department of history and sociology
                      box 40
                      78457 Konstanz
                      Germany
                      http://www.maartenbuis.nl
                      ---------------------------------



                      • #12
                        You will find it informative to have Stata display the reference levels in the table of parameter estimates (as both Rich and I suggested). Consider the following example:

                        Code:
                        .  webuse lbw
                        (Hosmer & Lemeshow data)
                        
                        . logistic low age lwt i.race i.smoke
                        
                        Logistic regression                                     Number of obs =    189
                                                                                LR chi2(5)    =  20.08
                                                                                Prob > chi2   = 0.0012
                        Log likelihood = -107.29639                             Pseudo R2     = 0.0856
                        
                        ------------------------------------------------------------------------------
                                 low | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
                        -------------+----------------------------------------------------------------
                                 age |   .9777443   .0334083    -0.66   0.510     .9144097    1.045466
                                 lwt |   .9875761    .006305    -1.96   0.050     .9752956    1.000011
                                     |
                                race |
                              Black  |   3.425372   1.771281     2.38   0.017     1.243215    9.437768
                              Other  |     2.5692   1.069301     2.27   0.023     1.136391    5.808555
                                     |
                               smoke |
                             Smoker  |   2.870346    1.09067     2.77   0.006        1.363    6.044672
                               _cons |   1.391144   1.540841     0.30   0.766     .1586994    12.19464
                        ------------------------------------------------------------------------------
                        Note: _cons estimates baseline odds.
                        Stata is selecting a reference level for race and smoke but not explicitly telling us which level it is selecting. We can guess, and if we are very familiar with our data then it will be a good guess, but I think it's much more informative if we have Stata explicitly report which level it is using as the reference category.

                        Code:
                        . logistic low age lwt i.race i.smoke, baselevels
                        
                        Logistic regression                                     Number of obs =    189
                                                                                LR chi2(5)    =  20.08
                                                                                Prob > chi2   = 0.0012
                        Log likelihood = -107.29639                             Pseudo R2     = 0.0856
                        
                        ------------------------------------------------------------------------------
                                 low | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
                        -------------+----------------------------------------------------------------
                                 age |   .9777443   .0334083    -0.66   0.510     .9144097    1.045466
                                 lwt |   .9875761    .006305    -1.96   0.050     .9752956    1.000011
                                     |
                                race |
                              White  |          1  (base)
                              Black  |   3.425372   1.771281     2.38   0.017     1.243215    9.437768
                              Other  |     2.5692   1.069301     2.27   0.023     1.136391    5.808555
                                     |
                               smoke |
                          Nonsmoker  |          1  (base)
                             Smoker  |   2.870346    1.09067     2.77   0.006        1.363    6.044672
                                     |
                               _cons |   1.391144   1.540841     0.30   0.766     .1586994    12.19464
                        ------------------------------------------------------------------------------
                        Note: _cons estimates baseline odds.
                        You can either use the baselevels option to the logistic command or turn it on for all estimation commands using

                        Code:
                        . set showbaselevels on, permanently

                        William wrote: Maarten is suggesting that there is some coding in your data that includes a value that you are not aware of, and that value is becoming the base level, and that is why all the values - that you know of - are being shown.

                        For example, if some missing values in your data had been recoded to zero, zero would likely be chosen as the base value, and all the codes for non-missing values would appear in the results.
                        By having Stata show the baselevels you can see exactly what Stata is doing.



                        • #13
                          Hi all

                          I wonder whether it is a "must" to use "i." with a categorical covariate in a logistic regression model in Stata when we are not interested in the effect of this categorical variable on the outcome variable. My analysis shows completely different results when I delete the "i." before a categorical covariate: depending on whether I include the "i.", the association between the outcome and my main independent variable is either significant or not. Any recommended reading would be of great help.
                          Last edited by Maryam Ghasemi; 12 Sep 2022, 17:33.



                          • #14
                            Using a categorical variable without the "i." in a logistic regression imposes a restriction on the way it is related to the outcome: the odds ratio for a value of, say, 3 relative to 2 is the same as the odds ratio for 2 relative to 1, which is the same as the odds ratio for 1 relative to 0. This restriction would rarely make sense if this is not actually a cardinal variable.
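
                            A sketch of the comparison, assuming the lbw example dataset and its three-level race variable (not the poster's data); the stored names linear and factor are arbitrary:
                            Code:
                            webuse lbw, clear
                            * without i.: a single odds ratio per one-unit step in the race code
                            logistic low race
                            estimates store linear
                            * with i.: a separate odds ratio for each level against the base
                            logistic low i.race
                            estimates store factor
                            * likelihood-ratio test of the equal-steps restriction
                            lrtest linear factor
                            The first model is nested in the second, so the likelihood-ratio test indicates whether the data are compatible with the equal-steps restriction.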



                            • #15

                              Thank you for the reply. My categorical variable is not ordinal, so are you suggesting that it is a "must" in my case? Could you please point me to any resource that explains this topic and when it is or is not possible to delete the "i."?
