How to include all categories of a categorical co-variate in a logistical regression

Sabrina Khan

Join Date: Aug 2021

Posts: 8
#1

24 Aug 2021, 12:48

Hi All,

I am fairly new to Stata, so I apologize if this question comes off as silly. I am currently trying to run an analysis using the 2012 NHIS Adult Alternative Medicine File survey. In particular, I am trying to run a logistic regression using sex as one of my covariates (independent variables). However, every time I enter the following command:

svy: logistic eczema i.acuseyr i.sex i.racenew i.educ age incfam07on

Only female sex comes up in my regression, despite there being many responses from males as well. I have noticed that "i.sex" does this in every single regression I run. I have attached a sample of a simple regression below (it also includes the two-way table):

Any ideas on how I can include males in my regression as well?

Thank you!
Rich Goldstein

Join Date: Mar 2014

Posts: 4465
#2

24 Aug 2021, 13:03

the way to do that requires two changes:

Code:

logistic eczema ibn.sex, nocons

the ibn tells Stata not to use a base level but then you need the "nocons" option to avoid the "dummy variable trap"; see

Code:

help fvvarlist

of course, maybe what you really want is to show males as the base level but keep the constant; in that case, add the baselevels option to your command as shown above
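
A minimal sketch of both options, assuming the lbw example dataset (used elsewhere in this thread) as a stand-in for the NHIS file; low and smoke take the place of eczema and sex, so this is an illustration and not the original analysis:

```stata
* Illustration only: the lbw example data stand in for the NHIS file
webuse lbw, clear

* Option 1: no base level and no constant.
* Each row is then the odds of the outcome within that group, not an odds ratio.
logistic low ibn.smoke, noconstant

* Option 2: keep the constant and the base level, but display the base row.
logistic low i.smoke, baselevels
```

The two tables answer different questions: the first reports odds within each group, the second reports a comparison against the base group.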

by the way, please read the FAQ on the best way to post results
Maarten Buis

Join Date: Mar 2014

Posts: 3456
#3

24 Aug 2021, 13:08

That is as it should be. The males are the reference category. An effect is a comparison. In your case you found that the odds of having eczema for females are 1.2 times the odds for males. Once you know that, you cannot get a separate estimate for the effect of being male: if females have 1.2 times the odds of males, then males have 1/1.2 = 0.83 times the odds of females. Since this is a completely deterministic relationship, there is nothing to estimate, so no statistics program can estimate it, nor would it be desirable.

For presentation purposes you might want to show predicted probabilities for males and females. That is possible, since predicted probabilities are not comparisons. You would use margins for that.
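
A minimal sketch of that approach, assuming the lbw example dataset as a stand-in for the NHIS file (low and smoke are placeholders for eczema and sex):

```stata
* Illustration only: the lbw example data stand in for the NHIS file
webuse lbw, clear

* Fit the model with the usual reference category
logistic low i.smoke

* Predicted probability of the outcome for each level of smoke,
* including the reference level
margins smoke
```

Because margins reports a probability for every level rather than a comparison, the reference category gets its own row.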

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Paul Dickman

Join Date: Apr 2014

Posts: 294
#4

24 Aug 2021, 13:09

It's not clear what you mean by "include males in my regression". They are included in the estimation (have a look at the number of observations).

The estimated odds ratio labelled "females" is the estimated odds of the outcome for females divided by the estimated odds of the outcome for males.

Do the following in your calculator (numbers extracted from the 2 by 2 tables) and you will get 1.209 (same as the estimated OR from the logistic model).

Code:

(20309*2374)/(1706*23372)
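
The same check can be run inside Stata instead of a calculator; a minimal sketch using the cell counts quoted above:

```stata
* Cross-product odds ratio from the 2 by 2 table counts quoted above
display (20309*2374)/(1706*23372)
```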

If you issue the command

Code:

set showbaselevels on

then Stata will include a row in the table of parameter estimates for males (the reference level).

I don't believe this is Stata specific. In my experience, almost all software gives the same table of parameter estimates.
Maarten Buis

Join Date: Mar 2014

Posts: 3456
#5

24 Aug 2021, 13:26

Rich's answer and mine seem to contradict one another, but they do not. Rich's suggestion to remove the constant and add the males is similar to my suggestion to show predicted probabilities. Neither results in effects, as effects imply a comparison. Rich's solution gives you the predicted odds for men and women.

Also see this Stata tip: http://maartenbuis.nl/publications/ref_cat.html

Sabrina Khan

Join Date: Aug 2021

Posts: 8
#6

24 Aug 2021, 14:33

Hi All!

Thank you for your help. To clarify what I mean: for all my other categorical variables, I am given a predicted odds ratio for "every category", that is, every possible answer choice for the question asked by the survey, except for males under sex. I have tried your method, Rich, but I seem to get very different numbers if I run it that way. I think I'm just a little confused about whether I should run it with base levels or not, or whether there's a way to include the category of males in the table shown below.

Please let me know if you need more info!

. svy: logistic eczema i.acuseyr i.sex i.racenew i.educ age incfam07on
(running logistic on estimation sample)

Survey: Logistic regression

Number of strata = 300 Number of obs = 47,761
Number of PSUs = 600 Population size = 137,181,562
Design df = 300
F( 35, 266) = 11.91
Prob > F = 0.0000

---------------------------------------------------------------------------------------------------------
| Linearized
eczema | Odds Ratio Std. Err. t P>|t| [95% Conf. Interval]
----------------------------------------+----------------------------------------------------------------
acuseyr |
No | 1.542349 .1650304 4.05 0.000 1.249498 1.903836
Yes | 1.691821 1.296713 0.69 0.493 .3743689 7.645555
|
sex |
Female | 1.233509 .0440868 5.87 0.000 1.149731 1.323391
|
racenew |
Black/African American | 1.197505 .0579165 3.73 0.000 1.088786 1.317079
American Indian/Alaskan Native | 1.336219 .2217366 1.75 0.082 .9639491 1.852255
Asian | .939724 .073838 -0.79 0.429 .8050947 1.096866
Race Group Not Releasable | 1.308207 .5860116 0.60 0.549 .5417963 3.158762
Multiple Race | 1.690656 .173228 5.12 0.000 1.38193 2.068351
|
educ |
Never attended/kindergarten only | .9190259 .0895282 -0.87 0.387 .7587013 1.113229
Grade 1 | .9648247 .1321644 -0.26 0.794 .736845 1.263341
Grade 2 | .8257307 .1205163 -1.31 0.191 .6195857 1.100463
Grade 3 | .7082657 .1029089 -2.37 0.018 .5321305 .9427017
Grade 4 | .8649503 .1163815 -1.08 0.282 .6637364 1.127163
Grade 5 | .7017271 .0879838 -2.83 0.005 .5482905 .8981022
Grade 6 | .5763156 .071513 -4.44 0.000 .4514502 .7357172
Grade 7 | .7173123 .0980963 -2.43 0.016 .5480629 .9388284
Grade 8 | .5182062 .0615773 -5.53 0.000 .4101536 .6547246
Grade 9 | .5335597 .0644586 -5.20 0.000 .4206627 .6767557
Grade 10 | .4682343 .0586438 -6.06 0.000 .3659511 .5991055
Grade 11 | .4924161 .0618692 -5.64 0.000 .3845479 .630542
12th grade, no diploma | .3893954 .0664601 -5.53 0.000 .2783066 .5448265
High school graduate | .3956555 .0366853 -10.00 0.000 .3296657 .4748545
GED or equivalent | .5451142 .0777292 -4.26 0.000 .4117379 .7216957
Some college, no degree | .5598879 .0501424 -6.48 0.000 .4694187 .6677928
AA degree: technical/vocational/occu.. | .5756934 .0632236 -5.03 0.000 .4638016 .7145789
AA degree: academic program | .6506617 .0846485 -3.30 0.001 .5036962 .8405079
Bachelor's degree (BA,AB,BS,BBA) | .5391694 .0483946 -6.88 0.000 .4518704 .6433341
Master's degree (MA,MS,Med,MBA) | .6040899 .0625656 -4.87 0.000 .4927034 .7406579
Professional (MD,DDS,DVM,JD) | .7371211 .1650444 -1.36 0.174 .4744379 1.145245
Doctoral degree (PhD, EdD) | .5032981 .127927 -2.70 0.007 .3052058 .8299613
Unknown--refused | .133554 .1366766 -1.97 0.050 .0178248 1.000665
Unknown--not ascertained | .2503008 .2548058 -1.36 0.175 .0337622 1.855642
Unknown--don't know | .6205987 .2317481 -1.28 0.202 .2976199 1.294076
|
age | .9974352 .00113 -2.27 0.024 .9952139 .9996614
incfam07on | .9948032 .0014612 -3.55 0.000 .9919317 .9976829
_cons | .1634589 .0103776 -28.53 0.000 .1442611 .1852114
---------------------------------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
Maarten Buis

Join Date: Mar 2014

Posts: 3456
#7

24 Aug 2021, 15:14

I suspect that in all other variables you have a value that should have been a missing value, and that value acts as the reference category. So your problem is actually the complete reverse: your other variables are problematic, and gender is the only correct variable...

Sabrina Khan

Join Date: Aug 2021

Posts: 8
#8

24 Aug 2021, 17:49

Maarten, do you know how I should go about fixing this problem? Can you clarify what you mean by "missing value"?
Rich Goldstein

Join Date: Mar 2014

Posts: 4465
#9

24 Aug 2021, 19:05

first, please post within CODE blocks (read the FAQ if you don't know what I mean) as your results are very hard to read

second, take one of your other categorical variables and -tabulate- it using the "missing" option (see "help tabulate oneway" if you don't understand); to the extent that I can read what you posted, it appears that you have included several categories that are really missing values (e.g., the three "unknown" lines in your output at the bottom of what appears to be education levels)
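
a minimal sketch of that check and one possible fix, on a made-up variable; educ_demo and the codes 97 and 98 are hypothetical, so look up the real codes in your own data before recoding anything:

```stata
* Toy data: educ_demo and the codes 97/98 are hypothetical placeholders
clear
input byte educ_demo
1
1
2
2
97
98
end

* Show the numeric codes, including any missing values
tabulate educ_demo, missing nolabel

* Turn the "unknown" codes into extended missing values
mvdecode educ_demo, mv(97=.a \ 98=.b)

* The recoded observations now show up as missing
tabulate educ_demo, missing
```

observations with missing values in a covariate are then left out of the estimation sample rather than forming a category of their own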
William Lisowski

Join Date: Dec 2014

Posts: 10150
#10

24 Aug 2021, 19:20

Maarten is suggesting that there is some coding in your data that includes a value that you are not aware of, and that value is becoming the base level, and that is why all the values - that you know of - are being shown.

For example, if some missing values in your data had been recoded to zero, zero would likely be chosen as the base value, and all the codes for non-missing values would appear in the results.

I will admit I'm a bit uncertain about this because I see that for educ your coding includes several "unknown" values that often would be treated as missing values, with those observations omitted from the analysis.

Your assertion that all your other categorical variables are fully represented does not appear to be correct. For example, your racenew variable includes coefficient estimates for the following categories:
Black/African American
American Indian/Alaskan Native
Asian
Race Group Not Releasable
Multiple Race
However, the documentation for at least one version of NHIS contains the following codes for racerpi2
01 White only
02 Black/African American only
03 AIAN only
04 Asian only
05 Race group not releasable (See file layout)
06 Multiple race
and the similarity of your category names to these suggests that "White only" was the base category for racenew.

I'll admit to further uncertainty here, because perhaps using svy: may have some effect here, but I am not a user of svy: to know what side effects it may have.

I'd suggest you try

Code:

tab acuseyr, missing
svy: tab acuseyr, missing
svy: logistic eczema i.acuseyr i.sex i.racenew i.educ age incfam07on
svy: tab acuseyr if e(sample), missing
tab acuseyr if e(sample), missing

and that might give us more to work with.

To assure maximum readability of results that you post, please copy them from the Results window into a code block in the Forum editor using code delimiters [CODE] and [/CODE], as explained in section 12 of the Statalist FAQ linked to at the top of the page. For example, the following:

[CODE]
. sysuse auto, clear
(1978 Automobile Data)

. describe make price

              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------
make            str18   %-18s                 Make and Model
price           int     %8.0gc                Price
[/CODE]

will be presented in the post as the following:

Code:

. sysuse auto, clear
(1978 Automobile Data)

. describe make price

              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------
make            str18   %-18s                 Make and Model
price           int     %8.0gc                Price

which greatly improves readability.
Maarten Buis

Join Date: Mar 2014

Posts: 3456
#11

25 Aug 2021, 01:12

As an alternative to tab you can also use Ben Jann's fre. The output from fre gives a bit more useful details; both the values and value labels, and by default the missing values. I use fre all the time for exactly this purpose. To get fre you type in Stata ssc install fre .


Paul Dickman

Join Date: Apr 2014
Posts: 294

#12

25 Aug 2021, 01:42

You will find it informative to have Stata display the reference levels in the table of parameter estimates (as both Rich and I suggested). Consider the following example:

Code:

.  webuse lbw
(Hosmer & Lemeshow data)

. logistic low age lwt i.race i.smoke

Logistic regression                                     Number of obs =    189
                                                        LR chi2(5)    =  20.08
                                                        Prob > chi2   = 0.0012
Log likelihood = -107.29639                             Pseudo R2     = 0.0856

------------------------------------------------------------------------------
         low | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .9777443   .0334083    -0.66   0.510     .9144097    1.045466
         lwt |   .9875761    .006305    -1.96   0.050     .9752956    1.000011
             |
        race |
      Black  |   3.425372   1.771281     2.38   0.017     1.243215    9.437768
      Other  |     2.5692   1.069301     2.27   0.023     1.136391    5.808555
             |
       smoke |
     Smoker  |   2.870346    1.09067     2.77   0.006        1.363    6.044672
       _cons |   1.391144   1.540841     0.30   0.766     .1586994    12.19464
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

Stata is selecting a reference level for race and smoke but not explicitly telling us which level it is selecting. We can guess, and if we are very familiar with our data then it will be a good guess, but I think it's much more informative if we have Stata explicitly report which level it is using as the reference category.

Code:

. logistic low age lwt i.race i.smoke, baselevels

Logistic regression                                     Number of obs =    189
                                                        LR chi2(5)    =  20.08
                                                        Prob > chi2   = 0.0012
Log likelihood = -107.29639                             Pseudo R2     = 0.0856

------------------------------------------------------------------------------
         low | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .9777443   .0334083    -0.66   0.510     .9144097    1.045466
         lwt |   .9875761    .006305    -1.96   0.050     .9752956    1.000011
             |
        race |
      White  |          1  (base)
      Black  |   3.425372   1.771281     2.38   0.017     1.243215    9.437768
      Other  |     2.5692   1.069301     2.27   0.023     1.136391    5.808555
             |
       smoke |
  Nonsmoker  |          1  (base)
     Smoker  |   2.870346    1.09067     2.77   0.006        1.363    6.044672
             |
       _cons |   1.391144   1.540841     0.30   0.766     .1586994    12.19464
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

You can either use the baselevels option to the logistic command or turn it on for all estimation commands using

Code:

. set showbaselevels on, permanently

William wrote: Maarten is suggesting that there is some coding in your data that includes a value that you are not aware of, and that value is becoming the base level, and that is why all the values - that you know of - are being shown.

For example, if some missing values in your data had been recoded to zero, zero would likely be chosen as the base value, and all the codes for non-missing values would appear in the results.

By having Stata show the baselevels you can see exactly what Stata is doing.


Maryam Ghasemi

Join Date: Jul 2022

Posts: 17
#13

12 Sep 2022, 17:29

Hi all

I wonder if it is a "must" to use "i." with a categorical covariate in a logistic regression model in Stata when we are not interested in the effect of this categorical variable on the outcome variable. My analysis shows completely different results when I delete the "i." before a categorical covariate. When I do not use the "i.", the association between the outcome and main independent variable is significant, but when I delete it, the association is significant. Any recommended reading would be of great help.

Hemanshu Kumar

Join Date: Mar 2015

Posts: 1402
#14

12 Sep 2022, 17:44

Using a categorical variable without the "i." in a logistic regression actually imposes a restriction on the way it is related to the outcome: the odds ratio for the variable taking a value of, say, 3 relative to 2 is forced to be the same as the odds ratio for 2 relative to 1, which in turn is the same as the odds ratio for 1 relative to 0. This restriction would rarely make sense if this is not actually a cardinal variable.
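
A minimal sketch of the difference, assuming the lbw example dataset from earlier in this thread, with its three-level race variable standing in for your categorical covariate:

```stata
* Illustration only: race in the lbw example data has three unordered levels
webuse lbw, clear

* Without i.: race is treated as a number, so a single odds ratio is
* estimated per one-unit step (2 vs. 1 is forced to equal 3 vs. 2)
logistic low race

* With i.: a separate odds ratio is estimated for each level
* relative to the base level
logistic low i.race, baselevels
```

Comparing the two tables shows how much the equal-steps restriction changes the estimates for the covariate itself, and any other variables in a fuller model can shift as well.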
Maryam Ghasemi

Join Date: Jul 2022

Posts: 17
#15

12 Sep 2022, 22:29

Thank you for the reply. My categorical variable is not ordinal, so are you suggesting that it is a "must" in my case? Could you please point me to any resource explaining this topic and when it is or is not possible to delete the "i."?
