
  • How to deal with "dummy variable is omitted because of collinearity"

    Hi all, I am running a logit in Stata 15.0, but I got a note that says: omitted because of collinearity
    I created dummy variables from double-type variables. For example, the 3 stages of economic development, where stage 1 = 1, stage 2 = 2, and stage 3 = 3. I generated new dummies with the following commands (LED means Level of Economic Development):

    generate LED_1 = 1 if LED == 1
    replace LED_1 = 0 if LED != 1
    generate LED_2 = 1 if LED == 2
    replace LED_2 = 0 if LED != 2
    generate LED_3 = 1 if LED == 3
    replace LED_3 = 0 if LED != 3

    Same applies for household income dummy and education dummy where I get the same note.
    When I run the regression (using the logit command), this is the output: see the attachment.

    What am I doing wrong or which commands should I use to solve this problem? Should I use a reference category, and if so, which command is used for that or can I just drop one of the variables?

    I hope someone can help me! Thanks in advance,
    Josephine

  • #2
    This is normal. Think of a two-level variable (e.g., dead or alive; male or female assigned at birth; etc.). Have you ever wondered why we don't put two dummies into the regression model, one for dead = 1, alive = 0 and another for alive = 1, dead = 0? It's because if we know who is alive, we know who is dead. The same happens with a 3-level variable: we don't need all three dummies, we only need two. The omitted level can be worked out from the other two.
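    In regression terms, the three dummies together with the constant are exactly collinear: with the coding above, LED_1 + LED_2 + LED_3 = 1 for every observation. A minimal sketch that makes the dependence visible (assuming LED is never missing):

    Code:
    * the three indicators always sum to the constant term,
    * so any one of them is determined by the other two
    assert LED_1 + LED_2 + LED_3 == 1
    assert LED_3 == 1 - LED_1 - LED_2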

    What am I doing wrong or which commands should I use to solve this problem?
    Nothing is technically wrong. Stata (and other statistical software) will simply kick one out for you. Here it seems the last one entered was discarded, and that discarded group becomes the reference group for the categorical variable. If you would rather have, say, LED_1 as the reference, then just provide LED_2 and LED_3 (without LED_1) in the regression statement and it will work.
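    For example (a sketch; entrepreneur is the outcome variable used in the code below):

    Code:
    * LED_1 is the reference category simply because it is left out
    logit entrepreneur LED_2 LED_3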

    Should I use a reference category, and if so, which command is used for that or can I just drop one of the variables?
    Actually, yes. In Stata, running a regression does not require manually creating dummies. Just learn how to use i. (see -help fvvarlist-). For example:

    Code:
    reg entrepreneur i.LED // This treats LED as a categorical variable in the model, entered as two dummies.
    reg entrepreneur ib2.LED // This makes level 2 (based on your numerical coding) the reference group.
    reg entrepreneur ib2.LED, base // The "base" option adds the reference group back into the output with a coefficient of 0. This format is easier to interpret.
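    The same factor-variable syntax carries over to the logit in the original question; a sketch (margins is optional, but often helps with interpretation):

    Code:
    logit entrepreneur i.LED
    margins LED    // average predicted probability at each LED level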

    • #3
      Apologies for piggy-backing onto this question, but I'm experiencing a similar issue and struggling with the interpretation. When running a tobit regression model, I have multimorbidity count as a covariate (factor variable 'mmcount_5cat', coded 0 "No multimorbidity", 1 "2 long-term conditions (LTC)", 2 "3 LTC", 3 "4 LTC", 4 ">4 LTC"). I am using factor notation to set the base group as 0 "No multimorbidity".

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input float index double age long(sex clusters_mid) float mmcount_5cat double(townsend bmi)
      .486 43 1 0 0   -3.9646 20.3826
       .95 54 2 0 0  -3.94309 27.5987
      .809 53 1 0 0  -2.26327 24.3107
      .674 52 1 0 0   3.78311 26.6556
         1 65 1 . 0  -3.15415 22.0889
      .879 53 2 0 0  -4.03025 31.8457
      .866 61 1 . 0  -3.95783 22.8834
      .798 67 1 . 0   .493661 20.9868
         1 58 2 . 0  -5.19304 25.5018
      .619 62 1 . 0   -1.4906 29.1215
         1 59 1 . 0  -2.37121 24.0385
      .859 50 2 0 0  -1.71538 23.6276
         1 45 2 0 0   2.23657 30.0926
      .937 66 2 . 0    .58871  26.609
         1 44 2 0 0   4.00405  28.266
      .937 55 2 . 0  -3.48707 21.4619
      .937 65 1 . 0  -2.37863 31.7333
         1 64 1 . 0  -2.18831 27.2656
      .937 64 1 . 0  -2.53244 28.8447
      .699 64 1 . 0  -4.37989 37.8401
      .872 60 1 . 0    2.9948 22.2837
      .937 47 2 0 0  -5.00357 21.8647
         1 46 1 0 0   .590737 34.8101
      .798 60 2 . 0   .020252 31.8519
       .68 60 1 . 0  -3.74637 24.5826
         1 64 2 . 0 -.0928486 30.4809
         1 55 1 . 0  -4.32101  24.579
      .505 60 1 . 0   7.34167 29.2378
         1 65 1 . 0  -1.31052 27.9136
      .557 52 1 0 0    3.1569  32.811
      .942 53 2 0 0  -1.28374 19.7559
      .838 42 1 0 0   3.49754 21.9922
      .937 59 2 . 0   2.30648 24.9925
      .879 48 2 0 0  -2.22321 31.9903
      .475 61 2 . 0  -4.18829 32.1458
      .837 44 1 0 0  -1.82453 32.3437
      .937 59 1 . 0  -2.98543 23.2735
      .922 51 1 0 0   .137565  28.235
         1 60 2 . 0  -2.51108 24.9459
      end
      label values sex sex
      label def sex 1 "1. Female", modify
      label def sex 2 "2. Male", modify
      label values clusters_mid clusters_mid
      label def clusters_mid 0 "0. No multimorbidity", modify
      label values mmcount_5cat mmcount5
      label def mmcount5 0 "No multimorbidity", modify
      Within my tobit model I am using factor notation to set 0 "No multimorbidity" as the base group:
      Code:
      tobit index i.clusters_mid i.mmcount_5cat i.sex age townsend townsendsq townsendcu bmi bmisq bmicu ///
          i.ethnic i.smoking_new i.alcohol_frequency alcohol_weekly_units i.physicalact i.selfhealth_ord ///
          i.frailty_yesnopre if age_cat == 1, nolog base ul(1)
      note: 4.mmcount_5cat omitted because of collinearity.
      
      Tobit regression                                  Number of obs     =   68,033
                                                               Uncensored =   44,412
      Limits: Lower = -inf                                  Left-censored =        0
              Upper =    1                                 Right-censored =   23,621
      
                                                        LR chi2(37)       = 14626.75
                                                        Prob > chi2       =   0.0000
      Log likelihood = -1428.7018                       Pseudo R2         =   0.8366
      
      -----------------------------------------------------------------------------------------------------
                                  EQindex | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      ------------------------------------+----------------------------------------------------------------
                             clusters_mid |
                    0. No multimorbidity  |          0  (base)
                       1. Hypertension +  |  -.1285672   .0087562   -14.68   0.000    -.1457294   -.1114051
                             2. Asthma +  |  -.1455662   .0087514   -16.63   0.000    -.1627189   -.1284134
                               3. Pain +  |  -.1749167   .0092896   -18.83   0.000    -.1931242   -.1567092
      4. Mixed/discordant multimorbidity  |  -.1346043   .0093133   -14.45   0.000    -.1528582   -.1163503
                         5. Depression +  |  -.1787647   .0090132   -19.83   0.000    -.1964306   -.1610988
                                          |
                             mmcount_5cat |
                       No multimorbidity  |          0  (base)
                                  2 LTCs  |    .109938   .0085317    12.89   0.000     .0932158    .1266602
                                  3 LTCs  |   .0842832   .0088102     9.57   0.000     .0670151    .1015512
                                  4 LTCs  |   .0623046   .0098425     6.33   0.000     .0430133    .0815959
                                 >4 LTCs  |          0  (omitted)
                                          |
                                      sex |
                               1. Female  |          0  (base)
                                 2. Male  |   .0271027   .0015275    17.74   0.000     .0241087    .0300967
                                          |
      As you can see, the model removes 4.mmcount_5cat for collinearity. I'm wondering if this is because the 'no multimorbidity' group is represented in two of the predictor variables (0.clusters_mid and 0.mmcount_5cat are the same people, one variable represents count of conditions and one represents type of conditions). Is this the case or am I missing something? Unclear how to interpret the coefficients for mmcount_5cat with two omitted levels... Any help appreciated.
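      One way to check that reading directly (a sketch, using only variables already in the model): if the two 0 levels coincide, every non-missing observation should fall either in the (0, 0) cell or in cells where both variables are non-zero.

      Code:
      * cross-tabulate the two grouping variables, including missings,
      * to see whether clusters_mid == 0 and mmcount_5cat == 0 pick out
      * the same observations
      tabulate clusters_mid mmcount_5cat, missing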

      • #4
        I'm wondering if this is because the 'no multimorbidity' group is represented in two of the predictor variables (0.clusters_mid and 0.mmcount_5cat are the same people, one variable represents count of conditions and one represents type of conditions). Is this the case or am I missing something?
        Yes, that's precisely what's happening here.

        Unclear how to interpret the coefficients for mmcount_5cat with two omitted levels...
        You can't. This model is simply malformed. You need to eliminate either the no multimorbidity level of clusters_mid or the 0 level of mmcount_5cat. I'm not sure what the best way to do that is, because I can't quite figure out what the clusters_mid variable is. You have level 1 = hypertension +, level 2 = asthma +, ... and level 4 = mixed/discordant multimorbidity. It seems to me that these are not, in any case, mutually exclusive categories--hence not suitable as levels of a single categorical construct. Why can't a person have both asthma and hypertension, thereby qualifying for both level 1 and level 2? And wouldn't that also automatically qualify them for level 4 as a mixed multimorbidity? Perhaps I am misunderstanding how this clusters_mid variable works, but if I have it right, then I think the way to go here is to rework this into four separate variables that are neither mutually exclusive nor exhaustive, representing the presence of hypertension, asthma, pain, and depression (each together with something else), respectively.
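        A hypothetical sketch of that last suggestion; has_hypertension, has_asthma, has_pain, has_depression, and n_conditions are assumed condition indicators and a condition count, not variables from the posted data:

        Code:
        * "condition + something else" indicators: the condition is present
        * and at least one other long-term condition is present as well
        generate byte hypertension_plus = has_hypertension & n_conditions >= 2
        generate byte asthma_plus       = has_asthma       & n_conditions >= 2
        generate byte pain_plus         = has_pain         & n_conditions >= 2
        generate byte depression_plus   = has_depression   & n_conditions >= 2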

        • #5
          Thank you Clyde Schechter. For clarity, 'clusters_mid' is a nominal variable representing mutually exclusive multimorbidity clusters, identified through latent class analysis. Individuals with multiple long-term conditions were probabilistically assigned to one of the clusters based on their multimorbidity profile, and the 'no multimorbidity' level was then added as a reference group... The naming convention is simply based on the leading condition(s) within each cluster, where they exist (i.e. all but mixed/discordant).

          • #6
            Thanks for clarifying. So I think the best way to handle your situation is to eliminate that no multimorbidity level from the clusters.
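            One possible way to implement that with factor-variable notation (a sketch, not the only parameterization; covariates beyond those in the dataex excerpt are omitted for brevity): enter explicit indicators for clusters 2-5 only, so that level 0 contributes nothing to clusters_mid, cluster 1 (Hypertension +) becomes the reference among the multimorbid, and the no multimorbidity contrast is carried entirely by mmcount_5cat.

            Code:
            tobit index 2.clusters_mid 3.clusters_mid 4.clusters_mid 5.clusters_mid ///
                i.mmcount_5cat i.sex age townsend bmi, ul(1)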

            • #7
              Thanks, Clyde... Returning to this now, I'm wondering whether the omission of a coefficient for 0.mmcount_5cat is actually a problem for my interpretation of the model. I am primarily interested in the differences between the clusters_mid categories and the no multimorbidity group, and all of those coefficients appear sensible. It is of course sensible that a coefficient cannot be generated to adjust for no multimorbidity, as that group would include precisely 0 individuals from each of my multimorbidity clusters. Can I interpret the coefficients for clusters_mid as they are, or is the model simply incorrect?

              • #8
                The model is incorrect. The effect of having zero multimorbidities is being split, in an arbitrary way, between two different variables. And when you have a categorical variable, if any of the categories is improperly specified, then all of the comparisons relative to it are necessarily wrong as well. Now, considering the output shown in #3, I suspect that the way Stata has broken the collinearity has left you with correct results for the clusters_mid variable, whereas the results for the count of conditions are wrong. But I don't feel certain about that.
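                One way to probe that suspicion (a sketch; m_orig and m_clean are arbitrary names, and covariates beyond the dataex excerpt are again omitted) is to fit the original and the re-parameterized specifications on the same sample and compare the clusters_mid coefficients:

                Code:
                tobit index i.clusters_mid i.mmcount_5cat i.sex age townsend bmi, ul(1)
                estimates store m_orig
                tobit index 2.clusters_mid 3.clusters_mid 4.clusters_mid 5.clusters_mid ///
                    i.mmcount_5cat i.sex age townsend bmi, ul(1)
                estimates store m_clean
                * side-by-side coefficients and standard errors; compare the
                * clusters_mid rows in particular
                estimates table m_orig m_clean, b se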
