  • Question on logit with Categorical variables

    I am very new to Stata and logit models. I have a dummy variable called illness, which is my dependent variable. I am running a logit regression on it with one continuous predictor (food expenditure) and a couple of categorical predictors: access to piped water and whether there is a shared latrine. I have a couple of questions:
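
    In Stata syntax, what I have been running looks something like this (the variable names here are placeholders, not my actual ones):

        * categorical predictors entered as plain numeric variables, no i. prefix
        logit illness food_exp piped_water shared_latrine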

    1) Is it okay to use my categorical variables without an i. in front if they are scaled into 3 categories (e.g. higher values show greater access to piped water)?

    2) If I get coefficients that are large - e.g. 1.97, giving me an odds ratio of 7.2 - is that okay?

    I have only about 200 observations as this is primary data I collected.

    Thank you!

  • #2
    1) Is it okay to use my categorical variables without an i. in front if they are scaled into 3 categories (e.g. higher values show greater access to piped water)?
    Usually not, but sometimes it is OK. The problem is that when you enter the variable without i., if the categories are coded as 1, 2, and 3, then you are constraining the model so that the odds ratio of illness for exposure levels 2 vs 1 is equal to the odds ratio for 3 vs 2, and the odds ratio for 3 vs 1 is the square of the odds ratio for 2 vs 1. In other words, the values 1, 2, and 3 are taken literally as numeric values. By contrast, when you use i., the variable is interpreted as simply having three arbitrary categories, with no constraints on the relationships among the odds ratios between different pairs of categories. The numerical codes 1, 2, and 3 are not treated as numbers, but simply as labels for three different states.
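
    To see the constraint in numbers, pick an illustrative coefficient (the 0.5 below is made up for the example):

        * under the linear coding with coefficient b, the implied odds ratios are
        * OR(2 vs 1) = OR(3 vs 2) = exp(b), and OR(3 vs 1) = exp(2*b) = exp(b)^2
        display exp(0.5)      // OR for 2 vs 1 when b = 0.5
        display exp(0.5)^2    // implied OR for 3 vs 1, the square of the above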

    If you think that the constraints referred to above are realistic, then your model is actually more efficient if you omit the i. But if those constraints are not realistic, then you are mis-specifying the model and the results can be quite misleading. One way to look into this is to first run the model with i. Then look at the coefficients (not the odds ratios, the coefficients) for levels 1, 2, and 3. If the values of _b[1.var], _b[2.var], and _b[3.var] are equally spaced, or at least close enough to that for practical purposes, then the constraints would be reasonable, and using c. would simplify your model and enhance its efficiency.
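
    Here is a sketch of that check, with placeholder variable names; the -test- line asks whether level 3's coefficient is twice level 2's, which is what equal spacing means when level 1 is the base (coefficient 0):

        logit illness c.food_exp i.piped_water i.shared_latrine
        display _b[2.piped_water]
        display _b[3.piped_water]
        * equal spacing: the 3 vs 2 step equals the 2 vs 1 step
        test _b[3.piped_water] = 2*_b[2.piped_water]

    Bear in mind that with about 200 observations this test has limited power, so inspect the spacing of the coefficients themselves rather than relying on the p-value alone.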

    2) If I get coefficients that are large - e.g. 1.97, giving me an odds ratio of 7.2 - is that okay?
    If the coefficient is that of a continuous variable, you cannot interpret its magnitude without knowing the scale of the underlying variable. The 7.2 is the odds ratio associated with a 1-unit change in the variable. If 1-unit changes actually occur, even frequently, then an odds ratio of 7.2 would be astronomically large, and you would bear a considerable burden of proof as to why it should be taken seriously and not as a sign of something seriously wrong with the analysis. On the other hand, if the entire variable ranges between 0.4 and 0.5, a unit change can never occur, and even the largest possible change is an order of magnitude smaller than 1. Accordingly, the maximal change of 0.1 (going from 0.4 to 0.5) would be associated with an odds ratio of 7.2^0.1 = 1.22 (to 2 decimal places), which is a perfectly reasonable odds ratio for a moderate or weak effect.
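
    The rescaling arithmetic is easy to verify in Stata:

        * OR for a change of d units = (OR for 1 unit)^d = exp(d*b)
        display 7.2^0.1           // = 1.22, the OR for a 0.1-unit change
        display exp(0.1 * 1.97)   // the same thing, computed from the coefficient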

    If the variable you refer to is a 0/1 dichotomous variable (which it necessarily is if you have i. in front of it), then this is like a continuous variable where 0 vs 1 changes (a unit change) are the currency of the realm. So 7.2 is an extremely large odds ratio and would be highly suspect. In real life, odds ratios for dichotomous variables are rarely greater than 4 (or less than 0.25). There are exceptions: the smoking:lung cancer odds ratio is about 10, and the asbestos:lung cancer odds ratio is around 20. But you just don't see things like this very often, and those ratios are only given credence because they have been reproduced in large numbers of well-designed studies. My rule of thumb is that odds ratios greater than 3 (or < 0.33) for dichotomous variables are suspicious, and any odds ratio > 4 (or < 0.25) for a dichotomous variable is an error until proven otherwise.
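
    On the coefficient scale, those rule-of-thumb cutoffs translate to:

        display ln(3)   // = 1.10; coefficients beyond +/-1.10 on a 0/1 variable are suspicious
        display ln(4)   // = 1.39; beyond +/-1.39 is an error until proven otherwise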

    Sample size is quite relevant here. It is well known that logistic regression produces upwardly biased estimates for the effects of rare variables. While 200 may be a respectable sample size as a whole, if one of your predictor variables is highly unbalanced, say 190 zeroes and 10 ones, then an odds ratio as large as 7.2 could arise just as a result of the bias that comes from this rarity. When you have rare predictors like that, and they are in the model because you are trying to estimate their effects rather than just to adjust for their confounding, it is best not to use logistic regression. A linear probability model will produce less biased estimates (though its predicted probabilities may not be usable if probabilities close to 0 or 1 are likely). Or you can use penalized maximum likelihood logistic regression (-firthlogit-, by Joseph Coveney, available from SSC). The use of penalized maximum likelihood corrects for the rarity bias of simple logistic regression.
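
    A sketch of both alternatives, again with placeholder variable names (-firthlogit- must be installed from SSC first):

        * check how unbalanced each categorical predictor is
        tabulate shared_latrine illness

        * linear probability model with heteroskedasticity-robust standard errors
        regress illness c.food_exp i.piped_water i.shared_latrine, vce(robust)

        * penalized (Firth) logistic regression; if your copy of firthlogit does
        * not accept factor variables, create 0/1 indicator variables first
        ssc install firthlogit
        firthlogit illness c.food_exp i.piped_water i.shared_latrine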
    Last edited by Clyde Schechter; 17 Jul 2024, 19:47.
