Omitted- predicts failure or success

sandeep kaur

Join Date: Jul 2022

Posts: 60
#1


06 Aug 2022, 18:28

logistic Out Exp var_n var_m var_k var_k

var_n != 1 predicts success perfectly
var_n dropped and 1 obs not used

var_m != 1 predicts failure perfectly
var_m dropped and 18 obs not used

When I run the logistic regression with the above two covariates, the output for these covariates is omitted. There is no cell with 0 count on doing tab Out var_n and Out var_m.

What does this reflect? Is there any way to rectify it?
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#2

06 Aug 2022, 20:19

If, after the logistic regression, you run

Code:

tab Out var_n if e(sample)
tab Out var_m if e(sample)

You will see that Out is always equal to 1 whenever var_n is not equal to 1, and that Out is always equal to 0 when var_m is not equal to 1.

This kind of situation, where some value or combination of values of an explanatory variable is always associated with the same value of the outcome variable (in your case Out = 1 for var_n and Out = 0 for var_m; the problem is the same either way), is known as perfect prediction, and it is not possible to do logistic regression in this situation. The reason is that logistic regression estimates its parameters using maximum likelihood estimation, and the maximum likelihood estimate of the coefficient of var_n or var_m would, in these cases, be infinite. So there is no hope for convergence. Stata, rather than spinning its wheels endlessly, checks for this situation in advance and, if it finds it, takes the steps that it has shown you above: it removes that variable and the associated problematic observations.
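To see the mechanism on made-up data, here is a minimal sketch. The variables x and y and the 20 observations are hypothetical, not the data from #1. Every observation with x == 0 is given y == 1, so Stata should report a note analogous to the one in #1 and drop x together with those observations.

```stata
* Hypothetical toy data, for illustration only
clear
set obs 20

* x is 0 for the first 3 observations and 1 for the rest
generate byte x = _n > 3

* y alternates 1, 0, 1, 0, ... and is then forced to 1 whenever x == 0
generate byte y = mod(_n, 2)
replace y = 1 if x == 0

* The cross-tabulation shows an empty cell at y == 0, x == 0
tabulate y x

* The coefficient of x has no finite maximum likelihood estimate
logit y x
```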

There may be several ways in which this situation can arise and you should check to see if any of them apply to you:

1. The data may be erroneous. In this case the solution is to fix the data.
2. There may be only a handful of observations for which var_n != 1, or var_m != 1, and as chance would have it, they just happen to always have the same value of Out. (This seems likely in your case with regard to var_n, but much less so for var_m where there are 18 observations involved.) In this case, if it is possible to obtain a larger data set that has more observations with these conditions, but where Out takes the other value, that will resolve the problem.
3. There may be a systematic reason why the offending explanatory variables and the Outcome variable are related in this way. In this case, it makes no sense to include these variables in the model in the first place: if knowledge of one of these variables enables you to predict the outcome with certainty, then all other variables become irrelevant in that situation and having these variables in the model with them is pointless and self-contradictory. A more carefully thought-out model is the solution here.

In the event that none of these solutions applies, it is possible to fit logistic models to data like this using estimators that do not use maximum likelihood. For very small data sets, the -exlogistic- command will do this. More generally, -firthlogit-, by Joseph Coveney, available from SSC, uses penalized maximum likelihood estimation and is suitable for this kind of problem. Do not, however, automatically resort to one of these without first making a serious attempt to resolve the problem according to 1-3 above.
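As a rough sketch of what those two commands look like with the variable names from #1 (which covariates to keep is an assumption made here for illustration, and only the default options are shown):

```stata
* Exact logistic regression (official Stata command).
* Practical only for very small data sets and few covariates.
exlogistic Out Exp var_n var_m

* Penalized (Firth) logistic regression.
* Community-contributed, so it must be installed from SSC once.
ssc install firthlogit
firthlogit Out Exp var_n var_m var_k
```

Both commands accept further options; see their help files (help exlogistic, help firthlogit) before relying on the defaults.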

Added:

"There is no cell with 0 count on doing tab Out var_n and Out var_m."

This is neither here nor there. The estimation sample always excludes any observations where any variable in the model has a missing value. Doing -tab Out var_whatever- does not give you the correct view of the data because it includes those. It may well be that if it were possible to include the observations that were omitted due to missing data, the perfect prediction would be overcome, but it is not possible to include those, so this is of no help. Doing the tabulations with the -if e(sample)- clause after the regression shows you the correct view of the data for this problem as it is based only on those observations that are eligible to contribute to the logistic regression.
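A minimal sketch of how one might check this, assuming it is run immediately after the logistic command from #1 so that e(sample) refers to that model:

```stata
* How many observations actually entered the estimation
count if e(sample)

* Missing-value counts for each variable in the model
misstable summarize Out Exp var_n var_m var_k

* The observations that were left out, with missing values shown explicitly
tab Out var_n if !e(sample), missing
```

Comparing the last table with the -if e(sample)- version shows which cells are filled only by observations that never entered the estimation.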

Last edited by Clyde Schechter; 06 Aug 2022, 20:25.
sandeep kaur

Join Date: Jul 2022

Posts: 60
#3

06 Aug 2022, 20:58

Clyde Schechter

Thanks for the great explanation. I have 15 covariates, a mix of binary and categorical variables. The sample size is small, and I have been encountering omitted values with some models.
That explains it now. In this case it seems to be the 3rd reason. I will remove the variable.

1. var_n != 1 : The exclamation mark here means not equal.

2. Is there a way to handle missing observations? For example, var1 = 50 obs, var2 = 45 obs, var3 = 50 obs. On running a regression with these variables, the effect estimates are based on 45 obs only and do not take the complete set of obs into account.
The problem arises when comparing the model with fewer observations to the model with the complete set of obs. The LR test cannot be performed.
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#4

06 Aug 2022, 21:21

In Stata, the ! operator stands for logical negation. In particular, as you note, != means not equal.

If you have only 45 usable observations, you should not be fitting models with anywhere near 15 covariates! You are grossly overfitting the data when you do that and the results will have no generalizability. There are various rules of thumb as to how many observations you need per variable (and a categorical variable with n levels counts as n-1 variables because it takes n-1 indicators to represent it). But I think the most lenient rule of thumb is 10 observations per variable. So you would be limited to 4 explanatory variables under that rule. (Many people would say that this rule is far too lax, by the way.)

Missing data is always a problem and there are no really good solutions other than finding the correct values, which is usually not feasible in the real world. There are numerous approaches that are used for missing data, but all of them are either known to produce misleading results in most circumstances, or produce reliable results only under assumptions that are always unverifiable and only sometimes plausible. You might want to read https://statisticalhorizons.com/wp-c...aterials-1.pdf for a good review.
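On the narrower point that the LR test cannot be performed when the two models use different numbers of observations: one common workaround is to fit both models on the same estimation sample. The sketch below assumes a larger model with var_k and a smaller one without it; the marker variable touse and the stored names full and reduced are hypothetical. This is a complete-case analysis, so it does not solve the missing data problem and carries the assumptions discussed above.

```stata
* Fit the larger model first; it uses only observations complete on all its variables
logistic Out Exp var_k
generate byte touse = e(sample)
estimates store full

* Refit the smaller model on exactly the same observations
logistic Out Exp if touse
estimates store reduced

* Likelihood-ratio test of the nested models on the common sample
lrtest full reduced
```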
sandeep kaur

Join Date: Jul 2022

Posts: 60
#5

07 Aug 2022, 11:58

Clyde Schechter
Thanks for your explanation. In this scenario, should I
a) use 2-3 covariates which are clinically relevant,
or
b) stick to other statistics like chi2/Fisher's exact test? With a sample size of only 55 and many covariates, regression is not working.
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#6

07 Aug 2022, 12:15

I would generally favor a), with the understanding that the clinically relevant variables you choose are actually confounders. Remember that a variable C is a confounder of the A -> B relationship (-> denotes actual or suspected causation) if C -> B and the distribution of C is not independent of the value of A. If C -> B but C is independent of A (or in the situation where C is not independent of A but the causal direction is A -> C) then we do not have a confounder. In a large sample that can support regression with a large number of covariates, it is often advisable to include a non-confounding C of the first kind (C -> B, with C independent of A) in the model: it has the effect of reducing residual variance and thereby increasing power. That is nice, but it is less important than including confounders, because omitting confounders leads to biased estimates of the effects of interest. So when your data set is small and limits you to a small number of covariates, the true confounders are the priority. If it happens that you don't have any confounders, or only one, and your data will support 2 or 3 covariates, then you have the luxury to add non-confounders, once you have included the confounders. But in most real world situations the data offer many confounders, and those must be your first priority.
sandeep kaur

Join Date: Jul 2022

Posts: 60
#7

07 Aug 2022, 16:32

Clyde Schechter
Thanks for the great advice, professor. It makes one wonder about the purpose of collecting information on so many variables in the first place. All covariates are gender-role related variables. With a large sample size one can create a gender score via factor analysis and the propensity score method. But it won't work with a small sample size.

I will use 3-4 important gender-role variables and may have to collapse a few sub-categories.
