
  • problem with empty cell

    Dear all,

    I have a logistic regression model with the interaction i.compliance#i.age.
    Compliance categories: 1 2 3 4 (baseline: compliance 1)
    Age categories: 1 2 3 (baseline: age 1)
    My dependent variable is utilization of SBA (coded 0 [no] and 1 [yes]).

    I have a problem with an empty cell.
    What should I do to fix my problem? I'm sorry, I don't know why I can't attach the file.

    Code:
    . logit SBA i.kepatuhan i.usia i.pendidikan i.pekerjaan i.keputusan i.kuintil i.paritas i.komplikasi i.residen i.asuransi i.planning i.kepatuhan#i.usia, or

    note: 2.kepatuhan#2.usia != 0 predicts success perfectly
    2.kepatuhan#2.usia dropped and 42 obs not used

    note: 2.kepatuhan#3.usia != 0 predicts success perfectly
    2.kepatuhan#3.usia dropped and 1 obs not used

    note: 4.kepatuhan#3.usia != 0 predicts success perfectly
    4.kepatuhan#3.usia dropped and 11 obs not used
    And this is the relevant part of the output for the interaction:

    Code:
                      kepatuhan#usia     |  Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
    -------------------------------------+---------------------------------------------------------------
    tidak k4, 9T lengkap#20-35 tahun     |           1   (empty)
    tidak k4, 9T lengkap#<20 tahun       |           1   (empty)
    K4, 9T tidak lengkap#20-35 tahun     |    1.291457   .6744692     0.49   0.624     .4640197   3.594378
    K4, 9T tidak lengkap#<20 tahun       |    2.604826   1.749696     1.43   0.154     .6982575   9.717216
    K4, 9T#20-35 tahun                   |    13.85527   18.49688     1.97   0.049     1.012174   189.6596
    K4, 9T#<20 tahun                     |           1   (empty)
                                         |
    _cons                                |    1.547874   2.147119     0.31   0.753     .1020939   23.46775

  • #2
    What Stata is telling you in the first message is that whenever kepatuhan == 2 & usia == 2, SBA is always 1. The other two messages say that the same thing happens for a couple of other combinations of kepatuhan and usia.

    Logistic regression is estimated by maximum likelihood. And in a case where some value (or range of values) of a variable always has the outcome = 1 (or always has the outcome = 0), then, mathematically, the maximum likelihood estimate of the coefficient of that variable is infinity (or negative infinity). In other words, it is not possible for a logistic regression to include these variables and converge. So Stata's approach to this is to simply drop those variables and the cases affected by them from the model. You would then interpret the model as applying only to cases that do not have the values of these variables that were excluded. For cases with those values, you don't need a model because you know the result is always 1 (or, in other situations, always 0).
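
    Here is a minimal sketch with made-up data (the variable names are hypothetical, not from your model) that shows the behavior: whenever x == 1, the outcome is always 1, so -logit- drops 1.x and the affected observations.
    Code:
    * toy data: y is always 1 when x == 1
    clear
    input y x
    1 1
    1 1
    0 0
    1 0
    0 0
    end
    logit y i.x
    * Stata notes that 1.x != 0 predicts success perfectly,
    * then drops 1.x and does not use the two observations with x == 1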

    If that approach is unsatisfactory, you can use Joseph Coveney's -firthlogit-, available from SSC. This command uses penalized maximum likelihood, and it can produce finite coefficient estimates in the face of perfect prediction.
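
    For example (a sketch, reusing the covariate list from your -logit- call):
    Code:
    * -firthlogit- is a community-contributed command; install it once from SSC
    ssc install firthlogit
    * then fit the same model by penalized maximum likelihood
    firthlogit SBA i.kepatuhan i.usia i.pendidikan i.pekerjaan i.keputusan i.kuintil i.paritas i.komplikasi i.residen i.asuransi i.planning i.kepatuhan#i.usia, or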



    • #3
      Thank you, Sir.
      If I follow your first advice, could I say that I don't need to change the model?
      And when I want to see the ORs of the interaction, I don't need to compute ORs for the terms that were excluded?



      • #4
        I don't exactly understand how to interpret the excluded variables, Sir.
        Could you please give me an example of them, especially for the OR?



        • #5
          If you stick with -logit- then you do not need to change the model, and you can just disregard the ORs for variables that were omitted. But, and this is crucial, in reporting your findings you must make it clear that the model does not apply to observations with values of kepatuhan and usia in the excluded combinations, and then explain that, in those combinations, the outcome is always 1. Without those additional statements you would be misrepresenting your model.

          Added: Crossed with #4.

          To be specific, let's focus on tidak k4, 9T lengkap#20-35 tahun as an example. This category of the interaction was omitted from the model due to perfect prediction of a 1 outcome. So you have to say:

          1. In any case where kepatuhan = tidak k4, 9T lengkap and usia = 20-35 tahun, the logistic regression model does not apply. In such a case, SBA is always 1.

          And, to be clear, there is no such thing as an OR associated with that. It is not part of the logistic regression model.
          Last edited by Clyde Schechter; 19 Jan 2018, 15:56.



          • #6
            Thank you very much, Sir!
            It helps me a lot.



            • #7
              One more thing, Sir.
              With this model, how can I see the OR of any level of kepatuhan at >35 tahun (>35 tahun is the baseline)?



              • #8
                Am I right if I use -lincom 2.kepatuhan + 2.kepatuhan#1.usia-? (usia 1 is the code for >35 tahun.)



                • #9
                  "Am I right if I use -lincom 2.kepatuhan + 2.kepatuhan#1.usia-? (usia 1 is the code for >35 tahun.)"
                  That is the right approach for the OR of 2.kepatuhan vs 1.kepatuhan conditional on >35 tahun. However, as written it will give you the sum of the regression coefficients. To get it in the odds ratio metric, you need to specify the option -or- at the end of the -lincom- command (after a comma, of course).
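
                  Concretely (a sketch, to be run right after your -logit- model):
                  Code:
                  * OR of 2.kepatuhan vs 1.kepatuhan at the baseline age group (>35 tahun)
                  lincom 2.kepatuhan + 2.kepatuhan#1.usia, or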



                  • #10
                    Sir,
                    Finally, I followed your recommendation to use the -firthlogit- command, and it worked well.

                    My supervisor asked me, "How can -firthlogit- fill the empty cell?"
                    I need your explanation, Sir.

                    Thank you.



                    • #11
                      -firthlogit- does not fill the empty cell. In fact, the empty cell was never really an issue in your model. It affected nothing that you did. Stata was merely pointing out as a courtesy to you that one of your cells was empty. -firthlogit- will ignore the empty cell, just as -logit- would ignore it, too. The reason you were advised to move to -firthlogit- was because you had variables that perfectly predicted the outcome. That's what these messages were about:
                      Code:
                      note: 2.kepatuhan#2.usia != 0 predicts success perfectly
                      2.kepatuhan#2.usia dropped and 42 obs not used
                      
                      note: 2.kepatuhan#3.usia != 0 predicts success perfectly
                      2.kepatuhan#3.usia dropped and 1 obs not used
                      
                      note: 4.kepatuhan#3.usia != 0 predicts success perfectly
                      4.kepatuhan#3.usia dropped and 11 obs not used
                      The elimination of these observations and variables may have then produced empty cells in your model. But the empty cells themselves were not, and generally are not, problems.

                      Let's take a step back and look at things from a more abstract level.

                      Regressions (logistic or otherwise) are based on models of data. The model says that the outcome variable, or rather the expectation of some transform of the outcome variable, is given by some type of function of the predictor variables (typically a linear combination), and the actual outcome is distributed around that expectation according to some parametric distribution. The function of the predictor variables itself has some parameters (i.e. the coefficients) whose values are to be estimated from the data.

                      So that's the model. The coefficients, variance components, etc. that are part of the function and the parametric distribution are free parameters of the model. Their values are not known: the range of possible values of all of those parameters defines a family (typically an infinite family) of specific relationships between outcome and predictors. The task is then to figure out what the values of those parameters are. These parameters are also sometimes called estimands, that is, things to be estimated.

                      How do we learn the values of the estimands? Well, we don't do so exactly. Instead we carry out some kind of calculations on the data. The particular calculations (algorithm) are designed to give us estimates of the estimands, and in most cases, there is a reasonably well established theory that tells us how those estimates relate to the true values of the estimands. For example, with ordinary least squares regression, the algorithm is some simple matrix algebra, and elementary statistical theory tells us that the estimates produced by that matrix algebra are unbiased estimates of the estimands and also enables us to calculate the sampling variation of the estimates (standard errors).
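
                      For concreteness, that "simple matrix algebra" is the familiar least-squares formula: with predictor matrix $X$ (including a constant column) and outcome vector $y$, the estimates are
                      $$\hat{\beta} = (X'X)^{-1}X'y .$$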

                      In the case of logistic regression, the algorithm is not based on simple matrix calculations. Rather, there is a function of the parameters, called the likelihood function, which answers the question: given assumed values of the parameters, and assuming the model is a valid specification of the data-generating process, what is the probability of observing this particular set of data? One approach to estimating the estimands is to calculate estimators which maximize the likelihood function. These are called maximum likelihood estimators. There is theory that shows that under reasonable conditions, the sampling distribution of the maximum likelihood estimators is (asymptotically, in large samples) normal, with mean equal to the true values of the estimands. The theory also provides a calculation for the sampling variation (standard errors) based on something called the information matrix.
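
                      In symbols, for a binary outcome $y_i \in \{0,1\}$ with predictor vector $x_i$, the logistic log-likelihood is the standard expression
                      $$\ell(\beta) = \sum_{i=1}^{n}\left[\, y_i\, x_i'\beta - \ln\!\left(1 + e^{x_i'\beta}\right)\right],$$
                      and maximum likelihood estimation chooses the $\beta$ that maximizes $\ell(\beta)$.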

                      Now it is sometimes possible to have a model whose parameters can be estimated by different algorithms. For example, in ordinary linear regression, you could estimate the parameters using maximum likelihood. It turns out that for linear regression, the maximum likelihood estimates are equal to the estimates obtained by the least squares matrix algebra calculations, so nobody uses the more computationally intensive maximum likelihood approach. But for logistic regression, the maximum likelihood algorithm is the simplest available, so it is the one that is ordinarily used. Maximum likelihood estimation has a drawback, however: if one of the predictor variables can infallibly predict the outcome in the sample data, then the maximum likelihood estimate of that variable's coefficient is infinite (positive or negative). Since there is no way that the maximization algorithm can converge to an infinite value, Stata (and other statistical software) first looks for this situation, and if it finds it, it removes the offending predictor from the model and tries to fit the reduced model.

                      -firthlogit- estimates the parameters of the same model as -logit-, but it uses a different algorithm. Instead of maximum likelihood, it uses a modified procedure called penalized maximum likelihood. This modified procedure has the advantage that the estimates associated with variables that infallibly predict the outcome are finite, and therefore the algorithm can converge. So -firthlogit- and -logit- are two different algorithms for estimating the parameters of a logistic regression model. They will produce similar, but not identical, results when challenged with data that both of them can handle. The advantage of -firthlogit- is that it can handle certain data situations, such as variables that perfectly predict the outcome, that -logit- cannot.
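
                      In symbols, Firth's procedure maximizes the log-likelihood plus a penalty built from the information matrix $I(\beta)$:
                      $$\ell^{*}(\beta) = \ell(\beta) + \tfrac{1}{2}\ln\left|I(\beta)\right| .$$
                      It is this penalty term that keeps the coefficient estimates finite even when a predictor perfectly predicts the outcome in the sample.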



                      • #12
                        Hi Sir, thank you for your explanation yesterday.

                        Now I have a new problem: I have 2,986 respondents. In the bivariate model, all 2,986 respondents are included in the estimation, but in the multivariable model only 2,937 respondents are included. What's wrong, Sir?

                        Bivariate:
                        Code:
                        . firthlogit SBA i.planning, or

                        initial:     penalized log likelihood = -677.31722
                        rescale:     penalized log likelihood = -677.31722
                        Iteration 0: penalized log likelihood = -677.31722
                        Iteration 1: penalized log likelihood = -666.81693
                        Iteration 2: penalized log likelihood = -654.19632
                        Iteration 3: penalized log likelihood = -654.15091
                        Iteration 4: penalized log likelihood = -654.1509

                        Number of obs            =     2,986
                        Wald chi2(2)             =     52.35
                        Penalized log likelihood = -654.1509      Prob > chi2 = 0.0000
                        Multivariable:
                        Code:
                        . firthlogit SBA i.kepatuhan i.usia i.pendidikan i.kerja i.keputusan i.kuintil i.paritas i.komplikasi i.residen i.wilayah i.asuransi i.planning, or

                        initial:     penalized log likelihood = -611.63471
                        rescale:     penalized log likelihood = -611.63471
                        Iteration 0: penalized log likelihood = -611.63471
                        Iteration 1: penalized log likelihood = -546.48061 (not concave)
                        Iteration 2: penalized log likelihood = -522.00997
                        Iteration 3: penalized log likelihood = -519.76683 (not concave)
                        Iteration 4: penalized log likelihood = -516.37983
                        Iteration 5: penalized log likelihood = -513.48118
                        Iteration 6: penalized log likelihood = -503.08028
                        Iteration 7: penalized log likelihood = -502.93367
                        Iteration 8: penalized log likelihood = -502.93361
                        Iteration 9: penalized log likelihood = -502.93361

                        Number of obs            =     2,937
                        Wald chi2(27)            =    173.11
                        Penalized log likelihood = -502.93361     Prob > chi2 = 0.0000


                          • #14
                            -firthlogit-, like any other estimation command, excludes from the estimation sample any observation where any variable mentioned in the command has a missing value. So there must be some missing data among the variables you included in your model. Go look for them: if they are mistakes, fill them in. If not, then this is what your data will support, and you may want to look at options for dealing with missing data. That's a broad topic that would be too lengthy to discuss here.
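
                            A quick way to locate them (a sketch, using the variable names from your command above):
                            Code:
                            * tabulate missing values among the model variables
                            misstable summarize SBA kepatuhan usia pendidikan kerja keputusan kuintil paritas komplikasi residen wilayah asuransi planning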



                            • #15
                              My mistake, Sir:
                              there are some missing data.

                              Thank you, Sir.
