completely determined

surendra sivakumar

Join Date: May 2015

Posts: 3
#1


30 May 2015, 09:52

Hi, this is Surendra, and I am trying to run a pooled logit model for my dissertation. When I run the model with a few variables, it runs fine, but when I increase the number of variables, nothing apart from the coefficient values is displayed and the pseudo R-squared becomes 1. This is a case of successes and failures being completely determined. I have gone through the following page, http://www.stata.com/support/faqs/st...ic-regression/, but I am unable to understand it. Can someone please help me out with this problem?
Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#2

30 May 2015, 10:04

Your question would be easier to answer if you showed what you typed and exactly what Stata typed (or did) in response. See section 12 in the FAQ.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17851
#3

30 May 2015, 13:11

Surendra:
following Friedrich's sound advice, can you please report which predictor, once added, causes the problem to appear?

Kind regards,
Carlo
(Stata 19.0)
surendra sivakumar

Join Date: May 2015

Posts: 3
#4

30 May 2015, 15:10

This is the command I executed: (logit crisis rgdpchange grossi2gdp grossds2gdp govpb2gdp govdebt2gdp govdebt2rev govintpymt2rev cab2gdp extdebt ionextdebt M2 M2officialres liqrat liab2assets). These variables have been selected from a set of around 40 variables. When I run the model with individual variables, the results are displayed. However, when I run the command above (including the variables that are statistically significant at the 90% level), the following occurs (screenshot uploaded). Your help will be much appreciated. Thanks in advance, sir.

1 Photo
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#5

30 May 2015, 15:35

First, please don't post screen shots on this forum. What you attached is just barely legible on my computer; I'm sure some others will find it completely unreadable--they usually are. The way to show code and results is by putting them in a code block--then they are guaranteed readable. See the FAQ for how to create a code block.

That said, Stata has explained to you in the output exactly what the problem is. You just need to interpret what it is telling you.

The note below the output says 61 failures and 11 successes completely determined. That means that some combination of your predictor variables manages to perfectly discriminate success from failure outcomes in your data. That is, something like "if var1 = 1 and var2 = 0 then outcome always = 0" is true. And as it happens, your sample size is 72, which means that there are no actual observations left whose outcome is not completely determined by those variables. That's why there are no standard errors or tests. You have exhausted your data with these exclusions. The other telltale sign that you have exhausted your data is near the top of the output where the df for the model LR chi square is shown as -1.

You will have to try removing variables from your model until you get down to a subset of predictors that do not completely determine the outcomes of a large subset of your data. This is to a large extent a process of trial and error. But you can use the coefficients in your current output to guide you. When a variable completely determines an outcome, the maximum likelihood estimate of its coefficient is infinite (positive or negative). This is manifested in your output by the outlandishly large coefficients of many of your variables. In the real world, logistic regression coefficients greater in magnitude than 3 are almost never seen except in this complete determination scenario. And, frankly, logistic regression coefficients greater in magnitude than 1.5 are really rather uncommon (and themselves often suggestive of something wrong with the model or the data).

You have a very large number of predictor variables for a data set of only 72 observations. It is likely that some combinations of those variables occur only rarely, and in a small sample there is then a high probability that, by luck alone, the outcome variable will be constant within (i.e. completely determined by) such a combination. If removing variables from the model is distasteful, the alternative is to get more data so that you have enough observations in all combinations of these variables that there will be outcome variation in all (or nearly all) of those. In that case, your model will run and will eliminate few or no observations as completely determined.

But basically you have too many variables for this few observations. Even if you had not run into this particular problem, your results would have been unreliable in any case. You need to get a higher ratio of observations to variables.

So I would start by eliminating all the variables that have unreasonably large coefficients (which is nearly all of them) in that output from your model. You can then try adding some of them back, one at a time and you might be able to include a few of them without incurring the same problem.
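The add-back step can be sketched roughly as follows. This is only an illustration on Stata's shipped auto data, not the poster's data set: the outcome foreign and the candidate list (mpg weight length turn) are stand-ins, and the greedy loop is one assumed way to organize the trial and error described above. It relies on the e(N_cds) and e(N_cdf) results that -logit- stores for the number of completely determined successes and failures.

```stata
* Illustration only: auto.dta stands in for the real data, and
* the candidate list below is hypothetical.
sysuse auto, clear

local kept
foreach v in mpg weight length turn {
    * Try the model with one more candidate; -capture- keeps going on errors.
    capture logit foreign `kept' `v'
    * Keep the candidate only if the model ran and no outcomes
    * were completely determined.
    if _rc == 0 & e(N_cds) + e(N_cdf) == 0 {
        local kept `kept' `v'
    }
}

display "retained predictors: `kept'"
logit foreign `kept'
```

A variable that passes this check can still have a suspiciously large coefficient, so the final output deserves the same scrutiny as before.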
surendra sivakumar

Join Date: May 2015

Posts: 3
#6

30 May 2015, 15:49

Sir, sorry for attaching a screenshot; I was unaware of the rule and shall keep it in mind.

Thank you for the reply.

1. What is the adequate ratio of observations to variables? I shall first try to increase my data set, but could you tell me as to how many more would give me accurate results?

2. I tried removing the variables with large coefficients (after removing 2 variables, the results were displayed correctly). Does this mean that I can proceed with this model, or should I try to increase the number of observations?

Last edited by surendra sivakumar; 30 May 2015, 16:10.
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#7

30 May 2015, 16:33

1. What is the adequate ratio of observations to variables? I shall first try to increase my data set, but could you tell me as to how many more would give me accurate results?

There are various rules of thumb out there about this. The smallest one in common circulation is 10 observations per variable. I have seen people recommend 50 observations per variable. I don't think there is hard science or mathematics to back either of these rules, as I suspect that the adverse effects of insufficient observations depend a great deal on the specifics of the data.

The worry is that your model will overfit the data and will be primarily modeling the noise in your data rather than true relationships. Such a model will perform disastrously on replication. The AIC statistic (-estat ic- after running -logit-) is sometimes helpful in recognizing this situation. That said, your situation of just over 5 observations per variable is likely to be problematic no matter what.

2. I tried removing the variables with large coefficients (after removing 2 variables, the results were displayed correctly). Does this mean that I can proceed with this model, or should I try to increase the number of observations?

So you have reduced to 12 variables in 72 observations which gives you a ratio of 6:1. That's better than 5, of course, but still pretty skimpy. And the fact that you get a display of results doesn't mean that those results are credible. First of all, do you still have any really large coefficients (greater in magnitude than 3)? If so, I would say that you are still in a near-complete determination situation and you should continue to prune the model. Do you still have any coefficients greater in magnitude than 1.5? Those are suspicious and might also reflect an odd distribution of outcomes with that variable.

Is your model satisfactory in other respects? Does it discriminate well among the non-excluded observations? Is it reasonably well calibrated? (But if it's really very well calibrated, that is a red flag in your situation.) Are the standard errors you are seeing reasonable? Standard errors that are extremely large are another sign of trouble (of a different kind) for a model.
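Some of these checks can be run directly after -logit-. The sketch below uses Stata's shipped auto data with foreign, mpg, and weight purely as stand-ins for the crisis model; the choice of 10 groups for the Hosmer-Lemeshow test is likewise just an assumption for the example.

```stata
* Illustration only: auto.dta and these variables are stand-ins.
sysuse auto, clear
logit foreign mpg weight

* Observations per predictor (excluding the constant).
display "observations per predictor: " e(N)/e(df_m)

* AIC and BIC for comparing candidate models.
estat ic

* Discrimination: area under the ROC curve, without drawing the graph.
lroc, nograph

* Calibration: Hosmer-Lemeshow test on 10 groups of fitted probabilities.
estat gof, group(10)
```

The same commands apply unchanged after the crisis model, with its own variable list in place of the stand-ins.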
Marry Lee

Join Date: Nov 2020

Posts: 189
#8

25 Nov 2020, 03:35

Hello,
When comparing the number of variables to the number of observations, are year fixed effects and industry fixed effects counted as one variable for each year dummy and each industry dummy? For example, if we have 10 years in the data and include 9 year dummies, should I count that as estimating 9 variables or not?
I thank you in advance.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17851
#9

25 Nov 2020, 04:24

Marry:
I'm not sure I got your question right.
That said, if you are interested in estimating -i.years- and you have 10 years, Stata will give you back 9 coefficients (unless some are omitted for another reason), as one year will be treated as the reference category.

Kind regards,
Carlo
(Stata 19.0)
Rich Goldstein

Join Date: Mar 2014

Posts: 4546
#10

25 Nov 2020, 06:02

Marry Lee: first, this is a different issue and should have been posted as a new topic; second, if I understand your question correctly, then yes, these are 9 parameters being estimated, and for the purpose of using guidelines/rules of thumb about events per variable they should be counted as separate (9 in your example) variables; but note the wide-ranging and somewhat inconsistent literature on "events per variable".
Marry Lee

Join Date: Nov 2020

Posts: 189
#11

25 Nov 2020, 06:54

Thank you for your answer Carlo Lazzaro.
I am sorry I didn't explain my question well.
I found out that when estimating, we should make sure that the number of observations used is large enough compared to the number of variables used in the estimation.
Now I want to count the number of variables in my regression.
If I include year fixed effects, does that count as one variable, or should I count how many year dummies will be in the regression?
That's my question.

Rich Goldstein Thank you for your answer.
I am sorry I did not start a new post.

but note the wide-ranging and somewhat inconsistent literature on "events per variable"

Does that mean I should not pay a lot of attention to this?

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17851

#12

25 Nov 2020, 07:00

Marry:
as far as -i.year- is concerned, it contributes n-1 indicator variables for n distinct years, as you can see from the following toy example:

Code:

. use "https://www.stata-press.com/data/r16/nlswork.dta"
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

. xtreg ln_wage c.age##c.age i.year, fe vce(cluster idcode)

Fixed-effects (within) regression               Number of obs     =     28,510
Group variable: idcode                          Number of groups  =      4,710

R-sq:                                           Obs per group:
     within  = 0.1162                                         min =          1
     between = 0.1078                                         avg =        6.1
     overall = 0.0932                                         max =         15

                                                F(16,4709)        =      79.11
corr(u_i, Xb)  = 0.0613                         Prob > F          =     0.0000

                             (Std. Err. adjusted for 4,710 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0728746    .013687     5.32   0.000     .0460416    .0997075
             |
 c.age#c.age |  -.0010113   .0001076    -9.40   0.000    -.0012224   -.0008003
             |
        year |
         69  |   .0647054   .0155249     4.17   0.000     .0342693    .0951415
         70  |   .0284423   .0264639     1.07   0.283    -.0234395     .080324
         71  |   .0579959   .0384111     1.51   0.131    -.0173078    .1332996
         72  |   .0510671   .0502675     1.02   0.310    -.0474808     .149615
         73  |   .0424104   .0624924     0.68   0.497    -.0801038    .1649247
         75  |   .0151376    .086228     0.18   0.861    -.1539096    .1841848
         77  |   .0340933   .1106841     0.31   0.758    -.1828994     .251086
         78  |   .0537334   .1232232     0.44   0.663    -.1878417    .2953084
         80  |   .0369475   .1473725     0.25   0.802    -.2519716    .3258667
         82  |   .0391687   .1715621     0.23   0.819    -.2971733    .3755108
         83  |    .058766   .1836086     0.32   0.749    -.3011928    .4187249
         85  |   .1042758   .2080199     0.50   0.616    -.3035406    .5120922
         87  |   .1242272   .2327328     0.53   0.594    -.3320379    .5804922
         88  |   .1904977   .2486083     0.77   0.444    -.2968909    .6778863
             |
       _cons |   .3937532   .2469015     1.59   0.111    -.0902893    .8777957
-------------+----------------------------------------------------------------
     sigma_u |  .40275174
     sigma_e |  .30127563
         rho |  .64120306   (fraction of variance due to u_i)
------------------------------------------------------------------------------

.

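As you can see, the numerator degrees of freedom of the F test is 16: 14 of the 16 coefficients come from -i.year- (15 survey years minus the reference year), and the other 2 are age and its square.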
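The same count can be read from the stored results instead of the header. A minimal sketch, assuming the model above has just been fitted (it is refitted quietly here so the snippet is self-contained) and that e(df_m) holds the model degrees of freedom reported in the F statistic:

```stata
use "https://www.stata-press.com/data/r16/nlswork.dta", clear
quietly xtreg ln_wage c.age##c.age i.year, fe vce(cluster idcode)

* Model degrees of freedom: the number of slope coefficients estimated.
* This does not count the idcode fixed effects absorbed by -fe-.
display "estimated coefficients: " e(df_m)

* Rough observations-per-coefficient ratio for rule-of-thumb checks.
display "observations per coefficient: " e(N)/e(df_m)
```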

Kind regards,
Carlo
(Stata 19.0)


Marry Lee

Join Date: Nov 2020

Posts: 189
#13

25 Nov 2020, 08:04

Thank you for your answer Carlo Lazzaro.
That was really helpful.
Rich Goldstein

Join Date: Mar 2014

Posts: 4546
#14

25 Nov 2020, 08:10

re: #11 - you said,

Does that mean I should not pay a lot of attention to this?

no, it means you need to be aware of the issue and pre-plan how to answer critics; this usually depends on your substantive or scientific knowledge as well as any apparent issues with the results
Marry Lee

Join Date: Nov 2020

Posts: 189
#15

25 Nov 2020, 09:02

Thank you very much for your answer Rich Goldstein !

In my case, this leads to very large confidence intervals and almost no significant coefficients.
But there isn't much I can do because county fixed effects and stateXyear fixed effects have to be included in the regression and together they account for 57 variables.
So in total I have 83 variables while the number of observations is 240.