logit: how to deal with "completely determined" case (while using interaction terms)

Mohsin Khan

Join Date: Jul 2015

Posts: 66
#16

29 Aug 2015, 04:49

Steve,

Thank you for your encouragement and detailed response. Because of time constraints, and the fact that I have already done the relevant literature review and write-up, changing the topic or data set is not a luxury I can afford right now. Regarding your point 3 (changing non-events to events), would you by any chance have a publication in mind? I could study it and cite it in support of applying this procedure. I have a meeting with my supervisor this coming Monday, and I will discuss some of the options you suggested and the things I have tried so far.

Lastly, I suppose I should be wary of my xtgee results as well. I have a CINO in 27 of the 480 firm-years, so I suppose these results would be overfitted as well.

Best,
Mohsin
Comment

Mohsin Khan

Join Date: Jul 2015
Posts: 66

#17

29 Aug 2015, 05:41

I did some further analysis with xtgee, and with the robust option I get the following results:

Code:

xtgee cino atob_1 ten_1 coo_1 tmt_1 fyear dc_1 ari_1 hhi_1 oc0_1 lemp_1 td_1, family(binomial 1) link(logit) corr(ar1) vce(robust) nolog

GEE population-averaged model                   Number of obs      =       480
Group and time vars:              id fyear      Number of groups   =        96
Link:                                logit      Obs per group: min =         5
Family:                           binomial                     avg =       5.0
Correlation:                         AR(1)                     max =         5
                                                Wald chi2(11)      =     36.70
Scale parameter:                         1      Prob > chi2        =    0.0001

                                     (Std. Err. adjusted for clustering on id)
------------------------------------------------------------------------------
             |               Robust
        cino |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      atob_1 |   .0084967    .270219     0.03   0.975    -.5211229    .5381162
       ten_1 |  -.0752956   .0493496    -1.53   0.127    -.1720191    .0214278
       coo_1 |  -.3792427   .8102774    -0.47   0.640    -1.967357    1.208872
       tmt_1 |   .3732676   .0847297     4.41   0.000     .2072006    .5393347
       fyear |   .2598641   .1514751     1.72   0.086    -.0370217    .5567499
        dc_1 |   .6400543   .6728775     0.95   0.341    -.6787613     1.95887
       ari_1 |   1.730946   5.717253     0.30   0.762    -9.474664    12.93656
       hhi_1 |   -.001132   .0011479    -0.99   0.324    -.0033818    .0011179
       oc0_1 |  -2.050357   .9198652    -2.23   0.026     -3.85326   -.2474548
      lemp_1 |  -.2630874    .307941    -0.85   0.393    -.8666407    .3404659
        td_1 |  -1.966205   .9827527    -2.00   0.045    -3.892365   -.0400453
       _cons |  -527.6172   305.1141    -1.73   0.084     -1125.63    70.39539
------------------------------------------------------------------------------

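Without the robust option, the coefficients are the same but the standard errors differ: fyear is no longer significant at the 10% level, and oc0_1 and td_1 are no longer significant at the 5% level: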

Code:

. xtgee cino atob_1 ten_1 coo_1 tmt_1 fyear dc_1 ari_1 hhi_1 oc0_1 lemp_1 td_1, family(binomial 1) link(logit) corr(ar1) nolog

GEE population-averaged model                   Number of obs      =       480
Group and time vars:              id fyear      Number of groups   =        96
Link:                                logit      Obs per group: min =         5
Family:                           binomial                     avg =       5.0
Correlation:                         AR(1)                     max =         5
                                                Wald chi2(11)      =     23.16
Scale parameter:                         1      Prob > chi2        =    0.0168

------------------------------------------------------------------------------
        cino |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      atob_1 |   .0084967   .2472919     0.03   0.973    -.4761866    .4931799
       ten_1 |  -.0752956   .0543156    -1.39   0.166    -.1817522     .031161
       coo_1 |  -.3792427   .8545886    -0.44   0.657    -2.054206     1.29572
       tmt_1 |   .3732676   .0894193     4.17   0.000     .1980089    .5485264
       fyear |   .2598641   .1793483     1.45   0.147    -.0916521    .6113803
        dc_1 |   .6400543   .6222863     1.03   0.304    -.5796044    1.859713
       ari_1 |   1.730946   2.733886     0.63   0.527    -3.627372    7.089264
       hhi_1 |   -.001132   .0010131    -1.12   0.264    -.0031177    .0008538
       oc0_1 |  -2.050357   1.140532    -1.80   0.072    -4.285758    .1850432
      lemp_1 |  -.2630874   .2387573    -1.10   0.271    -.7310431    .2048683
        td_1 |  -1.966205   1.114908    -1.76   0.078    -4.151385    .2189747
       _cons |  -527.6172   360.9528    -1.46   0.144    -1235.072    179.8374
------------------------------------------------------------------------------

However, when I use the bootstrap option (this is based on one of the papers you suggested, which said bootstrapping could also be used to refine the analysis when "events" are limited), none of the results is significant:

Code:

. xtgee cino atob_1 ten_1 coo_1 tmt_1 fyear dc_1 ari_1 hhi_1 oc0_1 lemp_1 td_1, family(binomial 1) link(logit) corr(ar1) nolog vce(boot)
(running xtgee on estimation sample)


Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
x.xxxxxxx.xx..xx..xx..xxxxxxx.xxxx...xxx.x.xxxxxx.    50

GEE population-averaged model                   Number of obs      =       480
Group and time vars:              id fyear      Number of groups   =        96
Link:                                logit      Obs per group: min =         5
Family:                           binomial                     avg =       5.0
Correlation:                         AR(1)                     max =         5
                                                Wald chi2(11)      =      9.53
Scale parameter:                         1      Prob > chi2        =    0.5729

                                     (Replications based on 96 clusters in id)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
        cino |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      atob_1 |   .0084967   .5187571     0.02   0.987    -1.008249    1.025242
       ten_1 |  -.0752956   .1384705    -0.54   0.587    -.3466927    .1961015
       coo_1 |  -.3792427   5.964878    -0.06   0.949    -12.07019     11.3117
       tmt_1 |   .3732676   .2752734     1.36   0.175    -.1662583    .9127936
       fyear |   .2598641   .3678355     0.71   0.480    -.4610802    .9808084
        dc_1 |   .6400543   6.916288     0.09   0.926    -12.91562    14.19573
       ari_1 |   1.730946   13.98974     0.12   0.902    -25.68844    29.15033
       hhi_1 |   -.001132    .001228    -0.92   0.357    -.0035388    .0012749
       oc0_1 |  -2.050357   1.386162    -1.48   0.139    -4.767186    .6664706
      lemp_1 |  -.2630874   .6449271    -0.41   0.683    -1.527121    1.000946
        td_1 |  -1.966205   2.115764    -0.93   0.353    -6.113026    2.180616
       _cons |  -527.6172   738.7412    -0.71   0.475    -1975.523    920.2889
------------------------------------------------------------------------------
Note: one or more parameters could not be estimated in 35 bootstrap replicates;
      standard-error estimates include only complete replications.

Should I take the results of bootstrap as an indication that I am overfitting my model for xtgee, too?

Best,
Mohsin

Comment

Steve Samuels

Join Date: Mar 2014

Posts: 1786
#18

29 Aug 2015, 14:54

The output from your original model states:

Code:

Note: 37 failures and 0 successes completely determined.

Yet you say that you had only four "failures" (cino = 1). So, which is it?

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Comment
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#19

29 Aug 2015, 17:59

Correction: You actually said you had eight events, not four.

Last edited by Steve Samuels; 29 Aug 2015, 18:04.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Comment

Mohsin Khan

Join Date: Jul 2015
Posts: 66

#20

30 Aug 2015, 04:27

Hi Steve,

Sorry, I mistyped 8. I meant 6. Please see below:

Code:

tab cino

   1= chief |
 innovation |
 officer is |
    present |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         88       93.62       93.62
          1 |          6        6.38      100.00
------------+-----------------------------------
      Total |         94      100.00

Comment

Mohsin Khan

Join Date: Jul 2015

Posts: 66
#21

30 Aug 2015, 04:29

And the number of groups in the xtgee analysis is 96 rather than 94 because I dropped two firms from the logit analysis. This is because, in the logit analysis, I am only looking at firms that had a CINO for more than half of the years during the observation period (I am following what the other authors did here, i.e. looking at "CINO-prone" firms).
Comment
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#22

30 Aug 2015, 07:35

I misread the message in your first post: the "37 failures" refer to cases of cino = 0. All these models are severely overfitted. You can report the results, but you must state that the estimates are badly biased and that the p-values and CIs are inaccurate. (Exactly how inaccurate, I can't tell; the overfitting is much worse than in any of the published simulations.) This should be okay for a Master's thesis, but the results, in my opinion, should not be published.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Comment
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#23

30 Aug 2015, 08:09

Let's step back one more time: Is six the number of observations with cino = 1 in the GEE data set with n = 480?

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Comment
Mohsin Khan

Join Date: Jul 2015

Posts: 66
#24

30 Aug 2015, 08:36

Hi Steve,

Sorry for any confusion created. Hopefully this message will resolve the confusion.

GEE:
Total firms: 96
Total years: 5
Total firm years/number of observations: 480
Total firms that have a CINO: 8
Total firm years in which a CINO is present: 27 (this is because not all firms have a CINO present throughout all 5 years)

Logistic regression:
Total firms: 94 (this is because firms that did not have a CINO for more than half of the observation period were dropped - following previous authors here)
Total firms that have a CINO: 6 (2 firms did not have it for more than half of the observation period)

Hope this clarifies
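
For anyone wanting to check counts like these in their own data, a minimal sketch follows. It assumes the panel is in long form with one row per firm-year and uses the variable names from the commands earlier in the thread (id, fyear, cino). The helper variables ever_cino and firm_tag are made-up names for illustration.

Code:

```stata
* Sketch only: assumes long-form data, one row per firm-year
xtset id fyear

* Firm-years in which a CINO is present (27 in the GEE sample described above)
count if cino == 1

* Flag firms that ever have a CINO (ever_cino is a hypothetical helper name)
egen byte ever_cino = max(cino), by(id)

* Mark exactly one row per firm (firm_tag is a hypothetical helper name)
egen byte firm_tag = tag(id)

* Number of firms (96 in the GEE sample described above)
count if firm_tag == 1

* Number of firms with a CINO in at least one year (8 in the GEE sample)
count if firm_tag == 1 & ever_cino == 1
```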

Best,
Mohsin
Comment
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#25

30 Aug 2015, 09:05

Okay, so the number of events for the GEE is 27. This explains why the results look reasonable, although all models are overfit. According to the Vittinghoff simulations, you might fit as many as 4 or 5 main-effect predictors without overfitting.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Comment
Mohsin Khan

Join Date: Jul 2015

Posts: 66
#26

30 Aug 2015, 09:25

Exactly... I will discuss what I gathered so far with my supervisor tomorrow. I suppose I can state the results from GEE and mention that they are tentative at best due to overfitting and that results might not replicate in other studies.

One final question here in case I want to use all the predictors: one of the papers you suggested states that bootstrapping can also be used in such situations. From my understanding, bootstrapping treats the sample as the "population", draws multiple samples from this "population" (with replacement), and runs the regression on each of these samples. Would you advise that I use the bootstrap results to show that the initial results could be overfitted and that the bootstrap results are probably more reflective of the actual population? I am thinking along these lines so that I can show some results (which are better than the overfitted ones).
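
As a rough illustration of the resampling described above: with xtgee, vce(bootstrap) resamples whole panels (firms) rather than individual firm-years, as the "Replications based on 96 clusters in id" line in the output in post #17 indicates. The sketch below reuses the model from that post. The number of replications and the seed are arbitrary choices for illustration, not recommendations from this thread.

Code:

```stata
* Sketch only: same model as in post #17, with more replications
* and a fixed seed so that the run is reproducible
xtset id fyear
xtgee cino atob_1 ten_1 coo_1 tmt_1 fyear dc_1 ari_1 hhi_1 oc0_1 lemp_1 td_1, ///
    family(binomial 1) link(logit) corr(ar1) nolog                             ///
    vce(bootstrap, reps(1000) seed(12345))
```

In the replication log, each "x" marks a replicate in which the model could not be estimated. In the run in post #17 that happened in 35 of 50 replicates, so the reported bootstrap standard errors rest on only 15 resamples.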

Thank you for all your help!

Best,
Mohsin
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#27

30 Aug 2015, 09:46

Mohsin:
I would propose to your supervisor that you investigate what happens to your results when the number of predictors is 4 or 5 at most (as Steve suggests).
Personally, I would be more worried about overfitting than about lack of statistical significance (the latter is usually exogenous, as Steve comprehensively explained, and can be justified along the lines of "more research is needed"; the former is always researcher-dependent).
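
A sketch of what such a reduced model could look like is below. The four predictors kept here are simply the first four from the full model and serve only as a placeholder. The thread does not say which predictors to keep, and that choice is better made on substantive grounds than by keeping whatever was significant in the full model.

Code:

```stata
* Sketch only: the subset of predictors is a hypothetical placeholder
xtset id fyear
xtgee cino atob_1 ten_1 coo_1 tmt_1, ///
    family(binomial 1) link(logit) corr(ar1) vce(robust) nolog
```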

Kind regards,
Carlo
(Stata 19.0)
Comment
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#28

30 Aug 2015, 10:31

With 27 events and 11 predictors in the xtgee analyses, you have 27/11 = 2.45 events per predictor. This case is actually shown in the graphs on the right-hand side of Figures 1 and 2 in Vittinghoff and McCulloch (2007), albeit for main-effect predictors only. However, these are for models with independent data. Standard errors and CIs with xtgee will be based on between-group variation, something not studied by these authors.
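
The events-per-predictor arithmetic can be reproduced directly in Stata. The sketch below hard-codes the counts quoted in this thread (27 events, 11 predictors) and also shows the ratio for the 4 or 5 predictors mentioned in post #25. The local macro names are arbitrary, and the lines need to be run together (for example from a do-file) so that the local macro stays defined.

Code:

```stata
* Sketch only: counts are taken from the posts above
local events = 27
foreach k of numlist 11 5 4 {
    display "predictors = `k'   events per predictor = " %4.2f `events'/`k'
}
```

This gives 2.45, 5.40, and 6.75 events per predictor for 11, 5, and 4 predictors respectively.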

I'm not sure that I would quote the bootstrap results as evidence of overfitting: the bootstrap can fail for other reasons with panel data. Notably, the companies in some bootstrap samples might have no events. Babyak advocates use of the bootstrap for estimating shrinkage factors, something you are not doing.

You would still be better off with fewer predictors, but I think that with the xtgee analysis you are on firmer ground.

Good luck!

Babyak, MA. 2004. What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. Psychosom Med 66, no. 3: 411-421.

http://journals.lww.com/psychosomati...Brief,.21.aspx

Peduzzi PN, Concato J, Kemper E, Holford TR, Feinstein AR. 1996. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 49: 1373-1379.

Vittinghoff E, McCulloch CE. 2007. Relaxing the rule of ten events per variable in logistic and Cox regression. Am J Epidemiol 165, no. 6: 710-718.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Comment
Mohsin Khan

Join Date: Jul 2015

Posts: 66
#29

01 Sep 2015, 02:55

Carlo/Steve,

I talked to my supervisor. We decided that I will show the results as-is and also show them with fewer predictors. Then, in the discussion, I will note that the results suffer from overfitting and are tentative at best. Luckily, I have one other analysis, i.e. the performance consequences of having a CINO. It is an OLS regression in which CINO is an independent variable, so that analysis is not going to suffer from overfitting.

Thank you both for your help. I sincerely appreciate it!

Best,
Mohsin
Comment
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#30

01 Sep 2015, 15:03

Glad to hear this news.

Steve

Last edited by Steve Samuels; 01 Sep 2015, 16:02.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Comment
