  • logit: how to deal with "completely determined" case (while using interaction terms)

    Hello Statalisters,

    I am working on my thesis, testing certain hypotheses related to the presence or absence of chief innovation officers in companies. I am having a problem with adding an interaction term: I get the "completely determined" note. I went through the FAQ explanation of the completely determined case quite a few times, but I can't figure out what it really means (I have a very weak background in econometrics). Here is what happens:

    Code:
    logit cino asg_1 c.avten##c.poc0 pcoo avtmt pdc avri avhhi  avlemp avtd, vce(robust) nolog
    
    Logistic regression                               Number of obs   =         94
                                                      Wald chi2(11)   =     601.58
                                                      Prob > chi2     =     0.0000
    Log pseudolikelihood = -7.1855934                 Pseudo R2       =     0.6780
    
    --------------------------------------------------------------------------------
                   |               Robust
              cino |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    ---------------+----------------------------------------------------------------
             asg_1 |    26.6738   9.313385     2.86   0.004     8.419897     44.9277
             avten |   .1652309   .0981544     1.68   0.092    -.0271482    .3576099
              poc0 |   897.4599   97.28513     9.23   0.000     706.7845    1088.135
                   |
    c.avten#c.poc0 |  -755.3955   79.74684    -9.47   0.000    -911.6964   -599.0946
                   |
              pcoo |  -.6168743   2.602292    -0.24   0.813    -5.717274    4.483525
             avtmt |   1.217036   .3969629     3.07   0.002     .4390034    1.995069
               pdc |   1.779189   2.026467     0.88   0.380    -2.192613    5.750992
              avri |  -16.00313   7.050492    -2.27   0.023    -29.82184   -2.184417
             avhhi |   .0006492   .0013182     0.49   0.622    -.0019344    .0032327
            avlemp |   .5766414     1.0779     0.53   0.593    -1.536004    2.689287
              avtd |   -6.27929   2.206887    -2.85   0.004    -10.60471   -1.953871
             _cons |  -20.94125   5.360846    -3.91   0.000    -31.44831   -10.43418
    --------------------------------------------------------------------------------
    Note: 37 failures and 0 successes completely determined.
    As you can see, the coefficient on the interaction term is implausibly large, and so is the coefficient on poc0. avten is, for a given firm, the average tenure of CEOs during my observation period. poc0 is, for a given firm, the proportion of years in which the company has had an outsider CEO:

    Code:
    tab poc0
    
     Proportion |
    of years in |
          which |
       Outsider |
         CEO is |
     present (0 |
        cutoff) |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         58       61.70       61.70
             .2 |          4        4.26       65.96
             .4 |          2        2.13       68.09
             .6 |          5        5.32       73.40
             .8 |          2        2.13       75.53
              1 |         23       24.47      100.00
    ------------+-----------------------------------
          Total |         94      100.00
    To add some further background, the analysis above is done to support my first analysis - which is taking a balanced dataset of the firms for 5 years and using xtgee. When I use xtgee and use the interaction terms of tenure and outsider ceo, it works fine (though in that analysis, outsider ceo is taken as a binary variable). Can someone please guide me in the right direction here?

    Please let me know if further details are required. Thank you.

    Kind regards,
    Mohsin
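A minimal diagnostic sketch for the note above, assuming the same 94-firm dataset is in memory. The variable name `phat` and the cutoff `1e-6` are illustrative choices only; they are not Stata's internal rule for flagging "completely determined" observations.

```stata
* How many events are there, and how do they line up with poc0?
tabulate cino
tabulate poc0 cino

* Refit the model from the post, then compute fitted probabilities
logit cino asg_1 c.avten##c.poc0 pcoo avtmt pdc avri avhhi avlemp avtd, vce(robust) nolog
predict double phat, pr

* List observations whose fitted probability is numerically 0 or 1
* (1e-6 is an arbitrary, illustrative cutoff)
list cino avten poc0 phat if e(sample) & (phat < 1e-6 | phat > 1 - 1e-6)
```

If the listed firms all share one pattern of covariate values (for example, a particular range of poc0 combined with avten), that pattern is a likely source of the very large coefficients.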

  • #2
    For how many observations does cino = 1?
    Steve Samuels
    Statistical Consulting

    Stata 14.2

    • #3
      Mohsin:
      I do not know whether what follows will be helpful, but I notice that poc0 is not really continuous, as it moves in jumps of 0.2 in -tab-.
      Kind regards,
      Carlo
      (Stata 19.0)

      • #4
        Steve, it's only for 8 out of a total of 94 observations.

        Carlo, to give you some background, I have a dataset of 94 companies over 5 years, giving 470 firm-years in total. First I run an xtgee analysis, and then, to support it, I run a logit analysis (this time taking the average of each variable over the 5-year period, which reduces the dataset from 470 observations to 94). In the logit analysis, the variable poc0 is the proportion of years in which there is an outsider CEO (which was used as a binary variable in the xtgee analysis), hence the jumps of .2. With jumps of .2, should I not be treating it as a continuous variable?

        Thank you, both.

        Kind regards,
        Mohsin

        • #5
          What you have done is called "overfitting". With eight events you can fit only 1 or 2 predictors at a time. Simulations have shown that you need at least 5-9 events per predictor (actually, the smaller of the number of events and non-events). However, those simulations did not include interaction terms, which would probably increase the requirement. See Vittinghoff and McCulloch, 2007.
          References:

          Babyak, MA. 2004. What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. Psychosom Med 66, no. 3: 411-421.
          http://journals.lww.com/psychosomati...Brief,.21.aspx

          Peduzzi PN, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 1996;49:1373–9.

          Vittinghoff, Eric, and Charles E McCulloch. 2007. Relaxing the rule of ten events per variable in logistic and Cox regression. American journal of epidemiology 165, no. 6: 710-718.



          Steve Samuels
          Statistical Consulting

          Stata 14.2

          • #6
            What you have done is called "overfitting". With eight events you can fit only 1 or 2 predictors at a time. Let n* = the smaller of the number of events and non-events. In your study n* = 8. Simulations (Vittinghoff and McCulloch, 2007) have shown that for good estimation, the ratio n*/(# predictors) should be at least 5-9. Those simulations did not include interactions, which would probably increase the requirement.
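The ratio can be checked directly; this is a rough sketch, assuming the 94-firm dataset from the first post is in memory. The value 11 is the number of coefficients (excluding the constant) in the logit output shown there, and the local macro names are illustrative.

```stata
* Count events and non-events, then take the smaller of the two (n*)
quietly count if cino == 1
local events = r(N)
quietly count if cino == 0
local nonevents = r(N)
local nstar = min(`events', `nonevents')

* 11 = number of predictors in the original model, per Wald chi2(11)
local k = 11
display "n* = `nstar'; predictors = `k'; n*/predictors = " %4.2f `nstar'/`k'
```

With n* = 8 and 11 predictors the ratio is about 0.73, far below the 5-9 range cited above.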

            References:
            Babyak, MA. 2004. What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. Psychosom Med 66, no. 3: 411-421.

            Peduzzi PN, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 1996;49:1373–9.

            Vittinghoff, Eric, and Charles E McCulloch. 2007. Relaxing the rule of ten events per variable in logistic and Cox regression. American journal of epidemiology 165, no. 6: 710-718.


            Steve Samuels
            Statistical Consulting

            Stata 14.2

            • #7
              Mohsin:
              I'm not clear about why you ran -logit- rather than -xtlogit- "to give [-xtgee-] support".
              Kind regards,
              Carlo
              (Stata 19.0)

              • #8
                Sorry for the double post. Try exlogistic if you are interested in, say, three predictors. You can condition on a small number of others that you consider to be nuisance variables.
                Steve Samuels
                Statistical Consulting

                Stata 14.2

                • #9
                  Steve,

                  Thank you so much for your help! Sorry for the late reply; I was going through the papers you pointed me to and, given my weak background in econometrics, it took me some time to understand them. I have to say, the papers were a bit of an eye-opener for me, because up until now I was going to believe some of the results I had, but now I should take everything with a grain of salt. I tried your suggestion and got the following results:

                  Code:
                  exlogistic cino avri avhhi avtd, cond(asg_1  avtmt pdc) memory(2g)
                  
                  note: distribution for (avri | avhhi avtd) is degenerate
                  note: distribution for (avhhi | avri avtd) is degenerate
                  note: distribution for (avtd | avri avhhi) is degenerate
                  
                  Exact logistic regression
                                                                   Number of obs =         94
                  ---------------------------------------------------------------------------
                          cino | Odds Ratio       Suff.  2*Pr(Suff.)     [95% Conf. Interval]
                  -------------+-------------------------------------------------------------
                          avri |          1   -.2155033                         0       +Inf
                         avhhi |          1    2439.134                         0       +Inf
                          avtd |          1    .6660169                         0       +Inf
                  ---------------------------------------------------------------------------
                  May I know how to interpret this?

                  Thank you!

                  Kind regards,
                  Mohsin

                  • #10
                    Carlo,

                    With some bad luck, I took this topic for my thesis (but I have a very weak background in econometrics) and am basically following 2 papers that have done similar work. As I mentioned earlier, in the first analysis I use xtgee and I hope I am doing this correctly because the papers just state:

                    Paper 1: "In a first analysis, we pooled the longitudinal data.... we applied pooled logistic regression, using generalized estimating equations"
                    Paper 2: "Given that our dependent variable of CMO presence is binary and that we repeatedly observe the same firms over a period of time, we used the generalized estimating equations (GEE) approach"

                    And for the second analysis they state:

                    Paper 1: "To verify our first analysis, we conducted a second analysis assuming that some firms are generally more prone to have a CSO in their TMT than others...we averaged all time-varying variables across the study’s five-year period and subsequently applied a logistic regression analysis"
                    Paper 2: "We provided this second analysis as a support for the results from the longitudinal analysis... the results from this analysis, which employs logistic regression"

                    Based on this, I used xtgee for the first type of analysis and logistic regression for the second type of analysis.
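The two analyses described here might look roughly like the sketch below. Everything not shown elsewhere in the thread is an assumption: the names `firm_id`, `year`, `tenure`, `outsider`, `x1` and `x2` are hypothetical stand-ins for the firm-year variables, the exchangeable working correlation is only one possible choice, and taking the maximum of cino within a firm is a guess at how the outcome was aggregated.

```stata
* Hypothetical names: firm_id, year, tenure, outsider (0/1), x1, x2
xtset firm_id year

* First analysis: pooled logistic regression via GEE on the 470 firm-years
* (exchangeable correlation is an assumption, not taken from the papers)
xtgee cino c.tenure##i.outsider x1 x2, family(binomial) link(logit) ///
    corr(exchangeable) vce(robust)

* Second analysis: average the time-varying variables, one row per firm
* ((max) cino is an assumed aggregation rule for the outcome)
collapse (mean) avten=tenure poc0=outsider x1 x2 (max) cino, by(firm_id)
logit cino c.avten##c.poc0 x1 x2, vce(robust) nolog
```

Averaging a 0/1 indicator over five years is what produces the values 0, .2, .4, .6, .8 and 1 seen in the tabulation of poc0.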

                    Paper 1: Menz, M., & Scheef, C. (2014). Chief strategy officers: Contingency analysis of their presence in top management teams. Strategic Management Journal,35(3), 461-471.
                    Paper 2: Nath, P., & Mahajan, V. (2008). Chief marketing officers: A study of their presence in firms' top management teams. Journal of Marketing, 72(1), 65-81.

                    When having a jump of .2 in poc0, can I not use it as a continuous variable?

                    Thank you!

                    Kind regards,
                    Mohsin

                    • #11
                      Steve/Carlo,

                      I just want to add one thing here: I know I am not knowledgeable about Stata or the background needed for this kind of analysis. Because of this, I am sure my posts sound stupid at times. You are experts in this area, and I hope that my lack of knowledge in a field you hold so close does not offend you in any way. I am trying my best to keep up with what you post, and before I reply I make sure I try to understand what you have suggested. So if it ever seems as though I did not put in the effort to understand, please know that it is not because of any unwillingness to learn, but because of my limited background in this field. I am under a lot of pressure because my deadline is around the corner, and I am trying my best to understand what you post in the limited time that I have.

                      Thank you!
                      Mohsin

                      • #12
                        Mohsin:
                        I consider myself a learner too, so this is not an issue.
                        I still find it strange that the authors used a logistic regression on averaged values instead of a panel data analysis, but the reviewers clearly found this approach sound.
                        I am not familiar with -exlogistic-, so I cannot help you out in this respect.
                        However, Steve raised a serious issue concerning overfitting. In a nutshell, you seem to have too many predictors for a (too) limited sample size.
                        Please consider that there should be 20 observations per predictor (Katz MH. Multivariable Analysis. Second Edition. NY: Cambridge University Press, 2006: 81), even though 10 observations per predictor may be good enough.
                        Kind regards,
                        Carlo
                        (Stata 19.0)

                        • #13
                          The infinite CIs mean that there is not enough information in the data for exlogistic to estimate the model. firthlogit (SSC) is a command I've never tried, but is also said to handle difficult data. I suspect that it too won't work.
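For anyone who wants to try it anyway, a minimal sketch follows, assuming the user-written firthlogit package from SSC and a deliberately small model. The choice of avri and avtd as predictors is purely illustrative, and with only eight events there is no guarantee the estimates will be usable.

```stata
* firthlogit is user-written; install it once from SSC
ssc install firthlogit

* Penalized-likelihood (Firth) logit with a small set of predictors
* (avri and avtd are picked here for illustration only)
firthlogit cino avri avtd
```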

                          I think that I can speak for Carlo in saying that we are impressed by the effort that you are putting in. The inadequacy of the data is not your fault. When students have encountered this kind of problem in the past, we have solved it in a number of ways. Some solutions pertinent to your situation, listed in rough order of preference: 1) find a different, but related, question that can be answered with the data; 2) analyze the same question in a different data set; 3) alter the data (change non-events to events) at random, so that you can do a "valid", though unpublishable analysis; 4) use a smaller number of predictors and find some models without the "completely determined" error; this analysis is also not publishable. I recommend that you have a conversation with your Advisor about your options. One possibility: make the thesis about this problem and the solutions that you've tried.
                          Last edited by Steve Samuels; 28 Aug 2015, 21:09.
                          Steve Samuels
                          Statistical Consulting

                          Stata 14.2

                          • #14
                            Steve:
                            very well said.
                            Kind regards,
                            Carlo
                            (Stata 19.0)

                            • #15
                              Carlo,

                              Thank you - I will also have a discussion with my supervisor. Let's hope he has some suggestions.

                              Best,
                              Mohsin
