  • #16
    Steve,

    Thank you for your encouragement and detailed response. Because of time limitations, and because I have already completed the literature review and write-up, changing the topic or data set is not a luxury I can afford right now. Regarding your point 3 (changing non-events to events), would you by any chance have a publication in mind? I could study it and cite it in support of the procedure. I am meeting my supervisor on Monday and will discuss some of the options you suggested, along with what I have tried so far.

    Lastly, I suppose I should be wary of my xtgee results as well. I have a CINO in 27 of the 480 firm-years, so I suppose those results would be "overfitted" too.


    Best,
    Mohsin


    • #17
      I did some further analysis with xtgee, and with vce(robust) I get the following results:

      Code:
      xtgee cino atob_1 ten_1 coo_1 tmt_1 fyear dc_1 ari_1 hhi_1 oc0_1 lemp_1 td_1, family(binomial 1) link(logit) corr(ar1) vce(robust) nolog
      
      GEE population-averaged model                   Number of obs      =       480
      Group and time vars:              id fyear      Number of groups   =        96
      Link:                                logit      Obs per group: min =         5
      Family:                           binomial                     avg =       5.0
      Correlation:                         AR(1)                     max =         5
                                                      Wald chi2(11)      =     36.70
      Scale parameter:                         1      Prob > chi2        =    0.0001
      
                                           (Std. Err. adjusted for clustering on id)
      ------------------------------------------------------------------------------
                   |               Robust
              cino |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
            atob_1 |   .0084967    .270219     0.03   0.975    -.5211229    .5381162
             ten_1 |  -.0752956   .0493496    -1.53   0.127    -.1720191    .0214278
             coo_1 |  -.3792427   .8102774    -0.47   0.640    -1.967357    1.208872
             tmt_1 |   .3732676   .0847297     4.41   0.000     .2072006    .5393347
             fyear |   .2598641   .1514751     1.72   0.086    -.0370217    .5567499
              dc_1 |   .6400543   .6728775     0.95   0.341    -.6787613     1.95887
             ari_1 |   1.730946   5.717253     0.30   0.762    -9.474664    12.93656
             hhi_1 |   -.001132   .0011479    -0.99   0.324    -.0033818    .0011179
             oc0_1 |  -2.050357   .9198652    -2.23   0.026     -3.85326   -.2474548
            lemp_1 |  -.2630874    .307941    -0.85   0.393    -.8666407    .3404659
              td_1 |  -1.966205   .9827527    -2.00   0.045    -3.892365   -.0400453
             _cons |  -527.6172   305.1141    -1.73   0.084     -1125.63    70.39539
      ------------------------------------------------------------------------------
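      Without the robust option, the coefficients are the same but the standard errors differ: tmt_1 stays significant, while oc0_1 and td_1 are no longer significant at the 5% level (fyear was not significant at that level in either run):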

      Code:
      . xtgee cino atob_1 ten_1 coo_1 tmt_1 fyear dc_1 ari_1 hhi_1 oc0_1 lemp_1 td_1, family(binomial 1) link(logit) corr(ar1) nolog
      
      GEE population-averaged model                   Number of obs      =       480
      Group and time vars:              id fyear      Number of groups   =        96
      Link:                                logit      Obs per group: min =         5
      Family:                           binomial                     avg =       5.0
      Correlation:                         AR(1)                     max =         5
                                                      Wald chi2(11)      =     23.16
      Scale parameter:                         1      Prob > chi2        =    0.0168
      
      ------------------------------------------------------------------------------
              cino |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
            atob_1 |   .0084967   .2472919     0.03   0.973    -.4761866    .4931799
             ten_1 |  -.0752956   .0543156    -1.39   0.166    -.1817522     .031161
             coo_1 |  -.3792427   .8545886    -0.44   0.657    -2.054206     1.29572
             tmt_1 |   .3732676   .0894193     4.17   0.000     .1980089    .5485264
             fyear |   .2598641   .1793483     1.45   0.147    -.0916521    .6113803
              dc_1 |   .6400543   .6222863     1.03   0.304    -.5796044    1.859713
             ari_1 |   1.730946   2.733886     0.63   0.527    -3.627372    7.089264
             hhi_1 |   -.001132   .0010131    -1.12   0.264    -.0031177    .0008538
             oc0_1 |  -2.050357   1.140532    -1.80   0.072    -4.285758    .1850432
            lemp_1 |  -.2630874   .2387573    -1.10   0.271    -.7310431    .2048683
              td_1 |  -1.966205   1.114908    -1.76   0.078    -4.151385    .2189747
             _cons |  -527.6172   360.9528    -1.46   0.144    -1235.072    179.8374
      ------------------------------------------------------------------------------
      However, when I use the bootstrap option (based on one of the papers you suggested, which said bootstrapping can also be used to refine the analysis when "events" are limited), none of the results is significant:

      Code:
      . xtgee cino atob_1 ten_1 coo_1 tmt_1 fyear dc_1 ari_1 hhi_1 oc0_1 lemp_1 td_1, family(binomial 1) link(logit) corr(ar1) nolog vce(boot)
      (running xtgee on estimation sample)
      
      
      Bootstrap replications (50)
      ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
      x.xxxxxxx.xx..xx..xx..xxxxxxx.xxxx...xxx.x.xxxxxx.    50
      
      GEE population-averaged model                   Number of obs      =       480
      Group and time vars:              id fyear      Number of groups   =        96
      Link:                                logit      Obs per group: min =         5
      Family:                           binomial                     avg =       5.0
      Correlation:                         AR(1)                     max =         5
                                                      Wald chi2(11)      =      9.53
      Scale parameter:                         1      Prob > chi2        =    0.5729
      
                                           (Replications based on 96 clusters in id)
      ------------------------------------------------------------------------------
                   |   Observed   Bootstrap                         Normal-based
              cino |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
            atob_1 |   .0084967   .5187571     0.02   0.987    -1.008249    1.025242
             ten_1 |  -.0752956   .1384705    -0.54   0.587    -.3466927    .1961015
             coo_1 |  -.3792427   5.964878    -0.06   0.949    -12.07019     11.3117
             tmt_1 |   .3732676   .2752734     1.36   0.175    -.1662583    .9127936
             fyear |   .2598641   .3678355     0.71   0.480    -.4610802    .9808084
              dc_1 |   .6400543   6.916288     0.09   0.926    -12.91562    14.19573
             ari_1 |   1.730946   13.98974     0.12   0.902    -25.68844    29.15033
             hhi_1 |   -.001132    .001228    -0.92   0.357    -.0035388    .0012749
             oc0_1 |  -2.050357   1.386162    -1.48   0.139    -4.767186    .6664706
            lemp_1 |  -.2630874   .6449271    -0.41   0.683    -1.527121    1.000946
              td_1 |  -1.966205   2.115764    -0.93   0.353    -6.113026    2.180616
             _cons |  -527.6172   738.7412    -0.71   0.475    -1975.523    920.2889
      ------------------------------------------------------------------------------
      Note: one or more parameters could not be estimated in 35 bootstrap replicates;
            standard-error estimates include only complete replications.
      Should I take the bootstrap results as an indication that my xtgee model is overfitted, too?

      Best,
      Mohsin


      • #18
        The output from your original model states:
        Code:
        Note: 37 failures and 0 successes completely determined.
        Yet you say that you had only four "failures" (cino = 1). So, which is it?
        Steve Samuels
        Statistical Consulting

        Stata 14.2


        • #19
          Correction: You actually said you had eight events, not four.
          Last edited by Steve Samuels; 29 Aug 2015, 18:04.
          Steve Samuels

          • #20
            Hi Steve,

            Sorry, I mistyped 8. I meant 6. Please see below:

            Code:
            tab cino
            
               1= chief |
             innovation |
             officer is |
                present |      Freq.     Percent        Cum.
            ------------+-----------------------------------
                      0 |         88       93.62       93.62
                      1 |          6        6.38      100.00
            ------------+-----------------------------------
                  Total |         94      100.00


            • #21
              The number of groups in the xtgee analysis is 96, two more than in the logit analysis, because I dropped two firms from the logit sample. There I only keep CINO firms that had a CINO for more than half of the years in the observation period (following other authors, who look at "CINO-prone" firms).


              • #22
                I misread the message in your first post: the "37 failures" refer to cases of cino = 0. All these models are severely overfitted. You can report the results, but you must state that the estimates are badly biased and that the p-values and CIs are inaccurate (exactly how inaccurate, I can't tell; the overfitting is much worse than in any of the published simulations). This should be okay for a Master's thesis, but the results, in my opinion, should not be published.
                Steve Samuels

                • #23
                  Let's step back one more time: Is six the number of observations with cino = 1 in the GEE data set with n = 480?
                  Steve Samuels

                  • #24
                    Hi Steve,

                    Sorry for any confusion created. Hopefully this message will resolve the confusion.

                    GEE:
                    Total firms: 96
                    Total years: 5
                    Total firm years/number of observations: 480
                    Total firms that have a CINO: 8
                    Total firm-years in which a CINO is present: 27 (not all of these firms have a CINO present throughout all 5 years)

                    Logistic regression:
                    Total firms: 94 (the two CINO firms that did not have a CINO for more than half of the observation period were dropped, following previous authors)
                    Total firms that have a CINO: 6 (the other 2 did not have one for more than half of the observation period)
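
                    For anyone who wants to check counts like these directly, a rough sketch follows. It assumes cino is coded 0/1; ever_cino and firm_tag are names made up for this example.

                    Code:
                    * firm-years with a CINO present
                    count if cino == 1

                    * flag firms with a CINO in at least one year, then tabulate one row per firm
                    egen byte ever_cino = max(cino), by(id)
                    egen byte firm_tag  = tag(id)
                    tabulate ever_cino if firm_tag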

                    Hope this clarifies

                    Best,
                    Mohsin


                    • #25
                      Okay, so the number of events for the GEE is 27. This explains why the results look reasonable, although all models are overfitted. According to the Vittinghoff and McCulloch simulations, you might fit as many as 4 or 5 main effect predictors without overfitting.
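
                      As a rough sketch only, a reduced model of that size could be fit as below. The four predictors shown are placeholders picked for illustration; the actual choice should rest on theory, not on the p-values from the full model.

                      Code:
                      * placeholder predictor set; replace with 4 or 5 variables chosen on substantive grounds
                      xtgee cino tmt_1 ten_1 coo_1 lemp_1, family(binomial 1) link(logit) corr(ar1) vce(robust) nolog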
                      Steve Samuels

                      • #26
                        Exactly... I will discuss what I gathered so far with my supervisor tomorrow. I suppose I can state the results from GEE and mention that they are tentative at best due to overfitting and that results might not replicate in other studies.

                        One final question, in case I want to use all the predictors: one of the papers you suggested states that bootstrapping can also be used in such situations. From my understanding, bootstrapping treats the sample as the "population", draws multiple samples from this "population" (with replacement), and runs the regression on each of these samples. Would you advise that I use the bootstrap results to show that the initial results could be overfitted and that the bootstrap results are probably more reflective of the actual population? I am thinking along these lines so that I can show some results that are better than the overfitted ones.


                        Thank you for all your help!

                        Best,
                        Mohsin


                        • #27
                          Mohsin:
                          I would suggest proposing to your supervisor that you investigate what happens to your results when the number of predictors is 4 or 5 at most (as Steve suggests).
                          Personally, I would be more worried about overfitting than about lack of statistical significance (the latter is usually exogenous, as Steve comprehensively explained, and can be justified along the lines of "more research is needed"; the former is always researcher-dependent).
                          Kind regards,
                          Carlo
                          (Stata 19.0)


                          • #28
                            With 27 events and 11 predictors in the xtgee analyses, you have 27/11 = 2.45 events per predictor. This case is actually shown in the graphs on the right-hand side of Figures 1 and 2 in Vittinghoff and McCulloch (2007), albeit for main effect predictors only. However, these are for models with independent data. Standard errors and CIs with xtgee will be based on between-group variation, something not studied by these authors.
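
                            For reference, the ratio can be recomputed after any of the xtgee fits with something like the following sketch (the 11 is typed in by hand to match the model above):

                            Code:
                            * events in the estimation sample, divided by the number of predictors
                            quietly count if cino == 1 & e(sample)
                            display "events per predictor = " %4.2f r(N)/11

                            With 27 events this displays 2.45.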

                            I'm not sure that I would quote the bootstrap results as evidence of overfitting: the bootstrap can fail for other reasons with panel data. Notably, the companies in some bootstrap samples might have no events. Babyak advocates use of the bootstrap for estimating shrinkage factors, something you are not doing.
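
                            If the bootstrap is rerun anyway, the run above used only 50 replications, 35 of which failed, so its standard errors rest on 15 complete replicates. A hedged sketch with more replications and a fixed seed (both values are arbitrary choices for this example):

                            Code:
                            * firm-level (cluster) bootstrap; the reps() and seed() values are arbitrary
                            xtgee cino atob_1 ten_1 coo_1 tmt_1 fyear dc_1 ari_1 hhi_1 oc0_1 lemp_1 td_1, family(binomial 1) link(logit) corr(ar1) nolog vce(bootstrap, reps(1000) seed(12345))

                            The note under the coefficient table reports how many replicates could not be estimated; a large share of failures there is a sign that the bootstrap standard errors should not be relied on. None of this addresses the overfitting itself.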


                            You would still be better off with fewer predictors, but I think that with the xtgee analysis you are on firmer ground.

                            Good luck!



                            Babyak, MA. 2004. What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. Psychosom Med 66, no. 3: 411-421.

                            http://journals.lww.com/psychosomati...Brief,.21.aspx

                            Peduzzi, P, J Concato, E Kemper, TR Holford, and AR Feinstein. 1996. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 49: 1373-1379.

                            Vittinghoff, Eric, and Charles E McCulloch. 2007. Relaxing the rule of ten events per variable in logistic and Cox regression. American journal of epidemiology 165, no. 6: 710-718.

                            Steve Samuels

                            • #29
                              Carlo/Steve,

                              I talked to my supervisor. We decided that I will show the results as they are and also with fewer predictors. In the discussion I will then note that the results suffer from overfitting and are tentative at best. Luckily, I have one other analysis, on the performance consequences of having a CINO. It is an OLS regression in which CINO is an independent variable, so that analysis should not suffer from the same overfitting problem.

                              Thank you both for your help. I sincerely appreciate it!

                              Best,
                              Mohsin


                              • #30
                                Glad to hear this news.

                                Steve
                                Last edited by Steve Samuels; 01 Sep 2015, 16:02.
                                Steve Samuels