GEE: Correlation structure

Mohsin Khan

Join Date: Jul 2015

Posts: 66
#1


23 Jul 2015, 14:09

Hi All,

I am doing an analysis where my dependent variable is binary: presence (1) or absence (0) of a chief innovation officer in the top management team. My dataset is a balanced panel of 100 firms observed over 5 years. Since the same firms are repeated, I initially used an exchangeable correlation structure. However, the Wald chi2 and Prob > chi2 (16.48 and .1243 respectively) are much weaker than those under the independent correlation structure (37.43 and 0.0001). Moreover, the independent variables that are significant also vary between the two setups. Can someone please guide me on how to verify which correlation structure to use?

Kind regards,
Mohsin
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#2

25 Jul 2015, 21:15

Welcome to Statalist, Mohsin!

FAQ section 12 asks that you show the exact commands and all results of those commands. Please do so and put the commands and results inside CODE delimiters, as the section asks. I'm curious why you chose only the independence and exchangeable options to compare. More realistic would be "unstructured" and "ar". For an example of comparing correlation structures, see the section on estat wcorrelation in the manual entry for xtgee postestimation.
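A minimal sketch of that kind of comparison, assuming Stata's nlswork2 example dataset and its variables (ln_wage, grade, age, idcode, year) rather than the data in this thread. It fits the same mean model under two working correlation structures and displays the estimated within-group correlation matrix after each fit.

```stata
* Illustrative only: nlswork2 is a Stata example dataset, not the poster's data
webuse nlswork2, clear
xtset idcode year

* Exchangeable working correlation, then show the estimated matrix
xtgee ln_wage grade age c.age#c.age, corr(exchangeable) nolog
estat wcorrelation

* Unstructured working correlation, for comparison
xtgee ln_wage grade age c.age#c.age, corr(unstructured) nolog
estat wcorrelation
```

If the unstructured matrix shows correlations that decay with the distance between time points, that points toward an autoregressive structure rather than an exchangeable one.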

Last edited by Steve Samuels; 25 Jul 2015, 21:18.

Steve Samuels
Statistical Consulting

Stata 14.2

Mohsin Khan

Join Date: Jul 2015
Posts: 66

26 Jul 2015, 06:53

Hi Steven,

Thank you for helping me organise my post. Please find the commands and results below:

Ind structure:

Code:

xtgee cino tmt ten oceo0 dceo coo aroa_1 ari_1 asg_1 alat_1 hhi_1, family(binomial 1) link(logit) corr(ind) nolog

Results:
GEE population-averaged model                   Number of obs      =       500
Group variable:                         id      Number of groups   =       100
Link:                                logit      Obs per group: min =         5
Family:                           binomial                     avg =       5.0
Correlation:                   independent                     max =         5
                                                Wald chi2(10)      =     36.22
Scale parameter:                         1      Prob > chi2        =    0.0001

Pearson chi2(500):                  377.96      Deviance           =    185.20
Dispersion (Pearson):             .7559285      Dispersion         =  .3704039

------------------------------------------------------------------------------
        cino |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         tmt |   .3322662   .0579783     5.73   0.000     .2186309    .4459015
         ten |  -.0103753   .0380548    -0.27   0.785    -.0849613    .0642107
       oceo0 |  -.1421095   .4680038    -0.30   0.761     -1.05938    .7751611
        dceo |   .3836575   .4472361     0.86   0.391    -.4929091    1.260224
         coo |   .1534408   .5733054     0.27   0.789    -.9702172    1.277099
      aroa_1 |   .2502174   2.142053     0.12   0.907    -3.948129    4.448564
       ari_1 |  -.2168398   1.289913    -0.17   0.867    -2.745023    2.311343
       asg_1 |   .0600475   .3922838     0.15   0.878    -.7088146    .8289096
      alat_1 |  -.3129255    .169878    -1.84   0.065    -.6458802    .0200293
       hhi_1 |  -.0017149   .0007905    -2.17   0.030    -.0032642   -.0001656
       _cons |  -4.767029   .7401897    -6.44   0.000    -6.217774   -3.316284
------------------------------------------------------------------------------

Exc structure:

Code:

xtgee cino tmt ten oceo0 dceo coo aroa_1 ari_1 asg_1 alat_1 hhi_1, family(binomial 1) link(logit) corr(exc) nolog

Results:
GEE population-averaged model                   Number of obs      =       500
Group variable:                         id      Number of groups   =       100
Link:                                logit      Obs per group: min =         5
Family:                           binomial                     avg =       5.0
Correlation:                  exchangeable                     max =         5
                                                Wald chi2(10)      =     14.50
Scale parameter:                         1      Prob > chi2        =    0.1515

------------------------------------------------------------------------------
        cino |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         tmt |   .1105161   .0626282     1.76   0.078    -.0122329    .2332651
         ten |   .0759874   .0389548     1.95   0.051    -.0003625    .1523373
       oceo0 |    .200225   .5087608     0.39   0.694    -.7969278    1.197378
        dceo |   .2377148   .4733204     0.50   0.616    -.6899762    1.165406
         coo |   .1153037   .4676706     0.25   0.805    -.8013138    1.031921
      aroa_1 |  -2.875153   1.415628    -2.03   0.042    -5.649733   -.1005725
       ari_1 |  -.0933946   .6585011    -0.14   0.887    -1.384033    1.197244
       asg_1 |   .1679215   .2662192     0.63   0.528    -.3538586    .6897016
      alat_1 |   .2121619    .240677     0.88   0.378    -.2595563    .6838802
       hhi_1 |   .0013722   .0007857     1.75   0.081    -.0001678    .0029123
       _cons |  -5.816824   1.213502    -4.79   0.000    -8.195245   -3.438403
------------------------------------------------------------------------------

Uns structure gives the following error:
convergence not achieved
r(430);

Code:

xtgee cino tmt ten oceo0 dceo coo aroa_1 ari_1 asg_1 alat_1 hhi_1, family(binomial 1) link(logit) corr(uns) nolog

Finally, using ar structure:

Code:

xtgee cino tmt ten oceo0 dceo coo aroa_1 ari_1 asg_1 alat_1 hhi_1, family(binomial 1) link(logit) corr(ar 1) nolog

Results:
GEE population-averaged model                   Number of obs      =       500
Group and time vars:              id fyear      Number of groups   =       100
Link:                                logit      Obs per group: min =         5
Family:                           binomial                     avg =       5.0
Correlation:                         AR(1)                     max =         5
                                                Wald chi2(10)      =      5.74
Scale parameter:                         1      Prob > chi2        =    0.8366

------------------------------------------------------------------------------
        cino |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         tmt |   .0252484   .0632784     0.40   0.690     -.098775    .1492718
         ten |   .0217006   .0376289     0.58   0.564    -.0520507    .0954519
       oceo0 |   .0232131   .4863239     0.05   0.962    -.9299643    .9763905
        dceo |   .0473541   .4276206     0.11   0.912    -.7907669    .8854752
         coo |    .124459   .4151263     0.30   0.764    -.6891736    .9380916
      aroa_1 |  -2.197934   1.087229    -2.02   0.043    -4.328864   -.0670045
       ari_1 |  -.1314868   .7020972    -0.19   0.851    -1.507572    1.244598
       asg_1 |   .0709553   .1916394     0.37   0.711     -.304651    .4465616
      alat_1 |   .2346929   .2187694     1.07   0.283    -.1940874    .6634731
       hhi_1 |    .000734   .0009118     0.81   0.421     -.001053     .002521
       _cons |  -3.985506   1.094861    -3.64   0.000    -6.131395   -1.839617
------------------------------------------------------------------------------

The significance of the overall model decreases in the order ind > exc > ar. Moreover, the independent variables that are significant also vary across the models.

I am following past papers in which the authors used GEE for other top management team executives: chief operating officers, chief strategy officers and chief marketing officers. In their analyses, the executive was present in at least 20% of firm-years. In my data, a cino is present (=1) in only 35 of the 500 firm-years, a mere 7%. Does the fact that a cino is present in only 7% of firm-years affect the choice of regression type I should be using?
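For reference, the prevalence figures above can be checked with something like the sketch below. It assumes the panel identifiers are id and fyear, as in the AR(1) output; ever_cino and firm_tag are hypothetical helper variable names introduced only for this illustration.

```stata
* Share of firm-years with a chief innovation officer
tabulate cino

* How many distinct firms ever have one (helper names are illustrative)
egen byte ever_cino = max(cino), by(id)
egen byte firm_tag  = tag(id)
tabulate ever_cino if firm_tag

* Year-to-year transitions between absence and presence within firms
xtset id fyear
xttrans cino
```

With only 35 positive firm-years, the number of firms that ever switch status may be small, which is worth knowing before comparing working correlation structures.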

Thank you in advance!
Mohsin


Steve Samuels

Join Date: Mar 2014

Posts: 1786
#4

26 Jul 2015, 18:31

Thank you for using the CODE delimiters. It makes your results very easy to use.

I see that you did not use the option vce(robust). Please repeat the analyses with it. Doing so will give valid standard errors no matter what the working correlation structure. Failure to do so will almost certainly give biased standard errors. I myself would favor the ar 1 working correlation model, as it is likely to reduce standard errors when used with vce(robust).
I notice that your model is quite simple, with no interactions and (apparently) no non-linear terms. What is the goal of the analysis?
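As a sketch of what repeating the analyses with vce(robust) might look like, using the variable list from the commands posted above and assuming the panel is declared with id and fyear (the stored-estimate names ind, exc and ar1 are arbitrary labels chosen here):

```stata
* Run as a do-file so the local macro is defined for every command
xtset id fyear
local xvars tmt ten oceo0 dceo coo aroa_1 ari_1 asg_1 alat_1 hhi_1

* Same mean model, three working correlation structures, robust standard errors
xtgee cino `xvars', family(binomial 1) link(logit) corr(independent) vce(robust) nolog
estimates store ind

xtgee cino `xvars', family(binomial 1) link(logit) corr(exchangeable) vce(robust) nolog
estimates store exc

xtgee cino `xvars', family(binomial 1) link(logit) corr(ar 1) vce(robust) nolog
estimates store ar1

* Coefficients and robust standard errors side by side
estimates table ind exc ar1, b(%9.4f) se(%9.4f)
```

Laying the three fits next to each other makes it easier to see whether coefficients change materially across structures or whether only the standard errors move.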

Steve Samuels
Statistical Consulting

Stata 14.2
Mohsin Khan

Join Date: Jul 2015

Posts: 66
#5

29 Jul 2015, 09:40

Thank you for your reply. I tried using vce(robust) and the significance improved a lot. And yes, you're right. It makes more sense to use ar1. I tried using uns but again, no convergence was achieved. As for the model, yes you're right - what I shared was quite basic. I didn't add the interaction terms yet. I was playing around with the models to understand how it works. Now I have a fair idea and will work on adding the different interactions. Thanks again!
Mohsin Khan

Join Date: Jul 2015

Posts: 66
#6

09 Aug 2015, 09:21

Hi Steven/Statalisters,

I have a follow-up question regarding the correlation structure. I went through

Cui, James. "QIC program and model selection in GEE analyses." Stata Journal 7.2 (2007): 209.

and

Hardin, James W. & Hilbe, Joseph M. Generalized Estimating Equations. Chapman and Hall/CRC, 2012.

in order to identify which correlation structure to use. According to these texts, the correlation structure that minimises the QIC should be used. What puzzles me is that, with the same data, when I switch from log of sales to log of employees as the proxy for firm size, the structure that minimises the QIC changes. With log of sales, it comes out as stationary of order 1:

Code:

qic cino asg_1 ten_1 coo_1 tmt_1 fyear dc_1 ari_1 hhi_1 oc0_1 lsale_1 td_1, family(binomial 1) link(logit) corr(sta1) robust nolog nodisplay

QIC and QIC_u
___________________________________________
Corr =        sta1
Family =      binomial 1
Link =        logit
p =           12
Trace =       24.885
QIC =         188.596
QIC_u =       162.827
___________________________________________

And using log of employees, it comes out to be autoregressive of order 1

Code:

qic cino asg_1 ten_1 coo_1 tmt_1 fyear dc_1 ari_1 hhi_1 oc0_1 lemp_1 td_1, family(binomial 1) link(logit) corr(ar1) robust nolog nodisplay

QIC and QIC_u
___________________________________________
Corr =        ar1
Family =      binomial 1
Link =        logit
p =           12
Trace =       24.246
QIC =         189.616
QIC_u =       165.125
___________________________________________

I have not posted the QIC for the other structures (ind, exc, etc.) in order to save space. Is there a reason why, for essentially the same data, the best-fitting correlation structure should change when I change just one variable? From my limited understanding, I thought that the correlation structure is a property of the data overall and not so dependent on one variable. But then again, I could be wrong. Can someone please shed some light on this?
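For completeness, the full comparison across structures can be run along the lines of the sketch below. It assumes the user-written qic command from Cui (2007) is installed (it can be located with search qic) and that it accepts the same corr() abbreviations used in the two commands above; ind and exc are assumed to work the same way.

```stata
* Run as a do-file so the local macro is defined inside the loop
local xvars asg_1 ten_1 coo_1 tmt_1 fyear dc_1 ari_1 hhi_1 oc0_1 lsale_1 td_1

* One QIC table per working correlation structure
foreach c in ind exc ar1 sta1 {
    qic cino `xvars', family(binomial 1) link(logit) corr(`c') robust nolog nodisplay
}
```

Replacing lsale_1 with lemp_1 in the macro and rerunning gives the second set of QIC values, so the differences between structures can be compared within each specification.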

Moreover, I read in the texts that GEE is robust to correlation structures, even if it is misspecified. Does that mean I should not be paying too much attention to the correlation structures?

Thanking you all in advance,
Mohsin
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#7

10 Aug 2015, 15:39

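Yes, GEE is robust to the chosen structure: provided the mean model is correctly specified and vce(robust) is used, the coefficient estimates and standard errors remain valid, and the choice of structure only affects the precision of the estimates.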

Steve Samuels
Statistical Consulting

Stata 14.2
Mohsin Khan

Join Date: Jul 2015

Posts: 66
#8

12 Aug 2015, 04:42

Thank you again, Steve. You have been very helpful! Do you know why the "best correlation structure according to qic" might change from sta1 to ar1 just by changing one variable? Shouldn't the "best correlation structure" be based on the overall data, rather than being influenced so much by one variable?

As always, sincerely appreciate your help!
Mohsin
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#9

12 Aug 2015, 17:41

Sorry, but I can't answer your latest question. I suggest that you ask it in a new thread. I'll just observe that you could put both variables into a model.

Steve Samuels
Statistical Consulting

Stata 14.2
Mohsin Khan

Join Date: Jul 2015

Posts: 66
#10

13 Aug 2015, 05:05

That's understandable. Thank you for your help so far.