
  • strange behaviour in logit with clustering of standard errors.

    In performing a logistic regression with 'logit', I am encountering very significant associations (low standard errors) when I introduce the cluster() option. Without cluster(), the associations are not significant at all. It does not matter which groups I cluster on; I compared clustering on 1) the actual clusters of interest, 2) the observation identifier, i.e. no real clusters, and 3) a random set of large clusters. All of these show highly significant values for some variables, while other variables seem to behave as expected. However, I now don't know which estimates I can trust to be accurate, and which I can't.

    Does anyone have an idea what could cause such low standard errors with the cluster() option, even though the choice of cluster variable seems to have no influence? Is there anything I could check?


  • #2
    I am including a dummy dta with an example: https://d.pr/free/f/jvBCTP

    Please note the difference in the interaction term between:

    Code:
     logit outcome i.binarypredictor c.continuouspredictor interaction 
    
    Iteration 0:   log likelihood = -1917.0634  
    Iteration 1:   log likelihood =  -1884.916  
    Iteration 2:   log likelihood = -1881.5105  
    Iteration 3:   log likelihood = -1881.4917  
    Iteration 4:   log likelihood = -1881.4917  
    
    Logistic regression                             Number of obs     =    196,880
                                                    LR chi2(3)        =      71.14
                                                    Prob > chi2       =     0.0000
    Log likelihood = -1881.4917                     Pseudo R2         =     0.0186
    
    -------------------------------------------------------------------------------------
                outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    --------------------+----------------------------------------------------------------
      1.binarypredictor |   -.621448   1.579259    -0.39   0.694    -3.716738    2.473842
    continuouspredictor |  -.5007834   .0590794    -8.48   0.000    -.6165769   -.3849899
            interaction |  -.7146849   1.087802    -0.66   0.511    -2.846737    1.417367
                  _cons |  -6.653412   .0651262  -102.16   0.000    -6.781057   -6.525768
    -------------------------------------------------------------------------------------
    and

    Code:
    logit outcome i.binarypredictor c.continuouspredictor interaction, cluster(dummycluster)
    
    Iteration 0:   log pseudolikelihood = -1917.0634  
    Iteration 1:   log pseudolikelihood =  -1884.916  
    Iteration 2:   log pseudolikelihood = -1881.5105  
    Iteration 3:   log pseudolikelihood = -1881.4917  
    Iteration 4:   log pseudolikelihood = -1881.4917  
    
    Logistic regression                             Number of obs     =    196,880
                                                    Wald chi2(3)      =     227.66
                                                    Prob > chi2       =     0.0000
    Log pseudolikelihood = -1881.4917               Pseudo R2         =     0.0186
    
                                (Std. Err. adjusted for 196,880 clusters in dummycluster)
    -------------------------------------------------------------------------------------
                        |               Robust
                outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    --------------------+----------------------------------------------------------------
      1.binarypredictor |   -.621448   1.001046    -0.62   0.535    -2.583462    1.340566
    continuouspredictor |  -.5007834   .0694457    -7.21   0.000    -.6368945   -.3646723
            interaction |  -.7146849   .1151569    -6.21   0.000    -.9403884   -.4889815
                  _cons |  -6.653412   .0657679  -101.17   0.000    -6.782315    -6.52451
    -------------------------------------------------------------------------------------
    Last edited by tafel plankje; 17 Apr 2019, 09:43.



    • #3
      First, you can generally do interactions with factor variable notation. This makes it easier to use margins.

      Second, clustering generally gives you robust standard errors by cluster. While this usually increases the standard errors, it can reduce them. Whether the variable you cluster on itself is significant in the estimation is irrelevant.
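      To make the mechanics concrete, here is a minimal pure-Python sketch (with made-up data, not the poster's) of the Liang-Zeger "sandwich" that underlies Stata's vce(cluster), computed for a simple OLS slope. Stata's logit applies the same sandwich logic to the score contributions, and Stata also applies a small-sample correction that is omitted here; the only point of the sketch is that the cluster-robust SE can come out larger or smaller than the classical one, depending on how the residuals correlate within clusters.

```python
# Classical vs. cluster-robust (Liang-Zeger sandwich) standard errors for a
# simple OLS fit, in pure Python. Hypothetical data; no small-sample
# correction (Stata multiplies the sandwich by a G/(G-1)-type factor).

def ols_with_cluster_se(x, y, cluster):
    """Fit y = b0 + b1*x; return (beta, classical slope SE, cluster slope SE)."""
    n = len(x)
    # Accumulate X'X and X'y for the two-column design [1, x].
    sxx = [[0.0, 0.0], [0.0, 0.0]]
    sxy = [0.0, 0.0]
    for xi, yi in zip(x, y):
        row = (1.0, xi)
        for i in range(2):
            sxy[i] += row[i] * yi
            for j in range(2):
                sxx[i][j] += row[i] * row[j]
    det = sxx[0][0] * sxx[1][1] - sxx[0][1] * sxx[1][0]
    inv = [[ sxx[1][1] / det, -sxx[0][1] / det],
           [-sxx[1][0] / det,  sxx[0][0] / det]]
    beta = [inv[i][0] * sxy[0] + inv[i][1] * sxy[1] for i in range(2)]
    resid = [yi - beta[0] - beta[1] * xi for xi, yi in zip(x, y)]

    # Classical: Var(beta) = s^2 (X'X)^{-1}.
    s2 = sum(r * r for r in resid) / (n - 2)
    se_classical = (s2 * inv[1][1]) ** 0.5

    # Cluster-robust "meat": sum over clusters g of (X_g'u_g)(X_g'u_g)'.
    scores = {}
    for xi, ri, g in zip(x, resid, cluster):
        s = scores.setdefault(g, [0.0, 0.0])
        s[0] += ri
        s[1] += xi * ri
    meat = [[0.0, 0.0], [0.0, 0.0]]
    for s in scores.values():
        for i in range(2):
            for j in range(2):
                meat[i][j] += s[i] * s[j]
    # Sandwich: (X'X)^{-1} meat (X'X)^{-1}; take the slope entry [1][1].
    v = sum(inv[1][i] * meat[i][j] * inv[j][1]
            for i in range(2) for j in range(2))
    return beta, se_classical, v ** 0.5

x = [1, 2, 3, 4, 5, 6]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]
beta, se_cl, se_cr = ols_with_cluster_se(x, y, ["a", "a", "b", "b", "c", "c"])
```

      Whether se_cr ends up above or below se_cl depends entirely on the within-cluster correlation of the residuals, which is why the significance of the cluster variable itself in the regression is irrelevant.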



      • #4
        Originally posted by Phil Bromiley View Post
        Second, clustering generally gives you robust standard errors by cluster. While this usually increases the standard errors, it can reduce them. Whether the variable you cluster on itself is significant in the estimation is irrelevant.
        Thanks very much.

        I understand cluster() can reduce standard errors.
        In my case, however, the significance of the determinants ('interaction' and 'continuouspredictor') is not influenced by which clustering variable goes into the cluster() option. No matter what variable I put there, the standard errors are consistently very small, even when the clustering variable is the observation identifier and thus defines no real clusters. Please see the dataset I posted above. Is that normal behaviour?

        Thanks again.



        • #5
          Would any expert be able to comment on the dataset posted in the second reply?
          In short, the problem is that the cluster() option greatly inflates the signal, regardless of which variable is used in cluster() (try clustering on the identifier, for example).
          It looks like a bug in the regression; perhaps cluster() is not appropriate for a dataset of this size? Any expert opinion on this particular example would be great.

          Many thanks in advance,
          Best, Niek



          • #6
            Originally posted by tafel plankje View Post
            However, I now don't know which estimates I can trust to be accurate, and which I can't.
            No expert, but I have a couple of observations.

            First, you have 196,880 observations and 196,880 clusters. I don't know what you're trying to do, but I assume you know that that's not how the vce(cluster varname) option is meant to be used.
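            A toy check (with made-up numbers) of what this implies: when every observation is its own cluster, the cluster-robust "meat" sum over clusters collapses term by term to the plain heteroskedasticity-robust (HC0 / White) sum. So clustering on the identifier does not mean "no adjustment"; it is the unclustered robust estimator, which is why it still differs from the classical output.

```python
# Singleton clusters vs. HC0, for a single regressor and hypothetical
# residuals. The two "meat" sums are identical term by term.

x = [1.0, 2.0, 3.0, 4.0, 5.0]        # a single regressor, for simplicity
u = [0.5, -1.2, 0.7, -0.3, 0.9]      # made-up residuals

# HC0 meat: sum_i x_i^2 u_i^2
hc0 = sum(xi * xi * ui * ui for xi, ui in zip(x, u))

# Cluster meat with singleton clusters (cluster g = observation i):
# sum_g (x_g u_g)^2 -- the same quantity.
singleton = sum((xi * ui) ** 2 for xi, ui in zip(x, u))

assert abs(hc0 - singleton) < 1e-12
```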

            Second, your dataset is too sparse for two main effects and their interaction:

            . tabulate binarypredictor outcome if e(sample)

            binarypred |        outcome
                 ictor |         0          1 |     Total
            -----------+----------------------+----------
                     0 |   195,915        249 |   196,164
                     1 |       715          1 |       716
            -----------+----------------------+----------
                 Total |   196,630        250 |   196,880


            So, I would hesitate to trust either.

