In performing a logistic regression with 'logit', I am encountering very significant associations (low standard errors) when I introduce the cluster() option. Without cluster() the associations are not significant at all. It does not matter what kind of groups are used to cluster on; I compared clustering on 1) the actual clusters of interest, 2) the identifier, i.e. no clusters, and 3) a random set of large clusters. All of these show highly significant values for some variables, while other variables seem to behave fine. However, I now don't know which estimates I can trust to be accurate and which I can't.
Does someone have an idea what could cause very low standard errors with the cluster() option, even though the actual cluster variable doesn't seem to have any influence? Is there anything I could check?
Strange behaviour in logit with clustering of standard errors

Originally posted by tafel plankje: "However, I now don't know which estimates I can trust to be accurate, and which I can't."
First, you have 196,880 observations and 196,880 clusters. I don't know what you're trying to do, but I assume you know that that's not how you're supposed to use the vce(cluster varname) option.
Second, your dataset is too sparse for two main effects and their interaction:
. tabulate binarypredictor outcome if e(sample)

binarypred |        outcome
     ictor |         0          1 |     Total
-----------+----------------------+----------
         0 |   195,915        249 |   196,164
         1 |       715          1 |       716
-----------+----------------------+----------
     Total |   196,630        250 |   196,880
So, I would hesitate to trust either.

Would any expert be able to comment on the dataset posted in the second reply?
In short, the problem is that the cluster() option greatly inflates the significance, and this does not depend on the variable used in cluster() (try clustering on the identifier, for example).
It looks like a bug in the regression; perhaps cluster() is not appropriate for a dataset of this size? Any expert opinion on this particular example would be great.
Many thanks in advance,
Best, Niek

Originally posted by Phil Bromiley: "Second, clustering generally gives you robust standard errors by cluster. While this usually increases the standard errors, it can reduce them. Whether the variable you cluster on itself is significant in the estimation is irrelevant."
I understand cluster() can reduce standard errors.
In my case, though, the significance of the determinants ('interaction' and 'continuouspredictor') is not influenced by the clustering variable passed to cluster(). It doesn't seem to matter what variable is put in there: the standard errors are consistently very small, even with a clustering variable that essentially doesn't cluster at all, i.e. the identifier itself. Please see the dataset I posted above. Is that normal behaviour?
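As a sanity check on that last point, here is a sketch (using the variable names from my dataset, and assuming an observation identifier called "id"): clustering on the identifier, where every cluster is a single observation, should reproduce the plain robust (sandwich) standard errors, so the two runs can be compared side by side:

Code:
* Plain robust standard errors
logit outcome i.binarypredictor c.continuouspredictor interaction, vce(robust)
estimates store robust_se

* "Clustering" on the observation identifier (singleton clusters)
logit outcome i.binarypredictor c.continuouspredictor interaction, vce(cluster id)
estimates store cluster_id_se

* If the two columns of standard errors match, the change from the
* unclustered run is just the robust correction, not the clustering
estimates table robust_se cluster_id_se, se

If those two columns agree, then what cluster() is doing here is simply applying the robust variance estimator, regardless of the grouping variable.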
Thanks again.

First, you can generally do interactions with factor variable notation. This makes it easier to use margins.
Second, clustering generally gives you robust standard errors by cluster. While this usually increases the standard errors, it can reduce them. Whether the variable you cluster on itself is significant in the estimation is irrelevant.
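To illustrate the first point, a sketch using the variable names from the posted dataset: with factor-variable notation you let Stata build the interaction rather than constructing an 'interaction' variable by hand, and margins then knows how the terms relate.

Code:
* "##" includes both main effects and their interaction
logit outcome i.binarypredictor##c.continuouspredictor, vce(cluster dummycluster)

* Average marginal effect of the continuous predictor
* at each level of the binary predictor
margins binarypredictor, dydx(continuouspredictor)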

I am including a dummy dta with an example: https://d.pr/free/f/jvBCTP
Please notice the difference in the interaction term between the two runs:
Code:
. logit outcome i.binarypredictor c.continuouspredictor interaction

Iteration 0:   log likelihood = -1917.0634
Iteration 1:   log likelihood =  -1884.916
Iteration 2:   log likelihood = -1881.5105
Iteration 3:   log likelihood = -1881.4917
Iteration 4:   log likelihood = -1881.4917

Logistic regression                             Number of obs     =    196,880
                                                LR chi2(3)        =      71.14
                                                Prob > chi2       =     0.0000
Log likelihood = -1881.4917                     Pseudo R2         =     0.0186

-------------------------------------------------------------------------------------
            outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------------+----------------------------------------------------------------
  1.binarypredictor |   -.621448   1.579259    -0.39   0.694    -3.716738    2.473842
continuouspredictor |  -.5007834   .0590794    -8.48   0.000    -.6165769   -.3849899
        interaction |  -.7146849   1.087802    -0.66   0.511    -2.846737    1.417367
              _cons |  -6.653412   .0651262  -102.16   0.000    -6.781057   -6.525768
-------------------------------------------------------------------------------------
Code:
. logit outcome i.binarypredictor c.continuouspredictor interaction, cluster(dummycluster)

Iteration 0:   log pseudolikelihood = -1917.0634
Iteration 1:   log pseudolikelihood =  -1884.916
Iteration 2:   log pseudolikelihood = -1881.5105
Iteration 3:   log pseudolikelihood = -1881.4917
Iteration 4:   log pseudolikelihood = -1881.4917

Logistic regression                             Number of obs     =    196,880
                                                Wald chi2(3)      =     227.66
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -1881.4917               Pseudo R2         =     0.0186

                      (Std. Err. adjusted for 196,880 clusters in dummycluster)
-------------------------------------------------------------------------------------
                    |               Robust
            outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------------+----------------------------------------------------------------
  1.binarypredictor |   -.621448   1.001046    -0.62   0.535    -2.583462    1.340566
continuouspredictor |  -.5007834   .0694457    -7.21   0.000    -.6368945   -.3646723
        interaction |  -.7146849   .1151569    -6.21   0.000    -.9403884   -.4889815
              _cons |  -6.653412   .0657679  -101.17   0.000    -6.782315    -6.52451
-------------------------------------------------------------------------------------
Last edited by tafel plankje; 17 Apr 2019, 09:43.