
  • strange behaviour in logit with clustering of standard errors.

    In performing a logistic regression with 'logit', I am encountering very significant associations (low standard errors) when I introduce the cluster() option. Without cluster(), the associations are not significant at all. It does not matter which groups I cluster on; I compared clustering on 1) the actual clusters of interest, 2) the observation identifier, i.e. no real clusters, and 3) a random set of large clusters. All of these show highly significant values for some variables, while other variables seem to behave as expected. However, I now don't know which estimates I can trust to be accurate, and which I can't.

    Does anyone have an idea what could cause such low standard errors with the cluster() option, even though the choice of cluster variable seems to have no influence? Is there anything I could check?


  • #2
    I am including a dummy dta with an example: https://d.pr/free/f/jvBCTP

    Please note the difference in the interaction term between:

    Code:
     logit outcome i.binarypredictor c.continuouspredictor interaction 
    
    Iteration 0:   log likelihood = -1917.0634  
    Iteration 1:   log likelihood =  -1884.916  
    Iteration 2:   log likelihood = -1881.5105  
    Iteration 3:   log likelihood = -1881.4917  
    Iteration 4:   log likelihood = -1881.4917  
    
    Logistic regression                             Number of obs     =    196,880
                                                    LR chi2(3)        =      71.14
                                                    Prob > chi2       =     0.0000
    Log likelihood = -1881.4917                     Pseudo R2         =     0.0186
    
    -------------------------------------------------------------------------------------
                outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    --------------------+----------------------------------------------------------------
      1.binarypredictor |   -.621448   1.579259    -0.39   0.694    -3.716738    2.473842
    continuouspredictor |  -.5007834   .0590794    -8.48   0.000    -.6165769   -.3849899
            interaction |  -.7146849   1.087802    -0.66   0.511    -2.846737    1.417367
                  _cons |  -6.653412   .0651262  -102.16   0.000    -6.781057   -6.525768
    -------------------------------------------------------------------------------------
    and

    Code:
    logit outcome i.binarypredictor c.continuouspredictor interaction, cluster(dummycluster)
    
    Iteration 0:   log pseudolikelihood = -1917.0634  
    Iteration 1:   log pseudolikelihood =  -1884.916  
    Iteration 2:   log pseudolikelihood = -1881.5105  
    Iteration 3:   log pseudolikelihood = -1881.4917  
    Iteration 4:   log pseudolikelihood = -1881.4917  
    
    Logistic regression                             Number of obs     =    196,880
                                                    Wald chi2(3)      =     227.66
                                                    Prob > chi2       =     0.0000
    Log pseudolikelihood = -1881.4917               Pseudo R2         =     0.0186
    
                                (Std. Err. adjusted for 196,880 clusters in dummycluster)
    -------------------------------------------------------------------------------------
                        |               Robust
                outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    --------------------+----------------------------------------------------------------
      1.binarypredictor |   -.621448   1.001046    -0.62   0.535    -2.583462    1.340566
    continuouspredictor |  -.5007834   .0694457    -7.21   0.000    -.6368945   -.3646723
            interaction |  -.7146849   .1151569    -6.21   0.000    -.9403884   -.4889815
                  _cons |  -6.653412   .0657679  -101.17   0.000    -6.782315    -6.52451
    -------------------------------------------------------------------------------------
    Last edited by tafel plankje; 17 Apr 2019, 09:43.



    • #3
      First, you can generally do interactions with factor variable notation. This makes it easier to use margins.

      Second, clustering generally gives you robust standard errors by cluster. While this usually increases the standard errors, it can reduce them. Whether the variable you cluster on itself is significant in the estimation is irrelevant.
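      To make the mechanics concrete, here is a minimal pure-Python sketch (with made-up data, not the poster's) of the Liang-Zeger "sandwich" that underlies Stata's vce(cluster), computed for a simple OLS slope. Stata's logit applies the same sandwich logic to the score contributions, and Stata also applies a small-sample correction that is omitted here; the only point of the sketch is that the cluster-robust SE can come out larger or smaller than the classical one, depending on how the residuals correlate within clusters.

```python
# Classical vs. cluster-robust (Liang-Zeger sandwich) standard errors for a
# simple OLS fit, in pure Python. Hypothetical data; no small-sample
# correction (Stata multiplies the sandwich by a G/(G-1)-type factor).

def ols_with_cluster_se(x, y, cluster):
    """Fit y = b0 + b1*x; return (beta, classical slope SE, cluster slope SE)."""
    n = len(x)
    # Accumulate X'X and X'y for the two-column design [1, x].
    sxx = [[0.0, 0.0], [0.0, 0.0]]
    sxy = [0.0, 0.0]
    for xi, yi in zip(x, y):
        row = (1.0, xi)
        for i in range(2):
            sxy[i] += row[i] * yi
            for j in range(2):
                sxx[i][j] += row[i] * row[j]
    det = sxx[0][0] * sxx[1][1] - sxx[0][1] * sxx[1][0]
    inv = [[ sxx[1][1] / det, -sxx[0][1] / det],
           [-sxx[1][0] / det,  sxx[0][0] / det]]
    beta = [inv[i][0] * sxy[0] + inv[i][1] * sxy[1] for i in range(2)]
    resid = [yi - beta[0] - beta[1] * xi for xi, yi in zip(x, y)]

    # Classical: Var(beta) = s^2 (X'X)^{-1}.
    s2 = sum(r * r for r in resid) / (n - 2)
    se_classical = (s2 * inv[1][1]) ** 0.5

    # Cluster-robust "meat": sum over clusters g of (X_g'u_g)(X_g'u_g)'.
    scores = {}
    for xi, ri, g in zip(x, resid, cluster):
        s = scores.setdefault(g, [0.0, 0.0])
        s[0] += ri
        s[1] += xi * ri
    meat = [[0.0, 0.0], [0.0, 0.0]]
    for s in scores.values():
        for i in range(2):
            for j in range(2):
                meat[i][j] += s[i] * s[j]
    # Sandwich: (X'X)^{-1} meat (X'X)^{-1}; take the slope entry [1][1].
    v = sum(inv[1][i] * meat[i][j] * inv[j][1]
            for i in range(2) for j in range(2))
    return beta, se_classical, v ** 0.5

x = [1, 2, 3, 4, 5, 6]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]
beta, se_cl, se_cr = ols_with_cluster_se(x, y, ["a", "a", "b", "b", "c", "c"])
```

      Whether se_cr ends up above or below se_cl depends entirely on the within-cluster correlation of the residuals, which is why the significance of the cluster variable itself in the regression is irrelevant.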



      • #4
        Originally posted by Phil Bromiley View Post
        Second, clustering generally gives you robust standard errors by cluster. While this usually increases the standard errors, it can reduce them. Whether the variable you cluster on itself is significant in the estimation is irrelevant.
        Thanks very much.

        I understand cluster() can reduce standard errors.
        In my case, however, the significance of the determinants ('interaction' and 'continuouspredictor') is not influenced by which clustering variable goes into the cluster() option. No matter what variable I put there, the standard errors are consistently very small, even when the clustering variable is the observation identifier and thus defines no real clusters. Please see the dataset I posted above. Is that normal behaviour?

        Thanks again.



        • #5
          Would any expert be able to comment on the dataset posted in the second reply?
          In short, the problem is that the cluster() option greatly inflates the signal, regardless of which variable is used in cluster() (try clustering on the identifier, for example).
          It looks like a bug in the regression; perhaps cluster() is not appropriate for a dataset of this size? Any expert opinion on this particular example would be great.

          Many thanks in advance,
          Best, Niek



          • #6
            Originally posted by tafel plankje View Post
            However, I now don't know which estimates I can trust to be accurate, and which I can't.
            No expert, but I have a couple of observations.

            First, you have 196,880 observations and 196,880 clusters. I don't know what you're trying to do, but I assume you know that that's not how the vce(cluster varname) option is meant to be used.
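            A toy check (with made-up numbers) of what this implies: when every observation is its own cluster, the cluster-robust "meat" sum over clusters collapses term by term to the plain heteroskedasticity-robust (HC0 / White) sum. So clustering on the identifier does not mean "no adjustment"; it is the unclustered robust estimator, which is why it still differs from the classical output.

```python
# Singleton clusters vs. HC0, for a single regressor and hypothetical
# residuals. The two "meat" sums are identical term by term.

x = [1.0, 2.0, 3.0, 4.0, 5.0]        # a single regressor, for simplicity
u = [0.5, -1.2, 0.7, -0.3, 0.9]      # made-up residuals

# HC0 meat: sum_i x_i^2 u_i^2
hc0 = sum(xi * xi * ui * ui for xi, ui in zip(x, u))

# Cluster meat with singleton clusters (cluster g = observation i):
# sum_g (x_g u_g)^2 -- the same quantity.
singleton = sum((xi * ui) ** 2 for xi, ui in zip(x, u))

assert abs(hc0 - singleton) < 1e-12
```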

            Second, your dataset is too sparse for two main effects and their interaction:

            . tabulate binarypredictor outcome if e(sample)

            binarypred |        outcome
                 ictor |         0          1 |     Total
            -----------+----------------------+----------
                     0 |   195,915        249 |   196,164
                     1 |       715          1 |       716
            -----------+----------------------+----------
                 Total |   196,630        250 |   196,880


            So, I would hesitate to trust either.

