  • Clustering of standard errors in highly unbalanced pooled cross-sectional data

    Dear Statalist,

    we ran into a problem that concerns the inclusion of (appropriate) clustered standard errors in a multilevel regression model.

    Data: Our data is unbalanced pooled cross-sectional data (i.e., not panel data). Individuals were surveyed over 10 years across 90 countries (total number of individual observations: ~1.0m). Not every country participated in each year. The respondents per country per year are randomly sampled. The table below illustrates our heterogeneous data. Our DV is binary. We have a rich set of controls at the individual level and the country level.



    Analysis: Because our individuals are nested in countries, we perform a multilevel logistic regression using the following command in Stata 17:

    melogit DV IV individual_level_controls country_level_controls year_dummies || country:

    We were asked to additionally include clustered standard errors, vce(cluster country). We did not include this option right away because we thought that the multilevel structure already accounts for the fact that observations within each country are not independent. Also, published studies in our field that use a similar setup sometimes include clustered standard errors and sometimes do not.

    Problem: If we add the vce(robust) option to our melogit ... || country: command, the significance of our IV changes drastically (from a p-value of 0.00 to 0.30-0.40).
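To make the comparison concrete, here is a minimal sketch on simulated data, not on the poster's data. The variable names (y, x, country) and all simulation settings are assumptions made for illustration. With a single grouping level, vce(robust) in melogit clusters at the highest level, so it should coincide with vce(cluster country).

```stata
* Simulated stand-in for the real data: 30 countries of unequal size
clear
set seed 12345
set obs 30
generate country = _n
generate u = rnormal(0, 0.5)                 // country random intercept
generate n_i = 20 + ceil(200*runiform())     // unequal country sample sizes
expand n_i
generate x = rnormal()
generate y = runiform() < invlogit(-0.5 + 0.3*x + u)

* Default model-based standard errors, vce(oim)
melogit y x || country:
estimates store m_oim

* Cluster-robust standard errors at the country level
melogit y x || country:, vce(cluster country)
estimates store m_cl

* Same coefficients in both columns; only the standard errors and p-values differ
estimates table m_oim m_cl, se p
```

In this correctly specified simulation the two sets of standard errors would be expected to be similar; a large gap on real data is what the thread is about.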

    Way forward: We are looking for suggestions on how to move forward. That is, should we include clustered standard errors or not? We also read the recent paper by MacKinnon et al. (2023) (https://www.sciencedirect.com/scienc...4407622000781), which discusses the issue and states that clustered standard errors are sometimes too conservative, especially if clusters are very heterogeneous. The paper suggests using a wild cluster bootstrap (implemented in Stata via boottest). However, the command does not work after melogit, and the paper seems to be written mainly with linear models in mind.

    We would be very happy about some recommendations on how to proceed.
    Last edited by Mirko Hirschmann; 09 Oct 2023, 05:04.

  • #2
    Your post's title suggests that you and your colleagues believe that the problem has something to do with the imbalance in the cross-sections. If you believe that, then how about balancing the cross-sections (you've got a million observations, plenty to spare) by taking randomly sampled subsets of the larger cross-sections? See whether more balanced cross-sections improve things, at least with respect to the discrepancy between the two methods of computing the coefficients' standard errors.
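One way to try the balancing suggestion is to draw a fixed number of observations from each country-year cell. The sketch below uses simulated data; the variable names (country, year, n_cy) and the cap of 40 per cell are assumptions for illustration, not values from the original post.

```stata
* Simulated unbalanced cross-sections: 20 countries, 5 survey years
clear
set seed 2023
set obs 20
generate country = _n
generate n_i = 200 + ceil(800*runiform())    // very unequal country sizes
expand n_i
generate year = 2010 + floor(5*runiform())

* Cell sizes before balancing
bysort country year: generate n_cy = _N
summarize n_cy

* Keep a random sample of at most 40 observations per country-year cell
* (cells with fewer than 40 observations are expected to be left as they are)
sample 40, count by(country year)

* Cell sizes after balancing
bysort country year: replace n_cy = _N
summarize n_cy
```

On the real data one would then refit the model under both vce() choices on the subsample and compare the standard errors again.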

    Nevertheless, a difference that dramatic between vce() options would imply to me that the model is grossly misspecified in some way. Have you seen this paper that discusses this phenomenon and has suggestions on how to proceed?


    • #3
      Thank you very much for your quick answer Joseph.
      Indeed, the difference remains when focusing only on countries with observations in every year. Do you have a suggestion on how to proceed to detect misspecification?
      The approach of the paper you recommended does not seem directly applicable to multilevel logistic regression, if I understood it correctly.


      • #4
        Originally posted by Mirko Hirschmann:
        ". . . the difference still remains when focusing only on countries with observations in every year."
        I understood that your imbalance is more in the numbers of observations between countries.

        "Do you have a suggestion on how to proceed to detect misspecifications?"
        In the experimental realm, which is what I'm most familiar with, misspecification is typically of the functional form of the relationship, but I gather that in your field it's more a problem of omitted variables. What do others in your field of study recommend in such cases?

        In the absence of any guidance from those quarters, perhaps start with what you're most confident are exogenous variables (the intercept, then random intercepts for countries, then intercepts for years) and build from there, adding first any other categorical predictors for which the mean is most likely to remain adequately specified, until you hit something that evokes the discrepancy.
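The build-up strategy described above could be sketched as a loop that adds one block of predictors at a time and reports both standard errors for the IV at each step. This is only an illustration on simulated data: the names y, iv, x1, x2 and the order of the blocks are assumptions, and on real data the blocks would be the actual control sets.

```stata
* Simulated data: 25 countries, 100 respondents each
clear
set seed 42
set obs 25
generate country = _n
generate u = rnormal(0, 0.5)                 // country random intercept
expand 100
generate year = 2010 + floor(4*runiform())
generate iv = rnormal()
generate x1 = rnormal()
generate x2 = rnormal()
generate y = runiform() < invlogit(-0.3 + 0.2*iv + 0.4*x1 + u)

* Add predictor blocks one at a time and compare the two standard errors
local rhs ""
foreach block in i.year x1 x2 {
    local rhs `rhs' `block'
    quietly melogit y iv `rhs' || country:
    scalar se_oim = _se[y:iv]                // model-based SE of the IV
    quietly melogit y iv `rhs' || country:, vce(cluster country)
    scalar se_cl = _se[y:iv]                 // cluster-robust SE of the IV
    display %-20s "`rhs'" %9.4f se_oim %9.4f se_cl %9.2f se_cl/se_oim
}
```

The last column is the ratio of cluster-robust to model-based standard error; the step at which it jumps would point to the block worth inspecting.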

        "The approach of the paper you recommended seems to be not directly suitable to multilevel logistic regressions if I got it right."
        The authors of the linked paper focus on the linear model, where heterogeneity and other potential characteristics of the residuals are prominent in the problem they are trying to address, but they do allude to the approach being applicable, at least in principle, to generalized linear and nonlinear models.
