  • Bootstrapping in Binary Response Data with Few Clusters

    Dear Statalisters,

    I am learning about the problems that arise when conducting hypothesis tests on a sample with very few clusters (<30). So far, I have read Cameron/Gelbach/Miller, "Bootstrap-Based Improvements for Inference with Clustered Errors" (Review of Economics and Statistics 90, 414–427) [Working Paper here], as well as Cameron and Miller's "A Practitioner’s Guide to Cluster-Robust Inference" (Journal of Human Resources 50, 317–370) [Preprint here]. From what I have learned, if the number of clusters is too small, one rejects the null hypothesis too often, and bootstrap procedures (possibly with asymptotic refinement) can help to overcome this problem. In their simulations, a wild cluster bootstrap-t procedure works best, with rejection rates very close to the nominal 5%.

    Now, I am concerned with transferring their ideas to the case of a binary response model (probit or logit). I started with a mini example that analyzes the effect of wage on the likelihood of living in the city.
    Code:
    version 13.1
    
    webuse nlsw88, clear
    
    //"Usual" clustered standard errors
    probit c_city tenure wage ttl_exp collgrad if industry != ., cluster(industry)
    
    //Pairs cluster bootstrap se
    probit c_city tenure wage ttl_exp collgrad if industry != ., vce(boot, seed(10101) reps(499) cluster(industry))
    
    //Pairs cluster bootstrap t ("z")
    local theta = _b[wage]
    local setheta = _se[wage]
    bootstrap zstar=((_b[wage]-`theta')/_se[wage]), seed(10101) reps(499) cluster(industry) saving(percentilet, replace): probit c_city tenure wage ttl_exp collgrad if industry != ., cluster(industry)
    preserve
    use percentilet, clear
    quietly count if abs(`theta'/`setheta')<abs(zstar)
    display "p-value = " r(N)/_N
    restore
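
    For reference, the p-value that the last block of code computes can be written out as follows. This is only a restatement of the code above; the symbols (B for the number of saved bootstrap replications, a star for quantities from a bootstrap sample) are notation introduced here, not taken from the cited papers.

    ```latex
    \hat z = \frac{\hat\theta}{\operatorname{se}(\hat\theta)}, \qquad
    z^*_b = \frac{\hat\theta^*_b - \hat\theta}{\operatorname{se}(\hat\theta^*_b)}, \qquad
    \hat p = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\left\{ |z^*_b| > |\hat z| \right\}
    ```

    Here \hat\theta and se(\hat\theta) correspond to the locals `theta' and `setheta', z^*_b is the saved variable zstar, and B is _N in the saved dataset (499 if every replication succeeds).
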
    Questions:
    1. The p-value for the pairs cluster bootstrap t (or "z") is smaller than for the pairs cluster bootstrap se. Of course, I do not know the true data generating process, but in accordance with the Cameron/Gelbach/Miller paper I would nevertheless expect this p-value to be higher than for the pairs cluster bootstrap se (second model). Does anyone have an idea why we see a different picture here?
    2. As I read in Cameron/Trivedi's "Microeconometrics Using Stata", the wild bootstrap procedure is for linear models only ("For linear regression, a wild bootstrap accommodates the more realistic assumptions that ..."). Is there a non-linear counterpart?
    3. In general, is there a "state-of-the-art" approach to handling the problem of few clusters when modelling a binary response? If so, what would the (bootstrap?) procedure look like?
    Any help is highly appreciated.

  • #2
    I cross-posted this question on StackExchange where I also added a simulation exercise. Comments are still welcome.

    • #3
      Hi Roberto,
      What has been your conclusion and how have you dealt with it?
      Best
