Dear Statalisters,
I am learning about the problems that arise when conducting hypothesis tests on a sample with very few clusters (<30). So far, I have read Cameron/Gelbach/Miller's "Bootstrap-Based Improvements for Inference with Clustered Errors" (Review of Economics and Statistics 90, 414–427) [Working Paper here] as well as Cameron and Miller's "A Practitioner's Guide to Cluster-Robust Inference" (Journal of Human Resources 50, 317–370) [Preprint here]. From what I have learned, if the number of clusters is too small, one rejects the null hypothesis too often, and bootstrapping procedures (possibly with asymptotic refinement) can help to overcome this problem. In their simulations, a wild cluster bootstrap-t procedure works best, with rejection rates very close to the nominal 5%.
Now, I am concerned with transferring their ideas to the case of a binary response model (probit or logit). I started with a small example to analyze the effect of wage on the likelihood of living in the city.
Code:
version 13.1
webuse nlsw88, clear

//"Usual" clustered standard errors
probit c_city tenure wage ttl_exp collgrad if industry != ., cluster(industry)

//Pairs cluster bootstrap se
probit c_city tenure wage ttl_exp collgrad if industry != ., vce(boot, seed(10101) reps(499) cluster(industry))

//Pairs cluster bootstrap t ("z")
local theta = _b[wage]
local setheta = _se[wage]
bootstrap zstar=((_b[wage]-`theta')/_se[wage]), seed(10101) reps(499) cluster(industry) saving(percentilet, replace): ///
    probit c_city tenure wage ttl_exp collgrad if industry != ., cluster(industry)
preserve
use percentilet, clear
quietly count if abs(`theta'/`setheta') < abs(zstar)
display "p-value = " r(N)/_N
restore
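For reference, here is how I understand the wild cluster bootstrap-t of Cameron/Gelbach/Miller would look in the linear case, applied to a linear probability model version of my example. This is only my own sketch (Rademacher weights, null imposed on the restricted fit); the variable names xbr, ur, and tb1 are mine, and I may well have misunderstood a step:

//Sketch: wild cluster bootstrap-t for a linear probability model
version 13.1
webuse nlsw88, clear
sort industry

//Unrestricted model: cluster-robust t-statistic for wage
regress c_city tenure wage ttl_exp collgrad if industry != ., cluster(industry)
local t0 = _b[wage]/_se[wage]

//Restricted model with H0 (coefficient on wage = 0) imposed
quietly regress c_city tenure ttl_exp collgrad if industry != .
predict double xbr if e(sample), xb
predict double ur if e(sample), residuals

set seed 10101
local reps 499
matrix tstar = J(`reps', 1, .)
forvalues b = 1/`reps' {
    quietly {
        tempvar w ystar
        //One Rademacher draw (+1 or -1) per cluster
        by industry: gen double `w' = cond(runiform() < .5, 1, -1) if _n == 1
        by industry: replace `w' = `w'[1]
        //Re-generate the outcome from the restricted fit and flipped residuals
        gen double `ystar' = xbr + `w'*ur
        regress `ystar' tenure wage ttl_exp collgrad if industry != ., cluster(industry)
        matrix tstar[`b', 1] = _b[wage]/_se[wage]
        drop `w' `ystar'
    }
}
svmat double tstar, names(tb)
quietly count if abs(tb1) > abs(`t0') & tb1 < .
display "wild cluster bootstrap-t p-value = " r(N)/`reps'

If this sketch is roughly right, it is still linear by construction (it resamples residuals), which is exactly why I am unsure how to carry it over to probit/logit.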
Questions:
- The p-value for the pairs cluster bootstrap t (or "z") is smaller than for the pairs cluster bootstrap se. Of course, I do not know the true data-generating process, but I would nevertheless expect this p-value to be higher than for the pairs cluster bootstrap se (second model), in accordance with the Cameron/Gelbach/Miller paper. Does anyone have an idea why we see a different picture here?
- As I read in Cameron and Trivedi's "Microeconometrics Using Stata", the wild bootstrap procedure is for linear models only ("For linear regression, a wild bootstrap accommodates the more realistic assumptions that ..."). Is there a non-linear counterpart?
- In general, is there a "state-of-the-art" approach to handling the problem of few clusters when modelling a binary response? If so, what would the (bootstrap?) procedure look like?