
  • Bootstrapping the difference between results from two commands - number of obs appears too big

I’m studying the effect of a school reform on school segregation. I compare the actual distribution of children across schools to different counterfactual scenarios that mirror how children would have been distributed across schools had the reform not been implemented (or had only parts of it been). I use the command -hutchens- to compute segregation indices for the actual and counterfactual scenarios, and I then estimate the effect of the reform as the difference between these indices. I want to use bootstrapping to get confidence intervals for these differences.

Following this example for bootstrapping, I’ve written a small program that computes the indices per scenario and subtracts one from the other. I then bootstrap this program, running it on each of my yearly data sets; my code is below. My data are in long format, and a variable, def, indicates whether an observation belongs to the actual grouping or to one of the counterfactual groupings of individuals. As I have a total of four different groupings, the total number of observations in each yearly dataset is four times the number of individuals that year. The datasets are quite large (3.5 million individuals x 4 groupings per year).
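For concreteness, here is a toy sketch of that layout (childid is a hypothetical identifier; schoolid, segvar, and def are the variable names used in my code below). Each child contributes one row per grouping, and only schoolid changes across scenarios:

clear
input long childid long schoolid byte segvar str9 def
1 101 1 "Actual"
1 205 1 "Counterf1"
1 101 1 "Counterf2"
1 117 1 "Counterf3"
end
list, sepby(childid)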

I’m concerned by two things:
1. The bootstrapping takes a very long time to run.
2. The number of observations shown in the results (roughly 14 million) equals the full number of observations in the dataset, i.e. the number of rows across all four scenarios combined. Each comparison, however, should not involve all of those rows.
Am I missing something here? It feels as if the bootstrapping uses the full data and hence ignores the if-specifications. The computed differences diffAC, diffAB, et cetera appear correct as such (they match what I get when I leave out the bootstrapping and simply compare the yearly index values directly), but I’m worried that the bootstrap confidence limits are artificially narrow because of some misspecification in the code. Is there a problem? If so, what might it be, and how do I fix it?
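As a quick sanity check of the row counts (using the yearly file names from my code below), the total can be compared with the per-scenario counts:

use policy0, clear
count      // total rows: 4 x number of individuals that year
tab def    // rows per scenario, i.e. what each -hutchens- call should use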

    This is my code:

****************************************************************************

*Specify program

program define bshutch, rclass
    version 17

    * Segregation index for the actual distribution of children
    hutchens schoolid segvar if def=="Actual"
    local fakt2 = r(S)

    * Segregation indices for the three counterfactual distributions
    hutchens schoolid segvar if def=="Counterf1"
    local kontra1 = r(S)

    hutchens schoolid segvar if def=="Counterf2"
    local kontra2 = r(S)

    hutchens schoolid segvar if def=="Counterf3"
    local kontra3 = r(S)

    * Reform effects: differences between pairs of indices
    return scalar diffAC = `fakt2' - `kontra1'
    return scalar diffAB = `fakt2' - `kontra2'
    return scalar diffBD = `kontra2' - `kontra3'
    return scalar diffCD = `kontra1' - `kontra3'

end

*For each year, bootstrap the program

foreach i in 0 1 2 3 {
    use policy`i', clear
    bootstrap r(diffAC) r(diffAB) r(diffBD) r(diffCD), reps(1000) seed(4567): bshutch
}

****************************************************************************

    This is what the results look like when I try with 10 reps:

****************************************************************************

    (running bshutch on estimation sample)

warning: bshutch does not set e(sample), so no observations will be excluded from the resampling because of missing values or other reasons. To exclude observations, press Break, save the data, drop any observations that are to be excluded, and rerun bootstrap.

    Bootstrap replications (10): .........10 done

Bootstrap results                        Number of obs = 14,401,317
                                         Replications  =         10

Command: bshutch
    _bs_1: r(diffAC)
    _bs_2: r(diffAB)
    _bs_3: r(diffBD)
    _bs_4: r(diffCD)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _bs_1 |   .0023558   .0002499     9.43   0.000     .0018661    .0028455
       _bs_2 |   .0009452   .0002201     4.29   0.000     .0005138    .0013766
       _bs_3 |   .0018915   .0001608    11.76   0.000     .0015763    .0022067
       _bs_4 |   .0004809   .0002004     2.40   0.016     .0000882    .0008737
------------------------------------------------------------------------------

****************************************************************************

I'm really grateful for any help! I'm completely new to Stata (but I've worked quite a bit in SAS, so I'm not entirely new to coding et cetera).
    Last edited by Sofia Storm; 28 Nov 2023, 15:07.

  • #2
(I see that the -hutchens- command is available from SSC; see -ssc describe hutchens-.)

By default, -bootstrap- draws samples of the same size as the data in memory. The reported "... = 14,401,317" would be correct; -bootstrap- is reporting the sample size of the data it received, not the sample size after each of your conditions is applied. I would also expect this to take a long time, as 1,000 resamples of size 14e6 is a large task.
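You can see this behavior with a built-in dataset; the reported number of obs is the full _N even though the computation inside the program is restricted by an -if- (a minimal sketch, not your data or -hutchens-):

sysuse auto, clear
program define mymean, rclass
    summarize price if foreign == 1
    return scalar m = r(mean)
end
bootstrap r(m), reps(10) seed(1) nodots: mymean
* The header reports "Number of obs = 74" (the full auto dataset),
* even though only the 22 foreign cars enter the mean.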

I'm not sure why you think your "if" conditions are not working. One (clumsy) way to check this would be to set -reps- to some reasonable number and use the -noisily- option of -bootstrap-. You could also insert a few -display- commands in your program, such as -display "N = " r(Nobs)- right after your -hutchens- commands. (I haven't used -hutchens- before, so I'm just working from its -help- file; it may display such information anyway.) A sketch of such a check is below.
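For example, using only built-in commands so the check does not depend on what -hutchens- returns (-bshutch_chk- is just a throwaway one-scenario variant of your program):

program define bshutch_chk, rclass
    version 17
    count if def == "Actual"                  // rows the -if- condition selects
    display "Actual rows in this replicate: " r(N)
    hutchens schoolid segvar if def == "Actual"
    return scalar fakt = r(S)
end
bootstrap r(fakt), reps(2) noisily: bshutch_chk

If the displayed count is around one quarter of the 14 million total (varying a bit across replicates), the -if- conditions are being applied; the resampling is simply being done over all four scenarios' rows at once.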
