I’m studying the effect of a school reform on school segregation. I compare the actual distribution of children in schools to different counterfactual scenarios which mirror how children would have been distributed over schools had reform not been implemented (or had only parts of it been). I use the commandhutchens to compute segregation indices for the actual and counterfactual scenarios, and I then estimate the effect of reform as the difference between these indices. I want to use bootstrapping to get confidence intervals for these differences.
Following this example for bootstrapping, I’ve written a small program that computes the indices per scenario, and subtracts one from the other. I then bootstrap this program and run it on my yearly data sets. My code is inserted below. My data is in long format, and it’s indicated with a variable, def, if the observation refers to the actual or any of the counterfactual groupings of individuals. As I have a total of four different groupings, the total number of observations in each yearly dataset is hence four times the number of individuals that year. The datasets are quite large (3.5 million individuals x 4 groupings per year).
I’m concerned by two things:
This is my code:
************************************************** ************************************************** ************************
*Specify program
program bshutch, rclass
version 17
hutchens schoolid segvar if def=="Actual"
local fakt2 = r(S)
hutchens schoolid segvar if def=="Counterf1"
local kontra1 = r(S)
hutchens schoolid segvar if def=="Counterf2"
local kontra2 = r(S)
hutchens schoolid segvar if def=="Counterf3"
local kontra3 = r(S)
return scalar diffAC = `fakt2'-`kontra1'
return scalar diffAB = `fakt2'-`kontra2'
return scalar diffBD = `kontra2'-`kontra3'
return scalar diffCD = `kontra1'-`kontra3'
end
*For each year, bootstrap the program
foreach i in 0 1 2 3 {
use policy`i'
bootstrap r(diffAC) r(diffAB) r(diffBD) r(diffCD), reps(1000) seed(4567): bshutch
}
************************************************** ************************************************** ************************
This is what the results look like when I try with 10 reps:
************************************************** ************************************************** ***********************
(running bshutch on estimation sample)
warning: bshutch does not set e(sample), so no observations will be excluded from the resampling because of missing values or other reasons. To exclude observations, press Break, save the data, drop any observations that are to be excluded, and rerun
bootstrap.
Bootstrap replications (10): .........10 done
Bootstrap results Number of obs = 14,401,317
Replications = 10
Command: bshutch2
_bs_1: r(diffAC)
_bs_2: r(diffAB)
_bs_3: r(diffBD)
_bs_4: r(diffCD)
------------------------------------------------------------------------------
| Observed Bootstrap Normal-based
| coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
_bs_1 | .0023558 .0002499 9.43 0.000 .0018661 .0028455
_bs_2 | .0009452 .0002201 4.29 0.000 .0005138 .0013766
_bs_3 | .0018915 .0001608 11.76 0.000 .0015763 .0022067
_bs_4 | .0004809 .0002004 2.40 0.016 .0000882 .0008737
------------------------------------------------------------------------------
************************************************** ************************************************** *****************
I'm really grateful for any help! I'm completely new to stata (but I've worked quite a bit in SAS so I'm not all new to coding et cetera).
Following this example for bootstrapping, I’ve written a small program that computes the indices per scenario, and subtracts one from the other. I then bootstrap this program and run it on my yearly data sets. My code is inserted below. My data is in long format, and it’s indicated with a variable, def, if the observation refers to the actual or any of the counterfactual groupings of individuals. As I have a total of four different groupings, the total number of observations in each yearly dataset is hence four times the number of individuals that year. The datasets are quite large (3.5 million individuals x 4 groupings per year).
I’m concerned by two things:
- It takes a very long time to run the bootstrapping
- The number of observations indicated in the results (14 million ish) equal the full number of observations in the dataset, i.e. the number of rows for all four scenarios. Each comparison should however not include all of those.
This is my code:
************************************************** ************************************************** ************************
*Specify program
program bshutch, rclass
version 17
hutchens schoolid segvar if def=="Actual"
local fakt2 = r(S)
hutchens schoolid segvar if def=="Counterf1"
local kontra1 = r(S)
hutchens schoolid segvar if def=="Counterf2"
local kontra2 = r(S)
hutchens schoolid segvar if def=="Counterf3"
local kontra3 = r(S)
return scalar diffAC = `fakt2'-`kontra1'
return scalar diffAB = `fakt2'-`kontra2'
return scalar diffBD = `kontra2'-`kontra3'
return scalar diffCD = `kontra1'-`kontra3'
end
*For each year, bootstrap the program
foreach i in 0 1 2 3 {
use policy`i'
bootstrap r(diffAC) r(diffAB) r(diffBD) r(diffCD), reps(1000) seed(4567): bshutch
}
************************************************** ************************************************** ************************
This is what the results look like when I try with 10 reps:
************************************************** ************************************************** ***********************
(running bshutch on estimation sample)
warning: bshutch does not set e(sample), so no observations will be excluded from the resampling because of missing values or other reasons. To exclude observations, press Break, save the data, drop any observations that are to be excluded, and rerun
bootstrap.
Bootstrap replications (10): .........10 done
Bootstrap results Number of obs = 14,401,317
Replications = 10
Command: bshutch2
_bs_1: r(diffAC)
_bs_2: r(diffAB)
_bs_3: r(diffBD)
_bs_4: r(diffCD)
------------------------------------------------------------------------------
| Observed Bootstrap Normal-based
| coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
_bs_1 | .0023558 .0002499 9.43 0.000 .0018661 .0028455
_bs_2 | .0009452 .0002201 4.29 0.000 .0005138 .0013766
_bs_3 | .0018915 .0001608 11.76 0.000 .0015763 .0022067
_bs_4 | .0004809 .0002004 2.40 0.016 .0000882 .0008737
------------------------------------------------------------------------------
************************************************** ************************************************** *****************
I'm really grateful for any help! I'm completely new to stata (but I've worked quite a bit in SAS so I'm not all new to coding et cetera).
Comment