Random sampling multiple times from a large data set, collapse at the country year level, and run regression

Nusrat Jimi

Join Date: Mar 2018

Posts: 10
#1


01 Jul 2022, 06:05

I have a large dataset that contains information on 160,000 enterprises from 43 countries over 15 years. The dataset is unique at the enterprise level.

I want to do the following 1000 times and store the regression coefficients and standard errors:
Randomly draw a certain fraction (say 60 percent) of the data

Collapse the dataset at the country-year level

Run a regression of Y on X (variable X is a categorical variable that goes from 1 to 4)

Following is my example code:

use combined_data.dta,clear

tempname buffer
capture postutil clear
postfile `bootstrap' observation intercept se_constant X1 se_X1 X2 se_X2 X3 se_X3 using buffer.dta, replace

quietly {

forvalues i = 1(1)1000 {

bsample round(0.8*_N), strata(country year X)

collapse (mean) Y, by (country year X)

regress Y i.X

post `bootstrap' (`i') (`=_b[_cons]') (`=_se[_cons]') (`=_b[X1]') (`=_se[X1]') (`=_b[X2]') (`=_se[X2]') (`=_b[X3]') (`=_se[X3]')
}

}

postclose `buffer'
use buffer.dta, clear

However, it gives me the same coefficients and standard errors 1000 times. What am I doing wrong?
Any help would be much appreciated.

Last edited by Nusrat Jimi; 01 Jul 2022, 06:08.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

01 Jul 2022, 09:03

Perhaps this will start you in a useful direction to finding out why your program did not work.

Some general principles:

1) If your code isn't known to work, don't run it 1000 times. I reduced the loop to 5 iterations; change the 5 back to 1000 once the code works.
2) If your code isn't known to work, don't run it "quietly", because then you can't see whether the intermediate results are what you expect. I changed quietly to noisily; change it back once the code works.
3) bsample and collapse both replace the data in memory, so you need to preserve your original data at the top of the loop and restore it at the bottom (there are other approaches to preserve and restore, but this is what I chose).
4) If you want a 60% sample, multiply _N by .6, not .8.

Code:

tempname bootstrap
capture postutil clear
postfile `bootstrap' observation intercept se_constant X1 se_X1 X2 se_X2 X3 se_X3 using buffer.dta, replace

noisily {
    forvalues i = 1(1)5 {
        preserve
        bsample round(0.6*_N), strata(country year X)
        collapse (mean) Y, by (country year X)
        regress Y i.X
        post `bootstrap' (`i') (`=_b[_cons]') (`=_se[_cons]') (`=_b[X1]') (`=_se[X1]') (`=_b[X2]') (`=_se[X2]') (`=_b[X3]') (`=_se[X3]')
        restore
    }
}

postclose `bootstrap'
use buffer.dta, clear
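
For anyone who wants to try the preserve/restore pattern without the original data, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the toy variables, the tempfile, and the posted variable names are made up, and the strata() option is left out to keep the example small. It also reads the coefficients under their factor-variable names: after regress Y i.X with base level 1, Stata stores them as 2.X, 3.X and 4.X, so as far as I can tell _b[X1] would not resolve unless the model contained a variable actually named X1.

```stata
* Hypothetical toy data standing in for combined_data.dta (all values simulated)
clear
set seed 12345
set obs 4000
generate country = ceil(10*runiform())        // 10 made-up countries
generate year    = 2000 + ceil(5*runiform())  // 5 made-up years
generate X       = ceil(4*runiform())         // categorical, 1 to 4
generate Y       = 1 + 0.5*X + rnormal()

* 60 percent of the full data, computed once before the loop
local m = round(0.6*_N)

tempname results
tempfile buf
postfile `results' rep b_cons se_cons b_X2 se_X2 b_X3 se_X3 b_X4 se_X4 ///
    using "`buf'", replace

forvalues i = 1/5 {
    preserve                                  // keep a copy of the full data
    bsample `m'                               // draw m observations with replacement
    collapse (mean) Y, by(country year X)
    quietly regress Y i.X
    * base level is X == 1, so the estimated levels are 2.X, 3.X and 4.X
    post `results' (`i') (_b[_cons]) (_se[_cons]) ///
        (_b[2.X]) (_se[2.X]) (_b[3.X]) (_se[3.X]) (_b[4.X]) (_se[4.X])
    restore                                   // bring the full data back for the next draw
}

postclose `results'
use "`buf'", clear
list
```

Because the data are restored at the bottom of each pass, every iteration resamples from the full data, so the posted rows should differ from one replication to the next. Run it as a do-file, since the /// line continuations and local macros do not carry over when lines are pasted one at a time.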
Nusrat Jimi

Join Date: Mar 2018

Posts: 10
#3

01 Jul 2022, 14:27

Thank you for the explanation. Much appreciated.
Nusrat Jimi

Join Date: Mar 2018

Posts: 10
#4

14 Jul 2022, 03:28

William Lisowski Hi. I have another question related to this bootstrapping exercise. Is there any industry standard or rule of thumb on the size of the bootstrap sample? I mean, for my above-mentioned problem, whether I should choose 60 percent of the enterprise level data or 80 percent or 90 percent. I have been searching for a guideline but have not been able to find something specific.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

14 Jul 2022, 04:34

I'm afraid my expertise does not include bootstrap methodology - your problems were simple issues of Stata syntax.
Nusrat Jimi

Join Date: Mar 2018

Posts: 10
#6

14 Jul 2022, 09:30

No problem. Thank you for the quick response.
Rich Goldstein

Join Date: Mar 2014

Posts: 4546
#7

14 Jul 2022, 11:31

The question in #4 is confusing to me: the standard is to use the full sample and to resample from it with replacement, which Stata does for you. Why do you want to do something different?
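
As a point of reference, the classic approach can be sketched with the bootstrap prefix, which by default draws resamples of the same size as the data in memory. This is only an illustration on Stata's shipped auto dataset, not the enterprise data from #1, and it does not include the collapse step; the number of replications and the seed are arbitrary choices.

```stata
* Illustration only: auto.dta stands in for the real data
sysuse auto, clear

* Each replication resamples all _N observations with replacement,
* reruns the regression, and the spread of the coefficients across
* replications gives the bootstrap standard errors
bootstrap _b, reps(200) seed(12345): regress mpg weight
```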
Nusrat Jimi

Join Date: Mar 2018

Posts: 10
#8

15 Jul 2022, 09:57

Hi Rich Goldstein, thank you for your comment.
I know that the classic idea of the bootstrap is to use the full sample and to resample from it with replacement. But there is also a size option in the bootstrap command. I also read some discussions suggesting that, for a large data set, we do not necessarily need to choose the same size as the original sample. https://stats.stackexchange.com/ques...-large-dataset

I am new to bootstrapping and eager to know more. Any advice would be appreciated.
Rich Goldstein

Join Date: Mar 2014

Posts: 4546
#9

15 Jul 2022, 11:17

A lot, of course, depends on who your audience is. However, unless the N is very large (over a billion), I would not do this, especially since you must be careful not to run into problems with m/n (about which you are making assumptions). So, personally, given the N in #1 above, I would use the classic method (which will also reduce the amount of explanatory text needed).
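
For readers who want to see what the size() option mentioned in #8 does, here is a small sketch, again on the shipped auto dataset (74 observations) rather than the data from #1. The value 50, the number of replications, and the seed are arbitrary assumptions chosen for illustration.

```stata
* Illustration only: auto.dta has 74 observations
sysuse auto, clear

* Resamples of 50 observations instead of the default _N = 74
bootstrap _b, reps(200) seed(12345) size(50): regress mpg weight
```

Running the same command without size(50) gives the default full-size resamples for comparison. With smaller resamples the coefficients would generally vary more across replications, so the reported standard errors need not match what the full sample supports unless they are adjusted, which is one way the m/n issue shows up.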
Nusrat Jimi

Join Date: Mar 2018

Posts: 10
#10

15 Jul 2022, 11:41

Thank you Rich Goldstein for the explanation.