
  • Bootstrap SE for difference in margins

    My current code runs a nonparametric regression, then, for each observation, computes the difference in predicted margins between the observed value of the covariate and that value plus a constant, and averages those differences. Bootstrapping this program takes quite a long time. I was wondering whether there is anything I can do to speed up the process, or any preexisting routine that would help (Stata 17 MP). Below is some example code with a publicly accessible dataset, just for illustration.

    Code:
    clear all

    use "http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta"

    program define adj_margins_diff, rclass
        // fit the nonparametric regression once, outside the loop
        npregress kernel lwage exper
        scalar sum_diff = 0
        // loop over the 400 observations used in this illustration
        forvalues i = 1/400 {
            scalar exper_i = exper[`i']
            scalar adj_i   = exper_i + 3
            // predicted margin at the shifted value
            margins, at(exper=`=adj_i')
            scalar b1 = r(b)[1,1]
            // predicted margin at the observed value
            margins, at(exper=`=exper_i')
            scalar b2 = r(b)[1,1]
            scalar sum_diff = sum_diff + b1 - b2
        }
        scalar adj_avg_diff = sum_diff/400
        return scalar adj_avg_diff = adj_avg_diff
    end

    bootstrap r(adj_avg_diff), reps(1000): adj_margins_diff


  • #2
    npregress is slow, and I assume that this is the bottleneck. A small improvement is to run margins with the nose option to avoid the computation of standard errors; for example:
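
    (A sketch of one margins call from the loop in #1 with nose added; whether this saves time depends on whether standard errors are being computed in the first place.)

    Code:
    margins, at(exper=`=adj_i') nose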
    Best wishes

    Stata 18.0 MP | ORCID | Google Scholar



    • #3
      Thanks for your reply.

      I don't believe npregress is what's slowing things down; my specific specification is relatively quick when run in isolation, and it runs outside of the loop over observations.

      It seems to be all of the calls to margins. I have anecdotally confirmed this both by adding an additional line that uses margins and by increasing the loop length -- the computation time scales highly nonlinearly in both cases.

      By default, my version of Stata does not compute SEs with the margins command after npregress. Adding the nose option does not appear to change the output, and I checked that the computation time is the same as well.

      Also, I should point out a slight mistake in my code as originally posted: b1 and b2 should be declared as scalars, not matrices.
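
      (For concreteness, this is the kind of timing check I mean, using Stata's built-in timer commands around one margins call from the loop in #1:)

      Code:
      timer clear
      timer on 1
      margins, at(exper=`=adj_i') nose
      timer off 1
      timer list 1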
      Last edited by Paul Bousquet; 15 May 2024, 11:51.



      • #4
        You can also run the bootstrap replications in parallel, if you have plenty of CPU threads available (a sketch follows below the link):
        https://github.com/gvegayon/parallel
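
        Roughly along these lines (an untested sketch; the cluster count is an assumption to be set to your machine's core count, and you should check help parallel for the exact syntax of parallel bs):

        Code:
        ssc install parallel              // one-time installation
        parallel setclusters 4            // number of child Stata processes (adjust)
        parallel bs r(adj_avg_diff), reps(1000): adj_margins_diff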

        EDIT: and you can call margins only once per observation by typing:

        Code:
        ...
        margins, at(exper = (`=adj_i' `=exper_i'))
        scalar b1 = r(table)[1,1]
        scalar b2 = r(table)[1,2]
        ...

        Last edited by Felix Bittmann; 15 May 2024, 12:41.
        Best wishes

        Stata 18.0 MP | ORCID | Google Scholar



        • #5
          Running in parallel helped quite a bit, thanks. Consolidating the margins calls shaved a little more time off. Overall, I'm still at around 30 seconds per rep, but it seems like that's as good as it'll get.



          • #6
            This should be much faster; at least it is for the data example in #1 (which, as originally posted, had a couple of errors, by the way).

            Code:
            program define adj_margins_diff2 , rclass

                npregress kernel lwage exper

                mata {

                    // unique levels of exper among the first 400 observations,
                    // with their frequencies appended as a second column
                    observed_weight = uniqrows(st_data((1::400),"exper"),1)

                    // hand the levels (and the levels + 3) to Stata as locals
                    st_local("at_observed", invtokens(strofreal(observed_weight[,1]')))
                    st_local("at_adjusted", invtokens(strofreal(observed_weight[,1]':+3)))

                }

                // one margins call per set of levels, not one per observation
                margins , at(exper=(`at_observed'))
                matrix b1 = r(b)

                margins , at(exper=(`at_adjusted'))
                matrix b2 = r(b)

                // frequency-weighted mean of the level-wise differences
                mata : st_numscalar(                                              ///
                    "adj_avg_diff",                                               ///
                    mean((st_matrix("b2"):-st_matrix("b1"))',observed_weight[,2]) ///
                    )

                return scalar adj_avg_diff = adj_avg_diff

            end
            The key idea is that exper will have far fewer levels than there are observations in the dataset. If that is so, then the approaches in #1 and #4 estimate the same margins repeatedly. I suggest estimating the margins for each level only once and then computing a frequency-weighted average; this avoids the loop over observations entirely. Also, I use matrices instead of scalars, which will probably require more memory but should be much faster. As a bonus, Mata's mean() might be more accurate than summing in Stata; but the differences for the example in #1 do not appear before the 14th digit, so don't read too much into that.
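
            The program drops into the bootstrap call from #1 unchanged, e.g.:

            Code:
            bootstrap r(adj_avg_diff), reps(1000): adj_margins_diff2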
            Last edited by daniel klein; 16 May 2024, 14:03.



            • #7
              Thank you so much Daniel -- this is exactly what I was looking for initially. I had tried working with the unique values as well, but my code was apparently too cumbersome to actually get the efficiency gains. This plus running in parallel makes about as big a difference as running the original with versus without parallel (about 50% faster). It still takes quite a long time, but the time per rep seems to decrease as the rep count increases.

