
  • Bootstrap SE for difference in margins

    My current code runs a nonparametric regression, then, for each observation, computes the difference in predicted margins between the observed value of the covariate and that value plus a constant, and averages those differences. Bootstrapping this program takes quite a long time. I was wondering whether there is anything I can do to speed up the process, or any preexisting routine that would help (Stata 17 MP). Below is some example code with a publicly accessible dataset, just for illustration.

    Code:
    clear all

    use "http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta"

    program define adj_margins_diff, rclass
        // fit the nonparametric regression once, outside the loop
        npregress kernel lwage exper
        scalar sum_diff = 0
        // loop over the 400 observations used in this illustration
        forvalues i = 1/400 {
            scalar exper_i = exper[`i']
            scalar adj_i   = exper_i + 3
            // predicted margin at the shifted value
            margins, at(exper=`=adj_i')
            scalar b1 = r(b)[1,1]
            // predicted margin at the observed value
            margins, at(exper=`=exper_i')
            scalar b2 = r(b)[1,1]
            scalar sum_diff = sum_diff + b1 - b2
        }
        scalar adj_avg_diff = sum_diff/400
        return scalar adj_avg_diff = adj_avg_diff
    end

    bootstrap r(adj_avg_diff), reps(1000): adj_margins_diff


  • #2
    npregress is slow, and I assume that this is the bottleneck. A small improvement is to run margins with the nose option to avoid the computation of standard errors; for example:
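
    (A sketch of one margins call from the loop in #1 with nose added; whether this saves time depends on whether standard errors are being computed in the first place.)

    Code:
    margins, at(exper=`=adj_i') nose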
    Best wishes

    Stata 18.0 MP | ORCID | Google Scholar



    • #3
      Thanks for your reply.

      I don't believe npregress is what's slowing things down; my specific specification is relatively quick when run in isolation, and it runs outside of the loop over observations.

      It seems to be all of the calls to margins. I have anecdotally confirmed this both by adding an additional line that uses margins and by increasing the loop length -- the computation time scales highly nonlinearly in both cases.

      By default, my version of Stata does not compute SEs with the margins command after npregress. Adding the nose option does not appear to change the output, and I checked that the computation time is the same as well.

      Also, I should point out a slight mistake in my code as originally posted: b1 and b2 should be declared as scalars, not matrices.
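
      (For concreteness, this is the kind of timing check I mean, using Stata's built-in timer commands around one margins call from the loop in #1:)

      Code:
      timer clear
      timer on 1
      margins, at(exper=`=adj_i') nose
      timer off 1
      timer list 1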
      Last edited by Paul Bousquet; 15 May 2024, 11:51.



      • #4
        You can also run the bootstrap replications in parallel, if you have plenty of CPU threads available (a sketch follows below the link):
        https://github.com/gvegayon/parallel
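
        Roughly along these lines (an untested sketch; the cluster count is an assumption to be set to your machine's core count, and you should check help parallel for the exact syntax of parallel bs):

        Code:
        ssc install parallel              // one-time installation
        parallel setclusters 4            // number of child Stata processes (adjust)
        parallel bs r(adj_avg_diff), reps(1000): adj_margins_diff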

        EDIT: and you can call margins only once per observation by typing:

        Code:
        ...
        margins, at(exper = (`=adj_i' `=exper_i'))
        scalar b1 = r(table)[1,1]
        scalar b2 = r(table)[1,2]
        ...

        Last edited by Felix Bittmann; 15 May 2024, 12:41.
        Best wishes

        Stata 18.0 MP | ORCID | Google Scholar



        • #5
          Running in parallel helped quite a bit, thanks. Consolidating the margins calls shaved a little more time off. Overall, I'm still at around 30 seconds per rep, but it seems like that's as good as it'll get.



          • #6
            This should be much faster; at least it is for the data example in #1 (which, as originally posted, had a couple of errors, by the way).

            Code:
            program define adj_margins_diff2 , rclass

                npregress kernel lwage exper

                mata {

                    // unique levels of exper among the first 400 observations,
                    // with their frequencies appended as a second column
                    observed_weight = uniqrows(st_data((1::400),"exper"),1)

                    // hand the levels (and the levels + 3) to Stata as locals
                    st_local("at_observed", invtokens(strofreal(observed_weight[,1]')))
                    st_local("at_adjusted", invtokens(strofreal(observed_weight[,1]':+3)))

                }

                // one margins call per set of levels, not one per observation
                margins , at(exper=(`at_observed'))
                matrix b1 = r(b)

                margins , at(exper=(`at_adjusted'))
                matrix b2 = r(b)

                // frequency-weighted mean of the level-wise differences
                mata : st_numscalar(                                              ///
                    "adj_avg_diff",                                               ///
                    mean((st_matrix("b2"):-st_matrix("b1"))',observed_weight[,2]) ///
                    )

                return scalar adj_avg_diff = adj_avg_diff

            end
            The key idea is that exper will have far fewer levels than there are observations in the dataset. If that is so, then the approaches in #1 and #4 estimate the same margins repeatedly. I suggest estimating the margins for each level only once and then computing a frequency-weighted average; this avoids the loop over observations entirely. Also, I use matrices instead of scalars, which will probably require more memory but should be much faster. As a bonus, Mata's mean() might be more accurate than summing in Stata; but the differences for the example in #1 do not appear before the 14th digit, so don't read too much into that.
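
            The program drops into the bootstrap call from #1 unchanged, e.g.:

            Code:
            bootstrap r(adj_avg_diff), reps(1000): adj_margins_diff2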
            Last edited by daniel klein; 16 May 2024, 14:03.



            • #7
              Thank you so much Daniel -- this is exactly what I was looking for initially. I had tried working with the unique values as well, but my code was apparently too cumbersome to actually get the efficiency gains. This plus running in parallel makes about as big a difference as running the original with versus without parallel (about 50% faster). It still takes quite a long time, but the time per rep seems to decrease as the rep count increases.

