
  • Random sampling multiple times from a large data set, collapse at the country year level, and run regression

    I have a large dataset that contains information on 160,000 enterprises from 43 countries over 15 years. The dataset is unique at the enterprise level.

    I want to do the following 1000 times and store the regression coefficients and standard errors each time:
    • Randomly draw a certain fraction (say 60 percent) of the data
    • Collapse the dataset at the country-year level
    • Run a regression of Y on X (variable X is a categorical variable that goes from 1 to 4)
    Following is my example code:

    use combined_data.dta,clear

    tempname buffer
    capture postutil clear
    postfile `bootstrap' observation intercept se_constant X1 se_X1 X2 se_X2 X3 se_X3 using buffer.dta, replace

    quietly {

    forvalues i = 1(1)1000 {

    bsample round(0.8*_N), strata(country year X)

    collapse (mean) Y, by (country year X)

    regress Y i.X

    post `bootstrap' (`i') (`=_b[_cons]') (`=_se[_cons]') (`=_b[X1]') (`=_se[X1]') (`=_b[X2]') (`=_se[X2]') (`=_b[X3]') (`=_se[X3]')
    }

    }

    postclose `buffer'
    use buffer.dta, clear

    However, it gives me the same coefficients and standard errors all 1000 times. What am I doing wrong?
    Any help would be much appreciated.
    Last edited by Nusrat Jimi; 01 Jul 2022, 06:08.

  • #2
    Perhaps this will start you in a useful direction to finding out why your program did not work.

    Some general principles:

    1) If your code isn't known to work, don't run it 1000 times. I changed 1000 to 5 in the code below; change the 5 back to 1000 once the code works.
    2) If your code isn't known to work, don't run it "quietly", because then you can't see whether the intermediate results are what you expect. I changed quietly to noisily in the code below; change it back once the code works.
    3) bsample and collapse both replace the data in memory, so you need to preserve your original data at the top of the loop and restore it at the bottom (there are approaches other than preserve and restore, but this is what I chose).
    4) If you want a 60% sample, multiply _N by .6, not .8.

    Code:
    tempname bootstrap
    capture postutil clear
    postfile `bootstrap' observation intercept se_constant X1 se_X1 X2 se_X2 X3 se_X3 using buffer.dta, replace
    
    noisily {
    
    forvalues i = 1(1)5 {
    
    preserve
    bsample round(0.6*_N), strata(country year X)
    collapse (mean) Y, by (country year X)
    regress Y i.X
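    * i.X with X coded 1 to 4 uses level 1 as the base, so the coefficients are named 2.X, 3.X, and 4.X
    post `bootstrap' (`i') (`=_b[_cons]') (`=_se[_cons]') (`=_b[2.X]') (`=_se[2.X]') (`=_b[3.X]') (`=_se[3.X]') (`=_b[4.X]') (`=_se[4.X]')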
    restore
    
    }
    
    }
    
    postclose `bootstrap'
    use buffer.dta, clear
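
    A quick way to confirm that the replications now differ is to look at the spread of the posted values. This is only a minimal sketch: it assumes buffer.dta was written by the loop above and uses the variable names from the postfile line.

    ```stata
    use buffer.dta, clear
    * One row per replication; the values should vary from row to row
    list observation intercept X1 X2 X3
    * A standard deviation of 0 would mean every replication returned the same estimate
    summarize intercept X1 X2 X3
    ```

    If the standard deviations are all 0, the loop is still re-using the same data on every pass.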



    • #3
      Thank you for the explanation. Much appreciated.



      • #4
        William Lisowski Hi. I have another question related to this bootstrapping exercise. Is there any industry standard or rule of thumb on the size of the bootstrap sample? I mean, for my above-mentioned problem, whether I should choose 60 percent of the enterprise level data or 80 percent or 90 percent. I have been searching for a guideline but have not been able to find something specific.



        • #5
          I'm afraid my expertise does not include bootstrap methodology - your problems were simple issues of Stata syntax.



          • #6
            No problem. Thank you for the quick response.



            • #7
              The question in #4 is confusing to me, as the standard is to use the full sample and to resample from it with replacement, which Stata does for you. Why do you want to do something different?



              • #8
                Hi Rich Goldstein, thank you for your comment.
                I know that the classic idea of the bootstrap is to use the full sample and to resample from it with replacement. But there is also a size() option in the bootstrap command, and I have read discussions suggesting that, for a large data set, the resample does not necessarily need to be the same size as the original sample. https://stats.stackexchange.com/ques...-large-dataset

                I am new to bootstrapping and eager to know more. Any advice would be appreciated.



                • #9
                  A lot, of course, depends on who your audience is. However, unless the N is very large (over a billion), I would not do this, especially since you must be careful not to run into problems with m/n (about which you are making assumptions). So, personally, given the N in #1 above, I would use the classic method (which will also reduce the amount of explanatory text needed).
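
                  The classic method could be sketched by reusing the loop from #2 and dropping the size argument, so that bsample resamples with replacement at the original size within each stratum. This is an illustration under assumptions, not a tested solution: the data set and variable names come from #1, the seed and the output file name buffer_full.dta are arbitrary, and the coefficient names 2.X to 4.X assume X is coded 1 to 4 with level 1 as the base.

                  ```stata
                  use combined_data.dta, clear
                  * Arbitrary seed (an assumption), set only so the draws are reproducible
                  set seed 12345

                  tempname bootstrap
                  capture postutil clear
                  postfile `bootstrap' rep b_cons se_cons b_2 se_2 b_3 se_3 b_4 se_4 using buffer_full.dta, replace

                  * Raise 5 to 1000 once the loop is known to work
                  forvalues i = 1/5 {
                      preserve
                      * No size given: each stratum is resampled with replacement at its own size
                      bsample, strata(country year X)
                      collapse (mean) Y, by(country year X)
                      regress Y i.X
                      post `bootstrap' (`i') (_b[_cons]) (_se[_cons]) (_b[2.X]) (_se[2.X]) (_b[3.X]) (_se[3.X]) (_b[4.X]) (_se[4.X])
                      restore
                  }

                  postclose `bootstrap'
                  use buffer_full.dta, clear
                  summarize b_cons b_2 b_3 b_4
                  ```

                  The standard deviations reported by summarize are then the bootstrap standard errors of the corresponding coefficients.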



                  • #10
                    Thank you Rich Goldstein for the explanation.
