
  • Drawing random sample multiple times from a large data set

    Hi, I would like to know how I can draw a smaller sample (of, say, 20000) from an already existing large data set, such as the Demographic Health Survey, using Monte Carlo simulation. I want to use 1000 repetitions to generate a beta coefficient value to check its consistency. I tried something like this, but it gave me the same constant value for all 1000 betas. I would be glad if anyone could point out my mistake.

    Code:
    gen beta=.
    
    quietly{
    forvalues i=1(1)1000 {
       
        preserve
        
        //generating a random number and drawing first 20000 as samples from the data//
        set seed 135790
        gen random=runiform()
        sort random
        gen insample=_n<=20000
    
        //Panel regression//
    
        xtset id time
        xtreg y lag_y x1 x2 x3 x4 x5 //The variables are obtained from the already available data set//
        
        local coeff=_b[lag_y]
        restore 
        
        //Store the beta values//
        replace beta=`coeff' in `i'
        
        }
    }
    
    summ beta
    Thanks in advance!

  • #2
    Take your -set seed- command outside (and before) the loop! By resetting the seed on each iteration of the loop you are forcing Stata to repeat the same random numbers instead of progressing through the stream to new ones.
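
    A minimal illustration of what resetting the seed does (the displayed values are placeholders, not actual output):

    ```stata
    set seed 135790
    display runiform()   // returns some draw, call it u1

    set seed 135790      // resets the generator to the same starting point
    display runiform()   // returns exactly u1 again, not a new draw
    ```

    Moving -set seed- to before the loop lets each iteration continue through the random-number stream, so every replication gets a fresh shuffle.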

    Added: Although this is unrelated to the question you raised, I note that you are slowing this code down enormously by needlessly thrashing your disk. There is no need to -preserve- and -restore- your data set, given that you don't do anything destructive to the data in each run. To gain this efficiency, however, you need to save your betas in a postfile rather than in the original data set. (There is no reason to save them in the original data set anyway; in fact, it's bad data management practice, because there is no connection between the pre-existing data in a given observation and the beta coefficient you happen to end up putting in that particular observation.) Also, there is no need to "launder" _b[lag_y] through a local macro in order to save it: you can post it directly and thereby avoid a potential loss of precision.

    Code:
    gen random = .
    set seed 135790
    
    xtset id time
    
    
    capture postutil clear
    postfile handle int i double beta using betas, replace
    
    quietly{
        forvalues i=1(1)1000 {
      
            //generating a random number and drawing first 20000 as samples from the data//
            replace random=runiform()
            sort random
    
            //Panel regression//
    
            xtreg y lag_y x1 x2 x3 x4 x5 in 1/20000 //The variables are obtained from the already available data set//
            
            //Store the beta values//
           post handle (`i') (_b[lag_y])
        
        }
    }
    
    postclose handle
    use betas, clear
    summ beta
    Last edited by Clyde Schechter; 19 Dec 2018, 12:34.



    • #3
      As far as I can see, in these solutions the panel data structure is not being respected, which is not necessarily a good idea. Panel data estimators such as the random-effects regression above are justified as the cross-sectional dimension goes to infinity, with the time-series dimension taken as fixed/given.

      I would sub-sample cross sectional units if I had to do something like this, and not individual observations.

      See the thread below for related discussion, including a couple of variants of this sampling that respect the panel data structure.

      https://www.statalist.org/forums/for...rom-panel-data



      • #4
        Joro Kolev makes several good points here. Simply selecting random observations, rather than random id's, for this purpose may give you results that don't mean what you want them to mean.



        • #5
          This is indeed very helpful. I would really appreciate it if you could help me with code that preserves the panel structure of the data.



          • #6
            It's not all that dissimilar to what you already have. Presumably you won't want to sample 20,000 id's but some smaller number that will, on average, result in about 20,000 observations being selected. Since I know nearly nothing about your data, I have arbitrarily written the code to select 1500 id's. Modify according to your needs.

            Code:
            gen random = .
            gen byte in_sample = .
            set seed 135790
            
            xtset id time
            
            by id, sort: gen flag = 1 if _n == 1    // NOTE: CODED AS 1/. SO  FLAGS SORT FIRST
            
            capture postutil clear
            postfile handle int i double beta using betas, replace
            
            quietly{
                forvalues i=1(1)1000 {
              
                    //generating a random number and drawing first 1500 IDs as samples from the data//
                    replace random=runiform() if flag == 1
                    sort flag (random)
                    replace in_sample = (_n <= 1500)
                    by id (flag random), sort: replace in_sample = in_sample[1]
                    
                   //Panel regression//
            
                    xtreg y lag_y x1 x2 x3 x4 x5 if in_sample //The variables are obtained from the already available data set//
                   
                    //Store the beta values//
                   post handle (`i') (_b[lag_y])
                
                }
            }
            
            postclose handle
            use betas, clear
            summ beta
            The logic is to first flag a single observation from each id. Then random numbers are assigned just to those flagged observations. The observations are sorted on the random number (for the flagged observations) and the first 1500 are marked as in sample. The in-sample designation is then spread to all other observations in each id group, and the regression is performed on those observations. Note that, because I am assuming that different id's can have different numbers of observations, the total number of observations sampled will vary from one iteration of the loop to the next--hence the need to run the -xtreg- command -if in_sample- rather than specifying a fixed number of observations with -in-.



            • #7
              Thanks a ton! This is very helpful indeed. I can clearly get the logic now. Will try this on my dataset. Thank you once again.



              • #8
                Clyde Schechter presents excellent code above, and it is worth studying in detail because, on top of accomplishing the task at hand quickly, it also illustrates many useful techniques that come in handy when manipulating panel data.

                Nevertheless, I will also present my solution to the problem below. My code is inferior to Clyde's in terms of speed, but it illustrates a technique that is crucial and should be mastered by any student of panel data: switching between the long and wide forms of the panel data, and operating on whichever form is more convenient for the task at hand.

                Here, once I switch to the wide format, our problem of "respecting the panel data structure" reduces to simple sampling of cross-sectional data.

                Code:
                timer clear
                timer on 1

                webuse grunfeld, clear
                set seed 135790

                // one row per company: sampling rows now samples whole panels
                reshape wide invest mvalue kstock time, i(company) j(year)

                capture postutil clear
                postfile handle int i double beta using betas, replace

                quietly forvalues i = 1(1)1000 {

                    // shuffle the wide data and keep the first 5 companies
                    gen random = runiform()
                    sort random
                    drop random

                    preserve
                    keep in 1/5
                    reshape long

                    // Panel regression on the sub-sampled companies
                    xtset company year
                    xtreg invest mvalue kstock

                    post handle (`i') (_b[mvalue])
                    restore
                }

                postclose handle
                use betas, clear
                summ beta

                timer off 1
                timer list



                • #9
                  My code runs for 154.05 seconds; Clyde's code (adapted to the dataset I used as an example) runs for 43.77 seconds. Yet I think my code is easier to read for a user who is comfortable with casually switching between the wide and long formats of panel data. (Also, I suspect I am slowing my code down considerably with the -preserve/restore- implementation; if I just flagged the sample as Clyde does, my code would probably speed up a bit too.)



                  • #10
                    #6

                    Hi Clyde,
                    Could you please suggest what changes I would need to make to the command in #6 to save all the coefficients and standard errors? I have tried your command on the Grunfeld web data.


                    Code:
                    webuse grunfeld , clear
                    gen random = .
                    gen byte in_sample = .
                    set seed 135790
                    
                    xtset company year
                    
                    by company, sort: gen flag = 1 if _n == 1    // NOTE: CODED AS 1/. SO  FLAGS SORT FIRST
                    
                    capture postutil clear
                    postfile handle int i double beta using betas, replace
                    
                    quietly{
                        forvalues i=1(1)1000 {
                      
                            //generating a random number and drawing first 30 IDs as samples from the data//
                            replace random=runiform() if flag == 1
                            sort flag (random)
                            replace in_sample = (_n <= 30)
                            by company (flag random), sort: replace in_sample = in_sample[1]
                            
                           //Panel regression//
                    
                            xtreg invest mvalue kstock if in_sample //From the grunfeld web dataset//
                          
                            //Store the beta values//
                           post handle (`i') (_b[mvalue])
                        
                        }
                    }
                    
                    postclose handle
                    use betas, clear
                    summ beta



                    The command works well. It is just that I do not know what adjustments I should make to the code to have both the coefficients (mvalue and kstock) and their standard errors saved in the postfile. Please accept my apologies if I have violated any posting norms; it took quite some time to figure out how to post.
                    Thank you
                    Vikas
                    Last edited by Dev Vikas; 13 Dec 2021, 13:47.



                    • #11
                      Well, I'm not sure exactly what you want for the final output. When you -summ beta- at the end, you get the mean, standard deviation, and range of the coefficients. You can do that for the standard errors as well, but I don't think any of that is meaningful. So in the code below, I assume you just want the mean values (although mean values of standard errors, z scores, p-values, and confidence limits aren't meaningful either). Anyway, take a look at what this gives you, and then perhaps you can modify the exact output from there.

                      Note also that I save these results in a tempfile rather than a permanent file--just to avoid cluttering up my drive while editing the code. But you might want to save the details in a real file. So you can just put a -save- command in before the -collapse- command if you want to do that.

                      Code:
                      webuse grunfeld , clear
                      gen random = .
                      gen byte in_sample = .
                      set seed 135790
                      
                      xtset company year
                      
                      by company, sort: gen flag = 1 if _n == 1    // NOTE: CODED AS 1/. SO  FLAGS SORT FIRST
                      
                      capture postutil clear
                      tempfile all_estimates
                      postfile handle int i str32 vble double b se z p ll ul using `all_estimates', replace
                      
                      quietly{
                          forvalues i=1(1)1000 {
                       
                              //generating a random number and drawing first 30 IDs as samples from the data//
                              replace random=runiform() if flag == 1
                              sort flag (random)
                              replace in_sample = (_n <= 30)
                              by company (flag random), sort: replace in_sample = in_sample[1]
                              
                             //Panel regression//
                      
                        xtreg invest mvalue kstock if in_sample //From the grunfeld web dataset//
                            
                              //Store the estimate values//
                              matrix T = r(table)
                              
                             foreach v in mvalue kstock _cons {
                              post handle (`i') ("`v'") (T["b", "`v'"]) (T["se", "`v'"]) ///
                                  (T["z", "`v'"]) (T["pvalue", "`v'"]) (T["ll", "`v'"]) ///
                                  (T["ul", "`v'"])
                             }
                          }
                      }
                      
                      
                      postclose handle
                      use `all_estimates', clear
                      collapse (mean) b se z p ll ul, by(vble)
                      list, noobs clean



                      • #12
                        Hi Clyde,
                        Thank you for your quick response; this works for me. My aim in saving the standard errors along with the betas is to see, when the regression on 30 random samples is repeated 1000 times, what fraction of the betas would be significant (null rejected). I could not think of a smarter way than to save the standard errors along with the coefficients.
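
                        A sketch of one way to get that rejection fraction from the estimates saved in #11 (this assumes the postfile variables vble, b, and se from that code, and a two-sided 5% test; -xtreg- reports z statistics, so the normal critical value applies):

                        ```stata
                        use `all_estimates', clear
                        keep if vble == "mvalue"

                        // 1 if the null of a zero coefficient is rejected at the 5% level
                        gen byte reject = abs(b/se) > invnormal(0.975)

                        // the mean of reject is the fraction of replications rejecting the null
                        summ reject
                        ```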
                        Many thanks
                        Vikas
