
  • Drawing random sample multiple times from a large data set

    Hi, I would like to know how I can draw a smaller sample (of, say, 20000) from an already existing large data set, such as the Demographic Health Survey, using Monte Carlo simulation. I want to use 1000 repetitions to generate a beta coefficient value to check its consistency. I tried something like this, but it gave me the same constant value for all 1000 betas. I would be glad if anyone could point out my mistake.

    Code:
    gen beta=.
    
    quietly{
    forvalues i=1(1)1000 {
       
        preserve
        
        //generating a random number and drawing first 20000 as samples from the data//
        set seed 135790
        gen random=runiform()
        sort random
        gen insample=_n<=20000
    
        //Panel regression//
    
        xtset id time
        xtreg y lag_y x1 x2 x3 x4 x5 //The variables are obtained from the already available data set//
        
        local coeff=_b[lag_y]
        restore 
        
        //Store the beta values//
        replace beta=`coeff' in `i'
        
        }
    }
    
    summ beta
    Thanks in advance!

  • #2
    Take your -set seed- command outside (and before) the loop! By resetting the seed on each iteration of the loop you are forcing Stata to repeat the same random numbers instead of progressing through the stream to new ones.
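
    A minimal illustration of what resetting the seed does (the displayed values are placeholders, not actual output):

    ```stata
    set seed 135790
    display runiform()   // returns some draw, call it u1

    set seed 135790      // resets the generator to the same starting point
    display runiform()   // returns exactly u1 again, not a new draw
    ```

    Moving -set seed- to before the loop lets each iteration continue through the random-number stream, so every replication gets a fresh shuffle.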

    Added: Although this is unrelated to the question you raised, I note that you are slowing this code down enormously by needlessly thrashing your disk. There is no need to -preserve- and -restore- your data set, given that you don't do anything destructive to the data in each run. To gain this efficiency, however, you need to save your betas in a postfile rather than in the original data set. (There is no reason to save them in the original data set anyway; in fact, it's bad data management practice, because there is no connection between the pre-existing data in a given observation and the beta coefficient you happen to end up putting in that particular observation.) Also, there is no need to "launder" _b[lag_y] through a local macro in order to save it: you can post it directly and thereby avoid a potential loss of precision.

    Code:
    gen random = .
    set seed 135790
    
    xtset id time
    
    
    capture postutil clear
    postfile handle int i double beta using betas, replace
    
    quietly{
        forvalues i=1(1)1000 {
      
            //generating a random number and drawing first 20000 as samples from the data//
            replace random=runiform()
            sort random
    
            //Panel regression//
    
            xtreg y lag_y x1 x2 x3 x4 x5 in 1/20000 //The variables are obtained from the already available data set//
            
            //Store the beta values//
           post handle (`i') (_b[lag_y])
        
        }
    }
    
    postclose handle
    use betas, clear
    summ beta
    Last edited by Clyde Schechter; 19 Dec 2018, 12:34.



    • #3
      As far as I can see, in these solutions the panel data structure is not being respected, which is not necessarily a good idea. Panel data estimators such as the random-effects regression above are justified as the cross-sectional dimension goes to infinity, with the time-series dimension taken as fixed/given.

      I would sub-sample cross sectional units if I had to do something like this, and not individual observations.

      See the thread below for related discussion, including a couple of variants of this sampling that respect the panel data structure.

      https://www.statalist.org/forums/for...rom-panel-data



      • #4
        Joro Kolev makes several good points here. Simply selecting random observations, rather than random id's, for this purpose may give you results that don't mean what you want them to mean.



        • #5
          This is indeed very helpful. I would really appreciate it if you could help me with code that preserves the panel structure of the data.



          • #6
            It's not all that dissimilar to what you already have. Presumably you won't want to sample 20,000 id's but some smaller number that will, on average, result in about 20,000 observations being selected. Since I know nearly nothing about your data, I have arbitrarily written the code to select 1500 id's. Modify according to your needs.

            Code:
            gen random = .
            gen byte in_sample = .
            set seed 135790
            
            xtset id time
            
            by id, sort: gen flag = 1 if _n == 1    // NOTE: CODED AS 1/. SO  FLAGS SORT FIRST
            
            capture postutil clear
            postfile handle int i double beta using betas, replace
            
            quietly{
                forvalues i=1(1)1000 {
              
                    //generating a random number and drawing first 1500 IDs as samples from the data//
                    replace random=runiform() if flag == 1
                    sort flag (random)
                    replace in_sample = (_n <= 1500)
                    by id (flag random), sort: replace in_sample = in_sample[1]
                    
                   //Panel regression//
            
                    xtreg y lag_y x1 x2 x3 x4 x5 if in_sample //The variables are obtained from the already available data set//
                   
                    //Store the beta values//
                   post handle (`i') (_b[lag_y])
                
                }
            }
            
            postclose handle
            use betas, clear
            summ beta
            The logic is to first flag a single observation from each id. Then random numbers are assigned just to those flagged observations. The observations are sorted on the random number (for the flagged observations) and the first 1500 are marked as in sample. The in-sample designation is then spread to all other observations in each id group, and the regression is performed on those observations. Note that, because I am assuming that different id's can have different numbers of observations, the total number of observations sampled will vary from one iteration of the loop to the next--hence the need to run the -xtreg- command -if in_sample- rather than specifying a fixed number of observations with -in-.



            • #7
              Thanks a ton! This is very helpful indeed. I can clearly get the logic now. Will try this on my dataset. Thank you once again.



              • #8
                Clyde Schechter presents excellent code above, and it is worth studying in detail because, on top of accomplishing the task at hand quickly, it also illustrates many useful techniques that come in handy when manipulating panel data.

                Nevertheless, I will also present my solution to the problem below. My code is inferior to Clyde's in terms of speed, but it illustrates a technique that is crucial and should be mastered by any student of panel data: switching between the long and wide forms of the panel data, and operating on whichever form is more convenient for the task at hand.

                Here, once I switch to the wide format, our problem of "respecting the panel data structure" reduces to simple sampling of cross-sectional data.

                Code:
                timer clear
                timer on 1

                webuse grunfeld, clear
                set seed 135790

                // one row per company: sampling rows now samples whole panels
                reshape wide invest mvalue kstock time, i(company) j(year)

                capture postutil clear
                postfile handle int i double beta using betas, replace

                quietly forvalues i = 1(1)1000 {

                    // shuffle the wide data and keep the first 5 companies
                    gen random = runiform()
                    sort random
                    drop random

                    preserve
                    keep in 1/5
                    reshape long

                    // Panel regression on the sub-sampled companies
                    xtset company year
                    xtreg invest mvalue kstock

                    post handle (`i') (_b[mvalue])
                    restore
                }

                postclose handle
                use betas, clear
                summ beta

                timer off 1
                timer list



                • #9
                  My code runs for 154.05 seconds; Clyde's code (adapted to the dataset I used as an example) runs for 43.77 seconds. Yet I think my code is easier to read for a user who is comfortable with casually switching between the wide and long formats of panel data. (Also, I suspect I am slowing my code down considerably with the -preserve/restore- implementation; if I just flagged the sample as Clyde does, my code would probably speed up a bit too.)



                  • #10
                    #6

                    Hi Clyde,
                    Could you please suggest what changes I would need to make to the command in #6 to save all the coefficients and standard errors? I have tried your command on the Grunfeld web data.


                    Code:
                    webuse grunfeld , clear
                    gen random = .
                    gen byte in_sample = .
                    set seed 135790
                    
                    xtset company year
                    
                    by company, sort: gen flag = 1 if _n == 1    // NOTE: CODED AS 1/. SO  FLAGS SORT FIRST
                    
                    capture postutil clear
                    postfile handle int i double beta using betas, replace
                    
                    quietly{
                        forvalues i=1(1)1000 {
                      
                            //generating a random number and drawing first 30 IDs as samples from the data//
                            replace random=runiform() if flag == 1
                            sort flag (random)
                            replace in_sample = (_n <= 30)
                            by company (flag random), sort: replace in_sample = in_sample[1]
                            
                           //Panel regression//
                    
                            xtreg invest mvalue kstock if in_sample //From the grunfeld web dataset//
                          
                            //Store the beta values//
                           post handle (`i') (_b[mvalue])
                        
                        }
                    }
                    
                    postclose handle
                    use betas, clear
                    summ beta



                    The command works well. It is just that I do not know what adjustments I should make to the code to have both the coefficients (mvalue and kstock) and their standard errors saved in the postfile. Please accept my apologies if I have violated any posting norms; it took quite some time to figure out how to post.
                    Thank you
                    Vikas
                    Last edited by Dev Vikas; 13 Dec 2021, 13:47.



                    • #11
                      Well, I'm not sure exactly what you want for the final output. When you -summ beta- at the end, you get the mean, standard deviation, and range of the coefficients. You can do that for the standard errors as well, but I don't think any of that is meaningful. So in the code below, I assume you just want the mean values (although mean values of standard errors, z scores, p-values, and confidence limits aren't meaningful either). Anyway, take a look at what this gives you, and then perhaps you can modify the exact output from there.

                      Note also that I save these results in a tempfile rather than a permanent file--just to avoid cluttering up my drive while editing the code. But you might want to save the details in a real file. So you can just put a -save- command in before the -collapse- command if you want to do that.

                      Code:
                      webuse grunfeld , clear
                      gen random = .
                      gen byte in_sample = .
                      set seed 135790
                      
                      xtset company year
                      
                      by company, sort: gen flag = 1 if _n == 1    // NOTE: CODED AS 1/. SO  FLAGS SORT FIRST
                      
                      capture postutil clear
                      tempfile all_estimates
                      postfile handle int i str32 vble double b se z p ll ul using `all_estimates', replace
                      
                      quietly{
                          forvalues i=1(1)1000 {
                       
                              //generating a random number and drawing first 30 IDs as samples from the data//
                              replace random=runiform() if flag == 1
                              sort flag (random)
                              replace in_sample = (_n <= 30)
                              by company (flag random), sort: replace in_sample = in_sample[1]
                              
                             //Panel regression//
                      
                        xtreg invest mvalue kstock if in_sample //From the grunfeld web dataset//
                            
                              //Store the estimate values//
                              matrix T = r(table)
                              
                             foreach v in mvalue kstock _cons {
                              post handle (`i') ("`v'") (T["b", "`v'"]) (T["se", "`v'"]) ///
                                  (T["z", "`v'"]) (T["pvalue", "`v'"]) (T["ll", "`v'"]) ///
                                  (T["ul", "`v'"])
                             }
                          }
                      }
                      
                      
                      postclose handle
                      use `all_estimates', clear
                      collapse (mean) b se z p ll ul, by(vble)
                      list, noobs clean



                      • #12
                        Hi Clyde,
                        Thank you for your quick response; this works for me. My aim in saving the standard errors along with the betas is to see, when the regression on 30 random samples is repeated 1000 times, what fraction of the betas would be significant (null rejected). I could not think of a smarter way than to save the standard errors along with the coefficients.
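
                        A sketch of one way to get that rejection fraction from the estimates saved in #11 (this assumes the postfile variables vble, b, and se from that code, and a two-sided 5% test; -xtreg- reports z statistics, so the normal critical value applies):

                        ```stata
                        use `all_estimates', clear
                        keep if vble == "mvalue"

                        // 1 if the null of a zero coefficient is rejected at the 5% level
                        gen byte reject = abs(b/se) > invnormal(0.975)

                        // the mean of reject is the fraction of replications rejecting the null
                        summ reject
                        ```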
                        Many thanks
                        Vikas
