
  • How to take a random sample of panel data and keep all person-year-observations for a particular ID

    I am trying to take a random sample from a huge unbalanced panel dataset. For the MWE data, I would like to randomly choose either 513 or 514. But whichever ID is picked at random, all year-data from that person should be kept. I call it a 'random panel sample'. I haven't found anything in the -sample- documentation.

    Code:
    clear
    input     year     pid var
            2003     513 1500    
            2004     513 1550
            2005     513 1500    
            2006     513 1600    
            2003     514 1600    
            2004     514 1600
            2005     514 1700    
            2006     514 1800            
    end
    I tried to combine -sample- with -bysort-, like this:
    Code:
    bysort pid (year): sample 1
    but it always drops all observations for me. Thank you very much.

  • #2
    Code:
    tempfile holding
    save `holding'
    
    keep pid
    duplicates drop
    
    set seed 1234
    sample 1, count
    
    merge 1:m pid using `holding', assert(match using) keep(match) nogenerate



    • #3
      That is absolutely perfect! And I'm kind of glad that I didn't just miss one simple command. This will be very valuable for my analysis.



      • #4
        Hi guys,

        Is it possible to write the code within a loop? If you want perhaps 10 different random samples?



        • #5
          Yes, but in what sense do you "want perhaps 10 different random samples"? Do you want to save them in 10 separate data sets, or have them all appended in one data set? Or perhaps you want to generate them one at a time and do some analysis on each, and then do something with those analytic results? Be more specific.



          • #6
            You're absolutely right - let me try to be more clear.

            I too have panel data, and want to investigate a change over time in an outcome. I'm planning to estimate what Singer & Willett (2003) call a "multilevel model for change". Thus, I first want to eyeball whether a linear model seems to be the right functional form at the individual level.

            What I want is 10 graphs of 10 different random subsamples. I can run it 10 times with a different seed each time, but perhaps a loop could do the job more efficiently? Especially if one wanted 100 graphs.

            Code:
            forvalues i=1/10 {
                use "data.dta", clear

                tempfile holding
                save `holding'
                keep ID
                duplicates drop
                set seed 1234`i' // This seed needs to be changed every time the code is run to get a different "random" subsample
                sample 25, count
                merge 1:m ID using `holding', assert(match using) keep(match) nogenerate

                twoway (scatter outcome time) (lfit outcome time), by(ID)
                graph save "Graph`i'"
            }
            Is this how you would solve it too, Clyde?
            Last edited by Mads Moring; 02 Sep 2020, 02:32.



            • #7
              You don't need any file choreography here. You can do it in place.

              tag each panel once

              foreach iteration {
              select a random sample of the tagged values
              spread selection to other observations in panel
              do whatever to selection
              }

              Here is a start on a framework. Concretely there are 10 panels; I select 7 of them in each of 10 samples.


              Code:
              webuse grunfeld, clear
              
              egen tag = tag(company)
              
              set seed 2803
              
              gen shuffle = .  
              gen sampled = . 
              
              forval j = 1/10 { 
                  qui replace shuffle = runiform()
                  sort tag shuffle 
              
                  * the 10 tagged values are at the end; we select 7 of them
                  qui replace sampled = inrange(_n, _N-6, _N)
              
                  * spread selection to each panel 
                  qui bysort company (sampled) : replace sampled = sampled[_N]
              
                  levelsof company if sampled 
              
                  * here is where you do something with the sample
              }
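As a hedged sketch of how the graphing step asked about in #6 might slot into this loop: the version below draws a scatter plot with a linear fit for each sampled panel and saves one graph per sample. The variables invest and year come from the grunfeld data; the graph names sample1 to sample10 are arbitrary choices made here for illustration, not anything prescribed in the thread.

```stata
webuse grunfeld, clear

egen tag = tag(company)
set seed 2803

gen shuffle = .
gen byte sampled = .

forval j = 1/10 {
    * reshuffle, then pick 7 of the tagged observations (sorted to the end)
    qui replace shuffle = runiform()
    sort tag shuffle
    qui replace sampled = inrange(_n, _N-6, _N)

    * spread the selection to every observation in the chosen panels
    qui bysort company (sampled) : replace sampled = sampled[_N]

    * one small-multiples graph per sample, restricted to the sampled panels
    twoway (scatter invest year) (lfit invest year) if sampled, ///
        by(company) name(sample`j', replace)
    graph save "sample`j'", replace
}
```

Because the seed is set once before the loop, rerunning the whole block reproduces the same 10 samples, and nothing is saved or merged between iterations.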



              • #8
                Hello everyone, I have a problem similar to Marco's, except that my data are divided into 4 groups and I would like to randomly select 25% of the IDs from each group. For any ID selected within a group, I need to keep all rows for that ID (i.e. the code should randomly select IDs within each group but keep all information about each selected ID). Your help is highly appreciated.



                • #9
                  Hello, Clyde Schechter !

                  Can I use the same command you provided in #8 for the following situation: I want to draw a sample from a balanced panel data set in a way that preserves the representativeness of the variable "region" and also keeps the panel balanced, with information on each person id for every year of the panel.

                  Thanks in advance!



                  • #10
                    I'm not sure what you are referring to. I did not write #8, and there is no code in it either. Nick Cox's code in #7 will fulfill your need, though if you are only looking for a single random sample there is no need for the -forvalues j = 1/10- loop.



                    • #11
                      I'm sorry about that, Clyde Schechter. I wrote the wrong number; I meant #2. I just tried this code:
                      local fraction = 0.10
                      local first = 2010
                      gen rand = runiform()
                      bysort year (rand): gen byte flag = (_n <= (`fraction' * _N)) & (year == `first')
                      egen keeper = max(flag), by(id)


                      But what I need is a random sample stratified by the variable region, such that every id chosen by this process is kept for the whole period, preserving the panel structure.



                      • #12
                        The code you show in #11 will choose a 10% random sample, and any id that is included in the sample will have all of its observations included. The code in #2 will choose a single pid at random (-sample 1, count- draws one observation, not 1%; change the number, or drop -count- to sample a percentage) and also includes all observations for any pid that is included. As you can see in this thread, there are many ways of going about this.

                        However, this is a simple random sample of id's, not a stratified sample. You don't say how you want to stratify, so I can't guide you in how to go about that.



                        • #13
                          Clyde Schechter , please see if this is more clear about my issue:

                          I have a variable called region in my data set that divides the data into 370 labor market areas (LMA). As each LMA has its own characteristics, I want the sample to be representative for each of the 370 LMAs. I want to select 10% of the individuals in each LMA, so that the information for each selected individual remains available from the first to the last year, since my data are a panel balanced on the individual identifier. Would you know how I can do that?



                          • #14
                            This is a slight variant on the approach above. There are several ways to do this. Here's one:

                            Code:
                            set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
                            gen double shuffle = runiform()
                            egen flag = tag(individual)
                            by flag LMA (shuffle), sort: gen byte keeper = _n >= 0.9*_N
                            by individual (flag), sort: replace keeper = keeper[_N]
                            keep if keeper
                            This is a variant of Nick Cox's approach in #7.

                            In these results, each LMA will be represented by 10% of the individuals within it, and every individual selected will retain all his/her observations.

                            Note: Untested because no sample data was provided. Beware of typos or other errors.

                            In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.



                            • #15
                              Clyde Schechter, your code in #14 does exactly what I asked for. But I did not express myself well enough to get what I am looking for. I need to select a 10% random sample of person ids, stratified across the 370 categories that make up the Region variable. From that I expect to generate weights so that, afterwards, I can run the regression using these weights. I would be very grateful for any help with this!


                              ----------------------- copy starting from the next line -----------------------
                              Code:
                              * Example generated by -dataex-. To install: ssc install dataex
                              clear
                              input double personid int Region byte(Education age Regionsize)
                              10000521121 250 4 53 3
                              10000521121 250 4 54 3
                              10000521121 215 4 55 3
                              10000521121 251 4 56 1
                              10000521121 255 4 57 3
                              10000521121 255 4 58 2
                              10000521121 255 4 59 3
                              10000521121 255 3 60 1
                              10000521121 255 3 61 3
                              10000523337 215 4 52 3
                              10000523337 215 4 53 3
                              10000523337 215 2 54 3
                              10000523337 217 3 55 3
                              10000523337 215 4 56 3
                              10000523337 215 4 57 3
                              10000523337 216 4 58 3
                              10000523337 215 4 59 3
                              10000523337 215 4 60 3
                              10000525461 255 4 57 3
                              10000525461 370 4 58 3
                              end
                              ------------------ copy up to and including the previous line ------------------

                              Listed 20 out of 31308255 observations
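One possible way to get such weights, offered as an untested sketch rather than a definitive answer: extend the approach in #14 so that each selected person carries the inverse of the realised sampling fraction in their stratum. The variable names personid and Region come from the -dataex- example above; stratum, N_h, n_h and pw are hypothetical names introduced here. Because Region changes over time for some persons in the example, the sketch assumes each person is assigned to the Region recorded on their tagged observation, which may or may not be the assignment rule you want.

```stata
set seed 1234
gen double shuffle = runiform()
egen flag = tag(personid)

* assumption: a person's stratum is the Region on the tagged observation
by personid (flag), sort: gen int stratum = Region[_N]

* select roughly 10% of the tagged persons within each stratum, as in #14
by flag stratum (shuffle), sort: gen byte keeper = _n >= 0.9*_N
by personid (flag), sort: replace keeper = keeper[_N]

* persons per stratum in the full data and in the sample
egen long N_h = total(flag), by(stratum)
egen long n_h = total(flag & keeper), by(stratum)

* sampling weight = inverse of the realised sampling fraction
gen double pw = N_h / n_h

keep if keeper
```

The weight could then be passed as a probability weight, for example -regress y x [pweight = pw], vce(cluster personid)-, where y and x are placeholders for your own outcome and covariates. Every stratum keeps at least one person under this rule, so n_h is never zero.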
