How to take a random sample of panel data and keep all person-year-observations for a particular ID

Marco Kuehne

Join Date: Feb 2019

Posts: 32
#1


15 Jul 2019, 10:02

I am trying to take a random sample from a huge unbalanced panel dataset. For the MWE data below, I would like to randomly choose either 513 or 514. But whichever ID is picked at random, all yearly observations for that person should be kept. I call it a 'random panel sample'. I haven't found anything in the -sample- documentation.

Code:

clear
input year pid var
2003 513 1500
2004 513 1550
2005 513 1500
2006 513 1600
2003 514 1600
2004 514 1600
2005 514 1700
2006 514 1800
end

I tried to combine -sample- with -bysort-, like

Code:

bysort pid (year): sample 1

but it always drops all observations for me. Thank you very much.

Clyde Schechter

Join Date: Apr 2014
Posts: 30101
#2

15 Jul 2019, 10:48

Code:

tempfile holding
save `holding'

keep pid
duplicates drop

set seed 1234
sample 1, count

merge 1:m pid using `holding', assert(match using) keep(match) nogenerate
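
A note for readers adapting the code above: as written, -sample 1, count- keeps exactly one pid. The following is a minimal sketch, assuming you instead want roughly 10% of the persons (the value 10 is illustrative and not from the original post). It drops the -count- option so that -sample- treats the number as a percentage. With only a handful of persons, as in the MWE, a percentage this small may leave no persons at all.

```stata
* Sketch: draw about 10% of persons, keeping every year of each selected person
* (assumes the panel is in memory and the person identifier is pid)
tempfile holding
save `holding'

keep pid
duplicates drop              // one row per person

set seed 1234                // any seed; fixed only for reproducibility
sample 10                    // without -count-, 10 means 10 percent of the rows

* bring back all person-year rows of the selected persons
merge 1:m pid using `holding', assert(match using) keep(match) nogenerate
```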

Comment

Marco Kuehne

Join Date: Feb 2019

Posts: 32
#3

15 Jul 2019, 11:00

That is absolutely perfect! And I'm kind of glad that I didn't just miss one simple command. This will be very valuable for my analysis.
Comment
Mads Moring

Join Date: Apr 2017

Posts: 44
#4

01 Sep 2020, 07:26

Hi guys,

Is it possible to put this code inside a loop, for example if you want 10 different random samples?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#5

01 Sep 2020, 12:02

Yes, but in what sense do you "want perhaps 10 different random samples"? Do you want to save them in 10 separate data sets, or have them all appended in one data set? Or perhaps you want to generate them one at a time and do some analysis on each, and then do something with those analytic results? Be more specific.
Comment
Mads Moring

Join Date: Apr 2017

Posts: 44
#6

02 Sep 2020, 02:16

You're absolutely right. Let me try to be clearer.

I too have panel data and want to investigate change over time in an outcome. I'm planning to estimate what Singer & Willett (2003) call a "multilevel model for change". Thus, I first want to eyeball whether a linear model seems to be the right functional form at the individual level.

What I want is 10 graphs of 10 different random subsamples. I can run the code 10 times with a different seed each time, but perhaps a loop could do the job more efficiently, especially if one wanted 100 graphs.

Code:

forvalues i = 1/10 {
    use "data.dta", clear
    tempfile holding
    save `holding'
    keep ID
    duplicates drop
    set seed 1234`i' // This seed needs to be changed every time the code is run to get a different "random" subsample
    sample 25, count
    merge 1:m ID using `holding', assert(match using) keep(match) nogenerate
    twoway (scatter outcome time) (lfit outcome time), by(ID)
    graph save "Graph`i'"
}

Is this how you would solve it too, Clyde?

Last edited by Mads Moring; 02 Sep 2020, 02:32.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35699
#7

02 Sep 2020, 07:07

You don't need any file choreography here. You can do it in place.

tag each panel once

foreach iteration {
select a random sample of the tagged values
spread selection to other observations in panel
do whatever to selection
}

Here is a start on a framework. Concretely, the Grunfeld data have 10 panels (companies); I select 7 of them in each of 10 samples.

Code:

webuse grunfeld, clear

egen tag = tag(company)

set seed 2803

gen shuffle = .  
gen sampled = . 

forval j = 1/10 { 
    qui replace shuffle = runiform()
    sort tag shuffle 

    * the 10 tagged values are at the end; we select 7 of them
    qui replace sampled = inrange(_n, _N-6, _N)

    * spread selection to each panel 
    qui bysort company (sampled) : replace sampled = sampled[_N]

    levelsof company if sampled 

    * here is where you do something with the sample
}
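
As an illustration of the "do something with the sample" step, here is a minimal sketch that draws and saves one graph per random sample, in the spirit of the request in #6. It reuses the framework above; the variables invest and year come from the Grunfeld data, while the graph file names sample1 to sample10 are hypothetical.

```stata
* Sketch: one saved graph per random sample of 7 companies
* (file names sample1 ... sample10 are hypothetical)
webuse grunfeld, clear
egen tag = tag(company)
set seed 2803
gen shuffle = .
gen sampled = .

forval j = 1/10 {
    qui replace shuffle = runiform()
    sort tag shuffle

    * the tagged rows sort last; the last 7 of them are selected
    qui replace sampled = inrange(_n, _N-6, _N)

    * spread selection to every row of a selected company
    qui bysort company (sampled) : replace sampled = sampled[_N]

    * scatter plus linear fit, one panel per sampled company
    twoway (scatter invest year) (lfit invest year) if sampled, by(company)
    graph save "sample`j'", replace
}
```

Because the seed is set once before the loop, each iteration continues the same random-number stream and so produces a different sample, while the whole run stays reproducible.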

Comment

Jabir Rahman

Join Date: Dec 2020

Posts: 1
#8

27 Dec 2020, 21:10

Hello everyone, I have a problem similar to Marco's, except that my data are divided into 4 groups and I would like to randomly select 25% of the data from each group. For each ID selected within a group, I need to keep all rows for that ID (i.e. the code should randomly select IDs within a group but keep all information about each selected ID). Your help is highly appreciated.
Comment
Jefferson Pereira

Join Date: Dec 2020

Posts: 23
#9

24 Jan 2021, 17:06

Hello, Clyde Schechter!

Can I use the same command you provided in #8 for the following situation? I want to select a sample from balanced panel data in a way that keeps the sample representative with respect to the variable "region" and also keeps the panel balanced, with information on each person id for every year of the panel.

Thanks in advance!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#10

24 Jan 2021, 18:00

I'm not sure what you are referring to. I did not write #8, and there is no code in it either. Nick Cox's code in #7 will fulfill your need, though if you are only looking for a single random sample there is no need for the -forvalues j = 1/10- loop.
Comment
Jefferson Pereira

Join Date: Dec 2020

Posts: 23
#11

24 Jan 2021, 18:19

I'm sorry about that, Clyde Schechter. I wrote the wrong number; I meant #2. I just tried this code:

Code:

local fraction = 0.10
local first = 2010
gen rand = runiform()
bysort year (rand): gen byte flag = (_n <= (`fraction' * _N)) & (year == `first')
egen keeper = max(flag), by(id)

But what I need is a random sample stratified by the variable region, such that every id chosen by this process is kept for the whole period, preserving the panel structure.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#12

24 Jan 2021, 18:26

The code you show in #11 will choose a 10% random sample, and any id that is included in the sample will have all of its observations included. The code in #2 will choose a single random pid, because -sample 1, count- keeps exactly one observation (without the -count- option it would draw a 1% sample), and it also includes all observations for any pid that is included. As you can see in this thread, there are many ways of going about this.

However, this is a simple random sample of id's, not a stratified sample. You don't say how you want to stratify, so I can't guide you in how to go about that.
Comment
Jefferson Pereira

Join Date: Dec 2020

Posts: 23
#13

24 Jan 2021, 19:11

Clyde Schechter, please see if this makes my issue clearer:

I have a variable called region in my data set that is divided into 370 labor market areas (LMA). As each LMA has its own characteristics, I want the sample to remain representative of each of the 370 LMAs. I want to select 10% of the individuals in each LMA so that the information for each selected individual remains available from the first to the last year, since my data are a panel balanced on the individual identifier. Would you know how I can do that?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#14

24 Jan 2021, 21:31

This is a slight variant on the approach above. There are several ways to do this. Here's one:

Code:

set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
gen double shuffle = runiform()
egen flag = tag(individual)
* strict inequality so that, e.g., 10 individuals in an LMA yield 1 selected, not 2
by flag LMA (shuffle), sort: gen byte keeper = _n > 0.9*_N
by individual (flag), sort: replace keeper = keeper[_N]
keep if keeper

This is a variant of Nick Cox's approach in #7.

In the result, each LMA will be represented by approximately 10% of the individuals within it, and every individual selected will retain all of his/her observations.

Note: Untested because no sample data was provided. Beware of typos or other errors.
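
Since the code is untested, one way to check it is to compute the share of individuals selected in each LMA. The following is a minimal sketch, assuming it is run after keeper has been spread to all rows of each individual but before -keep if keeper-; the names n_total, n_kept and share_kept are hypothetical, while flag, keeper, individual and LMA are those used in the code above.

```stata
* Sketch: share of individuals selected within each LMA
* (run before -keep if keeper-; flag marks one row per individual)
egen long n_total = total(flag), by(LMA)             // individuals per LMA
egen long n_kept  = total(flag & keeper), by(LMA)    // selected individuals per LMA
gen double share_kept = n_kept / n_total

* one row per individual is enough to inspect the distribution of shares
summarize share_kept if flag
```

Shares should cluster around 0.10, with larger values in LMAs that contain only a few individuals, because at least one individual is always selected.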

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment
Jefferson Pereira

Join Date: Dec 2020

Posts: 23
#15

25 Jan 2021, 18:40

Clyde Schechter, your code in #14 does exactly what I asked for, but I didn't express myself well enough to get what I'm actually looking for. I need to select, by person id, a 10% random sample stratified by the 370 categories that constitute the Region variable. From that I expect to generate weights so that I can afterwards run the regression using these weights. I'd be very grateful for any help with this!

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double personid int Region byte(Education age Regionsize)
10000521121 250 4 53 3
10000521121 250 4 54 3
10000521121 215 4 55 3
10000521121 251 4 56 1
10000521121 255 4 57 3
10000521121 255 4 58 2
10000521121 255 4 59 3
10000521121 255 3 60 1
10000521121 255 3 61 3
10000523337 215 4 52 3
10000523337 215 4 53 3
10000523337 215 2 54 3
10000523337 217 3 55 3
10000523337 215 4 56 3
10000523337 215 4 57 3
10000523337 216 4 58 3
10000523337 215 4 59 3
10000523337 215 4 60 3
10000525461 255 4 57 3
10000525461 370 4 58 3
end

------------------ copy up to and including the previous line ------------------

Listed 20 out of 31308255 observations
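
The thread ends without a reply to #15. As a rough sketch only, and not an answer from any of the posters, one common way to obtain weights from a stratified sample is the inverse of the selection probability, N_h / n_h, where N_h is the number of persons in stratum h and n_h the number selected. The code below follows the approach of #14 with the variable names from the example data. It rests on an assumption that must be checked against the real data: because Region changes over time for a person in the example, each person is assigned to the Region of his/her first row in the current sort order. The names obsno, stratum, N_h, n_h and wt are hypothetical.

```stata
* Sketch: 10% stratified sample of persons with inverse-probability weights
* Assumption: a person's stratum is the Region in his/her first row
set seed 1234
gen long obsno = _n
bysort personid (obsno): gen int stratum = Region[1]
egen byte flag = tag(personid)                 // one row per person
gen double shuffle = runiform()

* select about 10% of the tagged persons within each stratum
bysort flag stratum (shuffle): gen byte keeper = flag & (_n > 0.9*_N)

* persons available (N_h) and persons selected (n_h) per stratum
egen long N_h = total(flag), by(stratum)
egen long n_h = total(keeper), by(stratum)

* spread selection to all rows of a selected person, then weight by N_h / n_h
bysort personid (flag): replace keeper = keeper[_N]
gen double wt = N_h / n_h if keeper
keep if keeper
```

The resulting wt could then be supplied as a probability weight, e.g. [pweight = wt], in an estimation command that supports it.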
