Creating a random subsample out of very large panel datasets

Anny Yu

Join Date: Dec 2017

Posts: 17
#1


11 Dec 2017, 15:23

I would greatly appreciate your help as I am encountering some issues with panel datasets.

I have very large panel datasets (15-year observation period, each year with around 10 million observations). I would like to draw a 10% random subsample out of the entire sample. However, if I merge all the files together and then assign a random number by unit of analysis, I'm afraid Stata cannot smoothly process such a large number of observations.

Is there a proper way to randomly select part of the sample year by year before merging? I don't want to follow just one entry cohort, so in each subsequent year I'd like to add a random 10% of that year's new entrants to the 10% subsample carried over from the previous year. I am not sure this is doable, though, because a year-by-year draw might pick up people who are not actually new to the dataset, just people who were not selected in the previous year.

Do you know how I can solve this problem? Many, many thanks indeed!
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

11 Dec 2017, 15:51

Welcome to Statalist.

If I understand your needs correctly (and I am not certain I do), my approach would be as follows. For each of your input datasets, create a "selection" dataset consisting of just the panel and wave identifiers. Merge those selection datasets, which are much smaller because they do not include all the variables, and use the merged selection dataset to select your sample, yielding a dataset with the panel identifiers for your sample. Then merge those panel identifiers back to the input datasets to select the full observations for just the panel members selected for the sample.
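
The selection step described above might look roughly like the sketch below. It is only an illustration under stated assumptions: the variable names id, year, and income, the two toy waves, and the tempfile names are all hypothetical stand-ins for the real yearly files. The sketch stacks the slim id-year files with append rather than merge, and it keeps each panel member with probability 0.10, so the sample is approximately (not exactly) 10% of panel members.

```stata
clear all
set seed 2017
tempfile wave2001 wave2002 chosen

* Toy stand-ins for two yearly files (hypothetical names and sizes)
forvalues t = 2001/2002 {
    clear
    set obs 500
    generate long id = _n + 100*(`t' - 2001)   // some ids appear in both waves
    generate int year = `t'
    generate double income = rnormal()
    save "`wave`t''"
}

* Selection dataset: only the panel and wave identifiers from each file
clear
forvalues t = 2001/2002 {
    append using "`wave`t''", keep(id year)
}

* One random number per panel member, copied to all of that member's waves
bysort id (year): generate double u = runiform() if _n == 1
bysort id (year): replace u = u[1]

* Keep roughly 10% of panel members, then reduce to one row per id
keep if u < 0.10
keep id
duplicates drop
save "`chosen'"
```

The saved list of identifiers is what would then be merged back to each full yearly file, for example with merge m:1 id using and the keep(match) option.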
Richard Williams

Join Date: Apr 2014

Posts: 4987
#3

11 Dec 2017, 16:02

Does this do what you want?

https://www.stata.com/support/faqs/d...ling-clusters/

The sample2 command it refers to (findit sample2) may or may not be easier:

net describe dm46, from(http://www.stata.com/stb/stb37)

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Anny Yu

Join Date: Dec 2017

Posts: 17
#4

15 Dec 2017, 02:28

Thank you very much for your help!

I still have one question. According to your suggestions, I still have to merge the data files from all years before randomizing. Even if I keep just the key variables (id & year) from each year for merging, I'll have 12 million observations per year, and over 150 million observations after merging. Can Stata handle that?
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

15 Dec 2017, 06:07

See help memory for advice on Stata's capabilities. The answer to your question depends on the version of Stata you are running and the characteristics of the computer you are running it on.
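
As a rough back-of-the-envelope check, the size of an identifiers-only dataset can be estimated from Stata's storage types. The sketch below assumes (this is not stated in the thread) that the panel identifier fits in a long (4 bytes) and the year in an int (2 bytes), and it uses the figure of 150 million observations from post #4.

```stata
* Assumed storage: id as long (4 bytes), year as int (2 bytes)
local n_obs = 150e6
local bytes_per_obs = 4 + 2

* Approximate data size in gigabytes, ignoring Stata's own overhead
display %9.2f `n_obs' * `bytes_per_obs' / 1e9 " GB"
```

Under those assumptions the data themselves come to about 0.9 GB. A string identifier or a double would take more space per observation, so the real figure depends on how the variables are stored.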

On reflection, the technique I described in post #2 is unduly complicated. In post #1 you describe a 10% random subsample of the entire sample. To do that, you need only build a list of distinct panel identifiers; you do not need the wave. To build the list, select the identifiers in the first wave, merge them with the identifiers from the second wave, merge the result with the identifiers from the third wave, and so forth. Then select 10% from that list.
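
A minimal sketch of this simpler approach follows. Everything named here is an assumption made for illustration: the variables id, year, and y, the three toy waves, and the tempfile names stand in for the real yearly files. The sketch uses the sample command for the 10% draw, which is one possible way to select from the list, and adds a final loop that keeps the full observations for the selected panel members.

```stata
clear all
set seed 12345
tempfile wave2001 wave2002 wave2003 ids sample panel

* Toy stand-ins for the yearly files (hypothetical names and sizes)
forvalues t = 2001/2003 {
    clear
    set obs 1000
    generate long id = _n + 200*(`t' - 2001)   // overlapping ids plus new entrants
    generate int year = `t'
    generate double y = rnormal()
    save "`wave`t''"
}

* Step 1: build the list of distinct panel identifiers, one wave at a time
forvalues t = 2001/2003 {
    use id using "`wave`t''", clear
    duplicates drop                              // one row per id within the wave
    if `t' > 2001 {
        merge 1:1 id using "`ids'", nogenerate   // union with the ids found so far
    }
    save "`ids'", replace
}

* Step 2: draw 10% of the panel members
use "`ids'", clear
sort id                                          // fixed order, so the seed reproduces the draw
sample 10
save "`sample'"

* Step 3: keep the full observations for the selected members, wave by wave
forvalues t = 2001/2003 {
    use "`wave`t''", clear
    merge m:1 id using "`sample'", keep(match) nogenerate
    if `t' > 2001 {
        append using "`panel'"
    }
    save "`panel'", replace
}
```

Because only one wave plus the identifier list is in memory at any time, the full 15-year file is never loaded at once. Selecting on the identifier also means a panel member is either in the subsample for every wave in which they appear or not in it at all, which avoids the new-entrant problem raised in post #1.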
Anny Yu

Join Date: Dec 2017

Posts: 17
#6

16 Dec 2017, 05:26

It is clear. Thank you very much for your help!