Efficiently draw subset of data

Henry Strawforrd

Join Date: Sep 2021

Posts: 228
#1


16 Nov 2023, 15:16

I have a dataset that is too large to load completely in Stata, and I want to draw a subsample based on a condition: each person in the subsample should have at least one spell == X in a specific year.

That's why the normal

Code:

use vars if spell == X using data

doesn't work: I want all spells of each person who meets that condition, not just the spell for which the condition is true.

The easy but inefficient way would be to load the person IDs of people who satisfy the condition and then merge them onto the entire dataset. But that would require loading the whole dataset. An alternative would be to write a loop and merge the person IDs over slices of the original data. That could be feasible.
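To make the slice idea concrete, here is a minimal sketch on toy data. Everything in it is an assumption for illustration: the variable names person_id, year and spell, the condition (spell 3 in 2020), the tempfile names, and the slice size of 2.

```stata
* Toy data; names, values, and slice size are all made up for illustration.
clear
input long person_id int year byte spell
1 2019 1
1 2020 3
2 2019 2
2 2020 2
3 2020 3
3 2021 1
end
tempfile spells ids result
save `spells'

* IDs of people with at least one qualifying spell.
use person_id if spell == 3 & year == 2020 using `spells', clear
duplicates drop
save `ids'

* Number of observations in the file, read without loading the data.
describe using `spells', short
local N = r(N)
local step = 2

* Start from an empty result, then add the matching rows of each slice.
clear
save `result', emptyok
forvalues a = 1(`step')`N' {
    local b = min(`a' + `step' - 1, `N')
    use in `a'/`b' using `spells', clear
    merge m:1 person_id using `ids', keep(match) nogenerate
    append using `result'
    save `result', replace
}
sort person_id year
list, sepby(person_id)
```

Only one slice plus the accumulated matches is in memory at any time, at the cost of reading the file once per slice.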

Anyway, are there more clever ways that I am missing to deal with this?
Clyde Schechter

Join Date: Apr 2014

Posts: 30150
#2

16 Nov 2023, 16:23

The easy but inefficient way would be to load the person IDs of people who satisfy the condition and then merge them onto the entire dataset. But that would require loading the whole dataset.

No, that doesn't require loading the whole data set.

Code:

use person_id if spell == X using dataset, clear
duplicates drop
merge 1:m person_id using dataset, keep(match) nogenerate

will get exactly what you are asking for and never load anything that isn't part of the end result.
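
As a self-contained illustration of the same pattern, here is a sketch on toy data. The variable names, the values, the tempfile, and the condition (spell 3 in 2020, standing in for "spell == X in a specific year") are assumptions, not the real dataset.

```stata
* Toy spell data; all names and values here are made up for illustration.
clear
input long person_id int year byte spell
1 2019 1
1 2020 3
2 2019 2
2 2020 2
3 2020 3
3 2021 1
end
tempfile spells
save `spells'

* Step 1: read only the IDs from rows that meet the condition.
use person_id if spell == 3 & year == 2020 using `spells', clear
duplicates drop

* Step 2: pull every spell of those people from the file on disk.
merge 1:m person_id using `spells', keep(match) nogenerate
sort person_id year
list, sepby(person_id)
```

Persons 1 and 3 each have a spell 3 in 2020, so all of their spells are kept, while person 2 is never loaded in the merge result.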
Henry Strawforrd

Join Date: Sep 2021

Posts: 228
#3

16 Nov 2023, 16:27

Oh okay, that's great, thanks!