I have a dataset of exposed individuals and wish to randomly select up to 5 unexposed individuals for each exposed individual (without replacement), matched on sex, year of diagnosis, and age (+/- 5 years). Some example code follows. I'd appreciate any comments on how best to do this, but I'm particularly interested in the best way to reshape the data after the range join so that I have one observation per individual. There should be a variable exposed indicating exposure status and a variable pair_id indicating the matched sets.
My aim is to generate a matched cohort study; not a nested case-control study (i.e., risk set sampling).
I'm confident I can get from where I am to where I want to be, but I suspect there may be a better approach than the one I am taking.
Code:
use http://pauldickman.com/software/stata/exposed, clear

// For each observation in exposed, select all unexposed
// with same sex and year of diagnosis with age +/- 5 years
rangejoin age -5 5 using http://pauldickman.com/software/stata/unexposed, by(sex yydx)

// randomly select 5 unexposed if there are more than 5 matches
set seed 8675309
gen double shuffle = runiform()
by id (shuffle), sort: keep if _n <= 5
drop shuffle

// reshape from wide format to long format
rename age age1
rename status status1
rename dx dx1
rename exit exit1
rename age_U age2
rename status_U status2
rename dx_U dx2
rename exit_U exit2
reshape long age status dx exit, i(id id_U) j(exp)
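For concreteness, the layout I am aiming for (one observation per individual within each matched set, with exposed and pair_id) could probably be reached from the reshaped data along the following lines. This is only a rough sketch. It assumes the reshaped data are still in memory, that exp == 1 marks the exposed record and exp == 2 the unexposed record, and that using the exposed person's id as pair_id is acceptable.

```stata
* Sketch only: run after the reshape long step above.
gen byte exposed = (exp == 1)       // 1 = exposed, 0 = unexposed
clonevar pair_id = id               // label each matched set by the exposed person's id
replace id = id_U if exp == 2       // id now identifies the individual on each row
drop if missing(id)                 // an exposed person with no match leaves an empty unexposed row
drop id_U exp

* The exposed record is repeated once per matched unexposed; keep a single copy.
duplicates drop pair_id exposed id, force
sort pair_id exposed id
```

Note that an unexposed individual can still appear in more than one matched set here, so this sketch does not by itself give sampling without replacement.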