I want to perform a cohort study in a large database (~8m observations) comparing drug A users (exposed) to non-drug A users (unexposed). First I need to create my cohort and I want to match the exposed and unexposed on two variables (year of birth and GP practice).
I started by combining those two variables into a single one using: -egen yob_GP = group(yob GP)-. This gave me a new variable (yob_GP) with a unique code for each combination of year of birth and GP practice.
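For reference, here is a minimal sketch of that step, followed by a count of exposed and unexposed patients within each stratum, to show which strata could supply 5 unexposed per exposed patient. The names n_exposed and n_unexposed are illustrative only and are not variables in my real data; this is not a matching solution, just a check on the strata.

```stata
* Build one stratum identifier per (yob, GP) combination.
* Skip this line if yob_GP already exists, as it does in the sample data.
egen yob_GP = group(yob GP)

* Count exposed and unexposed patients within each stratum.
* n_exposed and n_unexposed are hypothetical names used for illustration.
bysort yob_GP: egen n_exposed   = total(Drug_A == 1)
bysort yob_GP: egen n_unexposed = total(Drug_A == 0)

* Flag strata with at least 5 unexposed patients per exposed patient.
generate byte enough_controls = n_exposed > 0 & n_unexposed >= 5 * n_exposed
tabulate enough_controls
```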
My question: how do I match the exposed to the unexposed (1:5) on this unique code in a dataset this large? After matching on year of birth and GP practice, my plan is to assign each exposed patient's index_date to their matched unexposed patients, so that follow-up starts on the same date for both. I would then perform propensity score matching (1:3) on around 20 variables.
Sample below:
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input float ID str5 GP float yob int index_date byte Drug_A float yob_GP
 1 "a9914" 1960 17157 1 2
 2 "a9914" 1960     . 0 2
 3 "a9957" 1962     . 0 3
 4 "a9957" 1962     . 0 3
 5 "b6642" 1954 20009 1 1
 6 "b6642" 1954     . 0 1
 7 "c9884" 1970     . 0 5
 8 "c9884" 1970 19697 1 5
 9 "c9884" 1970     . 0 5
10 "c9914" 1969     . 0 4
11 "c9914" 1969 17490 1 4
12 "c9914" 1969     . 0 4
13 "c9914" 1969     . 0 4
14 "c9995" 1972     . 0 6
15 "c9995" 1972     . 0 6
16 "c9995" 1972     . 0 6
17 "c9995" 1972     . 0 6
18 "c9995" 1972     . 0 6
19 "c9995" 1972 17441 1 6
20 "c9995" 1972     . 0 6
end
format %td index_date