Hi,
I have two large datasets of diabetes patients receiving care to merge: a master dataset with 600,000 observations and a using dataset with 700,000 observations. The merge command can't be used here because the using data has no unique ID that identifies the observations in the master data. Instead, I have a few variables to match on: DOB, age, sex, ethnicity, facility, and date of diagnosis. The purpose of the merge is to bring the smoking variable from the using dataset into the master dataset. The master data are longitudinal, while the using data are registry data in which the value of each variable is entered into the system only once.
I have tried the reclink command, but it found only 12,000 perfect matches and then started going through 647615 observations to assess fuzzy matches ("each .=5% complete"). It ran the whole night without printing a single dot. Can this command handle datasets this large, so that I just need to wait longer, or am I doing something wrong?
Here is my command:
reclink demo_facility demo_dob demo_age demo_sex demo_ethnic diab_diagdate using ndr_general_1_1_smoking_20170313, idm(idmaster) idu(idusing) wmatch(2 5 5 2 8 4) _merge(_merge) orblock(demo_facility demo_dob demo_age demo_sex demo_ethnic diab_diagdate) gen(myscore) minscore(0.8)
I have also tried matchit, but it says that I have specified too many variables, and I did not quite understand how to use that command.
I have read through all the posts on Stata reclink and matchit that I could find, but I still have no clue how to proceed. I would really appreciate your comments; I am very much a beginner in Stata.
Thanks
Eliana