Merging individual-level administrative data sets

Tarjei W. Havneraas

Join Date: Nov 2016

Posts: 136
#1

Merging individual-level administrative data sets

04 Aug 2021, 05:14

I am combining multiple administrative data sets and would be happy to receive input on a sensible approach. I cannot give a data example as I am using sensitive data, but I made up two data sets to illustrate the issue.

Say we have two individual-level data sets with data registered monthly:
Prescription drugs with values A, B, C. n = 10000.

Emergency room contacts with contact reasons. n = 2000.

Here is an illustration of the data structures I want to combine:

There are
Multiple entries for ID with possibly multiple entries in the same month.

Some ID's may be registered in the prescription data base but not the emergency room data base, and vice versa.

My main issue is this: I don't see an obvious application of merge 1:1, m:1 or 1:m with either ID or ID YM.

Merge 1:1, 1:m, and m:1 all return

Code:

variable ID does not uniquely identify observations in the using data r(459)

Merging using ID YM will result in the same error and also make no sense as we would merge dates on prescriptions with dates for emergency room contacts.

From Stata documentation and posts -merge m:m- does not seem advisable. I am currently looking at -joinby- as an alternative to -merge-, but I am not certain whether this is correct either.
Tags: merge
Ken Chui

Join Date: Aug 2014

Posts: 1058
#2

04 Aug 2021, 05:35

I am not certain whether this is correct either.

The question does not really describe what is the end goal of your merging. There are occasions that merging by id is fine, date is fine, both is fine. What do you exactly plan to create or use the data for?

In addition, while I understand that the patient data privacy issue might have deterred you from using -dataex-, but your same data may not be helpful because currently no case is matched by (id and date) together.

Just for your and our convenience, please use codes to display sample data rather than image. I've created a shell for you to further edit your data. With actual readable data codes, your question may be more likely to get answered quickly.

Code:

clear input id year month str1 rx 1 2002 1 a 1 2002 1 a 2 2003 1 b 3 2003 6 a 4 2007 1 c 4 2009 1 a end save rx, replace clear input id year month str40 er_contact 2 2002 3 chest 2 2005 5 concussion 3 2005 4 trauma 3 2004 10 concussion 5 2007 1 "skin rash" 6 2009 2 trauma end save er, replace
Comment
Tarjei W. Havneraas

Join Date: Nov 2016

Posts: 136
#3

04 Aug 2021, 06:29

Thanks for your reply!

The end goal of the merging is to examine associations between prescriptions and ER contacts using some specification of a panel data model at monthly level. I may do the analyses on yearly level, too, but I assume that's pretty easy when I figure out the merging of the monthly level data sets.

Here is a edited version of the code you included in your reply.

Code:

clear input id year month str1 rx 1 2002 1 a 1 2002 1 a 2 2003 1 b 3 2003 6 a 4 2007 1 c 4 2009 1 a end gen my_rx = ym(year,month) format my_rx %tm save rx, replace clear input id year month str40 er_contact 2 2002 3 chest 2 2005 5 concussion 3 2005 4 trauma 3 2004 10 concussion 5 2007 1 "skin rash" 6 2009 2 trauma end gen my_er = ym(year,month) format my_er %tm save er, replace use "rx.dta", clear // master
Comment

Ken Chui

Join Date: Aug 2014
Posts: 1058

04 Aug 2021, 06:43

In that case -joinby- may work. However, by default joinby does an inner merge and only shows items that are matched. If you run a joinby now your data will be empty because no one with the same ID + date exists in both data set. In order to see all the cases, you can change that into a full outer join by adding the -unmatch(both)- option. I've also changed the date variable, they need to be the same for joinby to work:

Code:

clear
input id year month str1 rx
1 2002 1 a
1 2002 1 a
2 2003 1 b
3 2003 6 a
4 2007 1 c
4 2009 1 a
end
gen mdate = ym(year,month)
format mdate %tm
drop year month
save rx, replace

clear
input id year month str40 er_contact
2 2002 3 chest
2 2005 5 concussion
3 2005 4 trauma
3 2004 10 concussion
5 2007 1 "skin rash"
6 2009 2 trauma
end
gen mdate = ym(year,month)
format mdate %tm
drop year month
save er, replace

use rx, clear
joinby id mdate using er, unmatch(both)
gsort id mdate
list, sep(0)

Results showing no match:

Code:

     +------------------------------------------------------+
     | id   rx     mdate                _merge   er_contact |
     |------------------------------------------------------|
  1. |  1    a    2002m1   only in master data              |
  2. |  1    a    2002m1   only in master data              |
  3. |  2         2002m3    only in using data        chest |
  4. |  2    b    2003m1   only in master data              |
  5. |  2         2005m5    only in using data   concussion |
  6. |  3    a    2003m6   only in master data              |
  7. |  3        2004m10    only in using data   concussion |
  8. |  3         2005m4    only in using data       trauma |
  9. |  4    c    2007m1   only in master data              |
 10. |  4    a    2009m1   only in master data              |
 11. |  5         2007m1    only in using data    skin rash |
 12. |  6         2009m2    only in using data       trauma |
     +------------------------------------------------------+

Comment

Tarjei W. Havneraas

Join Date: Nov 2016

Posts: 136
#5

05 Aug 2021, 05:19

Thanks for your helpful reply. Using a full outer join seem to provide the data set I was interested in!

Observations tripled in the combined relative to the master data set. If there are any obvious reasons for increased observations, I'd be happy to know, however, I know that data management issues are complex and at some point considerations without the actual data becomes difficult.
Comment

Ken Chui

Join Date: Aug 2014
Posts: 1058

05 Aug 2021, 06:45

Originally posted by Tarjei W. Havneraas View Post

Thanks for your helpful reply. Using a full outer join seem to provide the data set I was interested in!

Observations tripled in the combined relative to the master data set. If there are any obvious reasons for increased observations, I'd be happy to know, however, I know that data management issues are complex and at some point considerations without the actual data becomes difficult.

For full outer join, if the two data are completely unrelated the sample size will be the sum of the two (n1 + n2, like the one in #4), if the two data are fully related (e.g. only one ID, from one single month), the sample size will be the product of the two (n1 * n2). So, triple is nothing surprising. For example, if a patient was prescribe 8 drugs in August 2009, and also visited ER with three conditions, that would net 8 * 3 = 24 cases in the merged file, because every single drug that month will be matched by every sing ER visit that month.

Example:

Code:

clear
input id year month str1 rx
1 2002 1 a
1 2002 1 b
1 2002 1 c
1 2002 1 d
1 2002 1 e
1 2002 1 f
end
gen mdate = ym(year,month)
format mdate %tm
drop year month
save rx, replace

clear
input id year month str40 er_contact
1 2002 1 chest
1 2002 1 concussion
1 2002 1 trauma
1 2002 1 fracture
1 2002 1 "skin rash"
1 2002 1 hypertension
end
gen mdate = ym(year,month)
format mdate %tm
drop year month
save er, replace

use rx, clear
joinby id mdate using er, unmatch(both)
gsort id mdate
list, sep(0)

Results:

Code:

     +-----------------------------------------------------------------+
     | id   rx    mdate                          _merge     er_contact |
     |-----------------------------------------------------------------|
  1. |  1    a   2002m1   both in master and using data       fracture |
  2. |  1    a   2002m1   both in master and using data   hypertension |
  3. |  1    a   2002m1   both in master and using data      skin rash |
  4. |  1    a   2002m1   both in master and using data          chest |
  5. |  1    a   2002m1   both in master and using data     concussion |
  6. |  1    a   2002m1   both in master and using data         trauma |
  7. |  1    b   2002m1   both in master and using data      skin rash |
  8. |  1    b   2002m1   both in master and using data       fracture |
  9. |  1    b   2002m1   both in master and using data         trauma |
 10. |  1    b   2002m1   both in master and using data   hypertension |
 11. |  1    b   2002m1   both in master and using data     concussion |
 12. |  1    b   2002m1   both in master and using data          chest |
 13. |  1    c   2002m1   both in master and using data     concussion |
 14. |  1    c   2002m1   both in master and using data      skin rash |
 15. |  1    c   2002m1   both in master and using data         trauma |
 16. |  1    c   2002m1   both in master and using data          chest |
 17. |  1    c   2002m1   both in master and using data   hypertension |
 18. |  1    c   2002m1   both in master and using data       fracture |
 19. |  1    d   2002m1   both in master and using data       fracture |
 20. |  1    d   2002m1   both in master and using data   hypertension |
 21. |  1    d   2002m1   both in master and using data     concussion |
 22. |  1    d   2002m1   both in master and using data      skin rash |
 23. |  1    d   2002m1   both in master and using data         trauma |
 24. |  1    d   2002m1   both in master and using data          chest |
 25. |  1    e   2002m1   both in master and using data      skin rash |
 26. |  1    e   2002m1   both in master and using data       fracture |
 27. |  1    e   2002m1   both in master and using data     concussion |
 28. |  1    e   2002m1   both in master and using data          chest |
 29. |  1    e   2002m1   both in master and using data   hypertension |
 30. |  1    e   2002m1   both in master and using data         trauma |
 31. |  1    f   2002m1   both in master and using data       fracture |
 32. |  1    f   2002m1   both in master and using data      skin rash |
 33. |  1    f   2002m1   both in master and using data     concussion |
 34. |  1    f   2002m1   both in master and using data          chest |
 35. |  1    f   2002m1   both in master and using data         trauma |
 36. |  1    f   2002m1   both in master and using data   hypertension |
     +-----------------------------------------------------------------+

Last edited by Ken Chui; 05 Aug 2021, 06:48.

Comment

Tarjei W. Havneraas

Join Date: Nov 2016

Posts: 136
#7

06 Aug 2021, 02:36

Ok, that makes sense. Thank you for your explanation and your time spent on this issue which was very helpful to me.
Comment

Announcement