Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging individual-level administrative data sets

    I am combining multiple administrative data sets and would be happy to receive input on a sensible approach. I cannot give a data example as I am using sensitive data, but I made up two data sets to illustrate the issue.

    Say we have two individual-level data sets with data registered monthly:
    1. Prescription drugs with values A, B, C. n = 10000.
    2. Emergency room contacts with contact reasons. n = 2000.
    Here is an illustration of the data structures I want to combine:

    Click image for larger version

Name:	merge2.png
Views:	1
Size:	35.4 KB
ID:	1621925


    There are
    1. Multiple entries for ID with possibly multiple entries in the same month.
    2. Some ID's may be registered in the prescription data base but not the emergency room data base, and vice versa.
    My main issue is this: I don't see an obvious application of merge 1:1, m:1 or 1:m with either ID or ID YM.

    Merge 1:1, 1:m, and m:1 all return
    Code:
    variable ID does not uniquely identify observations in the using data
    r(459)
    Merging using ID YM will result in the same error and also make no sense as we would merge dates on prescriptions with dates for emergency room contacts.

    From Stata documentation and posts -merge m:m- does not seem advisable. I am currently looking at -joinby- as an alternative to -merge-, but I am not certain whether this is correct either.

  • #2
    I am not certain whether this is correct either.
    The question does not really describe what is the end goal of your merging. There are occasions that merging by id is fine, date is fine, both is fine. What do you exactly plan to create or use the data for?

    In addition, while I understand that the patient data privacy issue might have deterred you from using -dataex-, but your same data may not be helpful because currently no case is matched by (id and date) together.

    Just for your and our convenience, please use codes to display sample data rather than image. I've created a shell for you to further edit your data. With actual readable data codes, your question may be more likely to get answered quickly.

    Code:
    clear
    input id year month str1 rx
    1 2002 1 a
    1 2002 1 a
    2 2003 1 b
    3 2003 6 a
    4 2007 1 c
    4 2009 1 a
    end
    save rx, replace
    
    clear
    input id year month str40 er_contact
    2 2002 3 chest
    2 2005 5 concussion
    3 2005 4 trauma
    3 2004 10 concussion
    5 2007 1 "skin rash"
    6 2009 2 trauma
    end
    save er, replace

    Comment


    • #3
      Thanks for your reply!

      The end goal of the merging is to examine associations between prescriptions and ER contacts using some specification of a panel data model at monthly level. I may do the analyses on yearly level, too, but I assume that's pretty easy when I figure out the merging of the monthly level data sets.

      Here is a edited version of the code you included in your reply.
      Code:
      clear
      input id year month str1 rx
      1 2002 1 a
      1 2002 1 a
      2 2003 1 b
      3 2003 6 a
      4 2007 1 c
      4 2009 1 a
      end
      gen my_rx = ym(year,month)
      format my_rx %tm
      save rx, replace
      
      clear
      input id year month str40 er_contact
      2 2002 3 chest
      2 2005 5 concussion
      3 2005 4 trauma
      3 2004 10 concussion
      5 2007 1 "skin rash"
      6 2009 2 trauma
      end
      gen my_er = ym(year,month)
      format my_er %tm
      save er, replace
      
      use "rx.dta", clear // master

      Comment


      • #4
        In that case -joinby- may work. However, by default joinby does an inner merge and only shows items that are matched. If you run a joinby now your data will be empty because no one with the same ID + date exists in both data set. In order to see all the cases, you can change that into a full outer join by adding the -unmatch(both)- option. I've also changed the date variable, they need to be the same for joinby to work:

        Code:
        clear
        input id year month str1 rx
        1 2002 1 a
        1 2002 1 a
        2 2003 1 b
        3 2003 6 a
        4 2007 1 c
        4 2009 1 a
        end
        gen mdate = ym(year,month)
        format mdate %tm
        drop year month
        save rx, replace
        
        clear
        input id year month str40 er_contact
        2 2002 3 chest
        2 2005 5 concussion
        3 2005 4 trauma
        3 2004 10 concussion
        5 2007 1 "skin rash"
        6 2009 2 trauma
        end
        gen mdate = ym(year,month)
        format mdate %tm
        drop year month
        save er, replace
        
        use rx, clear
        joinby id mdate using er, unmatch(both)
        gsort id mdate
        list, sep(0)
        Results showing no match:

        Code:
             +------------------------------------------------------+
             | id   rx     mdate                _merge   er_contact |
             |------------------------------------------------------|
          1. |  1    a    2002m1   only in master data              |
          2. |  1    a    2002m1   only in master data              |
          3. |  2         2002m3    only in using data        chest |
          4. |  2    b    2003m1   only in master data              |
          5. |  2         2005m5    only in using data   concussion |
          6. |  3    a    2003m6   only in master data              |
          7. |  3        2004m10    only in using data   concussion |
          8. |  3         2005m4    only in using data       trauma |
          9. |  4    c    2007m1   only in master data              |
         10. |  4    a    2009m1   only in master data              |
         11. |  5         2007m1    only in using data    skin rash |
         12. |  6         2009m2    only in using data       trauma |
             +------------------------------------------------------+

        Comment


        • #5
          Thanks for your helpful reply. Using a full outer join seem to provide the data set I was interested in!

          Observations tripled in the combined relative to the master data set. If there are any obvious reasons for increased observations, I'd be happy to know, however, I know that data management issues are complex and at some point considerations without the actual data becomes difficult.

          Comment


          • #6
            Originally posted by Tarjei W. Havneraas View Post
            Thanks for your helpful reply. Using a full outer join seem to provide the data set I was interested in!

            Observations tripled in the combined relative to the master data set. If there are any obvious reasons for increased observations, I'd be happy to know, however, I know that data management issues are complex and at some point considerations without the actual data becomes difficult.
            For full outer join, if the two data are completely unrelated the sample size will be the sum of the two (n1 + n2, like the one in #4), if the two data are fully related (e.g. only one ID, from one single month), the sample size will be the product of the two (n1 * n2). So, triple is nothing surprising. For example, if a patient was prescribe 8 drugs in August 2009, and also visited ER with three conditions, that would net 8 * 3 = 24 cases in the merged file, because every single drug that month will be matched by every sing ER visit that month.

            Example:
            Code:
            clear
            input id year month str1 rx
            1 2002 1 a
            1 2002 1 b
            1 2002 1 c
            1 2002 1 d
            1 2002 1 e
            1 2002 1 f
            end
            gen mdate = ym(year,month)
            format mdate %tm
            drop year month
            save rx, replace
            
            clear
            input id year month str40 er_contact
            1 2002 1 chest
            1 2002 1 concussion
            1 2002 1 trauma
            1 2002 1 fracture
            1 2002 1 "skin rash"
            1 2002 1 hypertension
            end
            gen mdate = ym(year,month)
            format mdate %tm
            drop year month
            save er, replace
            
            use rx, clear
            joinby id mdate using er, unmatch(both)
            gsort id mdate
            list, sep(0)
            Results:
            Code:
                 +-----------------------------------------------------------------+
                 | id   rx    mdate                          _merge     er_contact |
                 |-----------------------------------------------------------------|
              1. |  1    a   2002m1   both in master and using data       fracture |
              2. |  1    a   2002m1   both in master and using data   hypertension |
              3. |  1    a   2002m1   both in master and using data      skin rash |
              4. |  1    a   2002m1   both in master and using data          chest |
              5. |  1    a   2002m1   both in master and using data     concussion |
              6. |  1    a   2002m1   both in master and using data         trauma |
              7. |  1    b   2002m1   both in master and using data      skin rash |
              8. |  1    b   2002m1   both in master and using data       fracture |
              9. |  1    b   2002m1   both in master and using data         trauma |
             10. |  1    b   2002m1   both in master and using data   hypertension |
             11. |  1    b   2002m1   both in master and using data     concussion |
             12. |  1    b   2002m1   both in master and using data          chest |
             13. |  1    c   2002m1   both in master and using data     concussion |
             14. |  1    c   2002m1   both in master and using data      skin rash |
             15. |  1    c   2002m1   both in master and using data         trauma |
             16. |  1    c   2002m1   both in master and using data          chest |
             17. |  1    c   2002m1   both in master and using data   hypertension |
             18. |  1    c   2002m1   both in master and using data       fracture |
             19. |  1    d   2002m1   both in master and using data       fracture |
             20. |  1    d   2002m1   both in master and using data   hypertension |
             21. |  1    d   2002m1   both in master and using data     concussion |
             22. |  1    d   2002m1   both in master and using data      skin rash |
             23. |  1    d   2002m1   both in master and using data         trauma |
             24. |  1    d   2002m1   both in master and using data          chest |
             25. |  1    e   2002m1   both in master and using data      skin rash |
             26. |  1    e   2002m1   both in master and using data       fracture |
             27. |  1    e   2002m1   both in master and using data     concussion |
             28. |  1    e   2002m1   both in master and using data          chest |
             29. |  1    e   2002m1   both in master and using data   hypertension |
             30. |  1    e   2002m1   both in master and using data         trauma |
             31. |  1    f   2002m1   both in master and using data       fracture |
             32. |  1    f   2002m1   both in master and using data      skin rash |
             33. |  1    f   2002m1   both in master and using data     concussion |
             34. |  1    f   2002m1   both in master and using data          chest |
             35. |  1    f   2002m1   both in master and using data         trauma |
             36. |  1    f   2002m1   both in master and using data   hypertension |
                 +-----------------------------------------------------------------+
            Last edited by Ken Chui; 05 Aug 2021, 06:48.

            Comment


            • #7
              Ok, that makes sense. Thank you for your explanation and your time spent on this issue which was very helpful to me.

              Comment

              Working...
              X