Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to merge in a dataset with duplicate ID's

    Hi everyone,

    This is my first post to Statalist, so please forgive any errors. I have a large, registry based dataset (1 million+ observations) that has all information regarding pregnancies for a large European country. Separately, I have a dataset including all patient visits to a health practitioner, including diagnoses. I would like to combine these two datasets to have all diagnoses linked to each mother in the first dataset.

    The problem that I am running into is that both datasets have multiple entries for the same ID number. The pregnancy registry has multiple pregnancies per woman, and the patient visit dataset includes all visits for each ID number. When I try to merge using 1:1, m:1, or 1:m using ID number I get a notification that the ID number does not uniquely identify observations (true for both datasets). Following the advice of others on Statalist, I used the joinby command, which seems to have linked my datasets properly, but now includes many more observations than I am looking for.

    I am hoping to keep each pregnancy as a separate observation, but in this new set combined with joinby, there are multiple observations per ID number that correspond to separate patient visits. Please see the example below:

    Dataset 1:

    ID Number Year (of birth) Var1 Var 2 Var 3 Var4 ....
    1 2008 1 1 1 2
    2 2009 3 1 2 1
    3 2008 2 1 3 1
    4 2009 1 2 2 2
    5 2007 2 3 1 1
    5 2010 3 4 1 1
    6 2007 2 2 2 2

    Dataset 2:
    ID Number Year (of diagnosis) Diagnosis
    1 2008 1
    1 2008 3
    1 2008 2
    2 2009 1
    2 2009 2
    3 2010 3
    3 2010 2

    I would like the data to look like this:

    ID Number Year (of birth) Var1 Var 2 Var 3 Var4 Diagnosis 1 Diagnosis 2 Diagnosis 3...
    1 2008 1 1 1 2 1 3 2
    2 2009 3 1 2 1 1 2 .
    3 2008 2 1 3 1 3 2 .
    4 2009 1 2 2 2 . . .
    5 2007 2 3 1 1 . . .
    5 2010 3 4 1 1 . . .
    6 2007 2 2 2 2 . . .


    Please let me know if you have any advice!

    Thank you, Nicole

  • #2
    So, your problem is that you now have every pregnancy a woman has ever had paired up with every visit she ever made. You need to narrow that down. I'm not sure what the best way is to do that given the data. You need to decide to what extent data proximity corresponds to relevant connection with a visit. This is very difficult to do with the dates being given only as years. If we had (estimated) start and (actual) end dates for the pregnancies, and exact dates for the visits, you would, I suppose, want to pair up each visit with the single pregnancy whose dates surround the visit date. (I may be supposing wrong: perhaps a visit that occurred within 30 or 90 days after the end of the pregnancy might be appropriate to pair with the pregnancy, especially if it had to do with postpartum checkup or treatment of postpartum difficulties.) But even if I'm supposing wrong, some sort of specification could be made with reasonable confidence, and that could be implemented using -rangejoin- (from SSC; requires also installing -rangestat-, also from SSC). Or you could just use -joinby- id- as you have already done and follow that with a -keep if- command that imposed the required level of date proximity.

    But with only years to go on, it's really hard. If a woman was pregnant from, say, October 2000 through July 2001, a visit in 2000 or 2001 might or might not be relatable to that pregnancy, depending on the visit month. Moreover, though it would happen infrequently, it is possible for a woman to have two pregnancies in the same year, or even more if the pregnancies are not carried to term. And in a data set the size you speak of, it almost surely will happen. So if a woman has 2 or more pregnancies in 2001, any particular visit in 2001 can only be relevant to one of those, I would suppose, and there is no way to tell which.

    In short, this is not a Stata problem. Your data is simply too coarse grained and is not fit for purpose.
    Last edited by Clyde Schechter; 27 Oct 2023, 11:53.

    Comment


    • #3
      (My comments crossed with Clyde's good advice. I did not attend to the analytic difficulties he describes, but they seem quite relevant to me. My comments therefore are strictly on the Stata aspects of what I *think* you are doing, and might be helpful as general advice about data manipulation.)

      You want your data to end up in what is known in Stata as the "wide" format, i.e., with repeated measures on a variable ("diagnosis" in your case) all appearing on a single observation. You can do this with the -reshape wide- command and possible some other manipulations. (See -help reshape- for some background.)

      You didn't show an example of what your data set is like after your -joinby- command, so giving you exact code with correct variable names would be more tedious and require more explanation. If you show us an example of what your data set is like after the -joinby-, giving you good help should be easy. (You'll want to show such an example data set using the -dataex- command, as prescribed in the StataList FAQ for new members.)

      (Also: If you had reshaped your diagnosis data set into the wide format before merging, that would have been another approach, which would likely have obviated your current problem.)

      All this being said: The "long" format, which you apparently (?) have now, is in general better for most (but not all) analyses in Stata. If you describe how you want to use the diagnosis variables in analyzing the data, that might lead to better advice re long vs. wide format.
      Last edited by Mike Lacy; 27 Oct 2023, 11:56.

      Comment


      • #4

        Hi Clyde and Mike, thanks so much for your help. I am going to work through transforming the data into a wide format, and see if that does the trick. If not, I'll come back on with the dataex examples. Really appreciate all the help!

        Comment

        Working...
        X