Merging m:m - it seems to work for me. Why critise? Appreciate your feedback

Denise Vella

Join Date: Aug 2022

Posts: 187
#1

Merging m:m - it seems to work for me. Why critise? Appreciate your feedback

15 Dec 2022, 13:46

As you can see in my previous posts. I currently have one huge dataset called procedure.

This shows each time a patient had an operation indicated with procedureno.
If the patient had a R or subsequently L operation it is coded with a unique indexcode (but may have the same procedureno as it may have been done in the same sitting.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float(procedureno indexcode new_dvt new_mi pe sidep genderp date) 200 15 1 0 0 2 1 22798 888 9 . 1 1 1 1 22646 999 13 1 1 . 2 2 22706 999 15 1 0 . 2 2 22767 999 16 1 0 1 2 1 22798 999 10 1 1 1 1 2 22678 end format %td date label values sidep side label def side 1 "R", modify label def side 2 "L", modify label values genderp gender label def gender 1 "M", modify label def gender 2 "F", modify

I then have a hospital dataset. Which shows each time the patient came to hospital for any reason. //The only unique identifier between hospital dataset & procedure dataset is the procedureno //

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float(procedureno ssi dehiscence pe) str3 uniqueno float(admissiondate gender side) 888 0 1 1 "1A" 22645 1 1 888 1 1 1 "2A" 22690 1 1 999 1 1 1 "3A" 22705 2 1 999 1 0 1 "4A" 22737 2 2 end format %td admissiondate label values gender gender label def gender 1 "M", modify label def gender 2 "F", modify label values side side label def side 1 "R", modify label def side 2 "L", modify

A.
Please bear in mind I have already screened the hospital dataset and am working with a subset of patients which are duplicates - trying to identify which to keep.

Now as the proceduredata set is very big I would like to merge using the hospitaldataset as the master dataset & procedure dataset as the using dataset.
Unique identifier to merge - will be procedureno

And from the proceduredata I onyl would like to keep genderp (gender) , sidep (side of surgery), date which is the operation date.

Thus, knowing there are duplicates in my hospitaldataset + duplicates in my proceduredataset (as patients who have the same procedureno may have had a R + L operation) I need to merge using the following code:

merge procedureno using "file location", keep (genderp sidep date)

Result:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float(procedureno ssi dehiscence pe) str3 uniqueno float(admissiondate gender side sidep genderp date) byte _merge 888 0 1 1 "1A" 22645 1 1 1 1 22646 3 888 1 1 1 "2A" 22690 1 1 1 1 22646 3 999 1 1 1 "3A" 22705 2 1 1 2 22678 3 999 1 0 1 "4A" 22737 2 2 2 2 22706 3 200 . . . "" . . . 2 1 22798 2 999 1 0 1 "4A" 22737 2 2 2 1 22798 3 999 1 0 1 "4A" 22737 2 2 2 2 22767 3 end format %td admissiondate format %td date label values sidep side label def side 1 "R", modify label def side 2 "L", modify label values genderp gender label def gender 1 "M", modify label def gender 2 "F", modify

1. The 'extra ones from proceduredata which remains unmatched will appear as missing in the variables ssi, dehiscence, pe, uniqueno etc as seen in procedureno 200. Which I am aware will happen as I have already removed the unique values from the hospitaldata set (as explained in A)

Therefore:
//Find the variables that do not have observations within the hospital dataset apart from those variables that you have matched on

generate miss = missing(genderp, gender, ssi, dehiscence, pe, uniqueno)
drop if miss == 1

//fIND the patients that are the same patients - ie THE pt has the procedureno + same gender identified from hospitaldataset + gender from proceduredataset
duplicates tag genderp gender procedureno, gen(duplicates)

//Values 0 if gender from hospitaldataset + gender from proceduredataset are different
drop if duplicates == 0

//Focus on those that are the same patients (same gender + same side) and pick the ones with the least days from the admissiondate
gen difference = date - admissiondate
keep if difference <= 70 & sidep == side

*If a patient has had a revision operation of the same side, this should not normally be within 70 days hence the reason why I have used 70 days. In fact I will change this to perhaps 5 days.

This merge to many seems to work for me. Now of course I'm working on a dummy dataset as I want to see how it works. But why so much critism? I don't see another way round it.
I can't use append, nor can I use 1:m or m:1 at this time as none of unique values, this is the the most consice dataset I have managed to get it too.

Im working on duplicates with the aim of creating unique values.

I will then merge again using m: 1
Tags: None
Andrew Musau

Join Date: Oct 2014

Posts: 10186
#2

15 Dec 2022, 14:38

Have a look at https://journals.sagepub.com/doi/ful...6867X211063416.
3 likes
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

15 Dec 2022, 15:03

For those who at some later point are directed to this topic by a search engine, and have not seen the earlier topics referred but not linked to in post #1, this earlier post of mine suggests alternatives to merge m:m, of which joinby seems to be the most likely to be helpful here.

It also explains that the merge command used in the code in post #1 uses old syntax for the merge command that was superseded in Stata 11, and it is effectively identical to merge m:m using the current syntax.

Last edited by William Lisowski; 15 Dec 2022, 15:09.
2 likes
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#4

15 Dec 2022, 15:27

This merge to many seems to work for me. Now of course I'm working on a dummy dataset as I want to see how it works. But why so much critism? I don't see another way round it.

It seems to work for you. But if you go to your real data and scrutinize the results you get you will see that -merge m:m- is pairing up things that have nothing to do with each other, and probably also missing pairing up things that should be paired with each other. Yes, you are getting results, but they are wrong results that make arbitrary and meaningless pairings between observations in the two data sets.

There is a way around it. -joinby- is the appropriate way to do this, as William Lisowski and Andrew Musau have already pointed out.

If the patient had a R or subsequently L operation it is coded with a unique indexcode (but may have the same procedureno as it may have been done in the same sitting.

If true, this would suggest that the best approach would be to combine the two observations into a single one in some way, and then you could do a 1:m or m:1 merge. But this doesn't seem to actually be true in the example data you show. For example, in the procedure example data, procedure #999 appears 4 times (so clearly not just an R and L), and these four occasions span a date range of 120 days, not the same sitting.
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#5

15 Dec 2022, 15:40

Some have said joinby (and I agree!!!!!) is the real m:m. Anyways, I second the advice above. Oh, and StataCorp literally advises us to not use m:m, to the point they won't even give an example for it. So, I won't argue with StataCorp on that issue!
1 like
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#6

15 Dec 2022, 15:53

Let me go out on a limb here and speculate about what your data really are. I'm guessing that your procedures data set is taken from the National Surgery Quality Improvement Program (NSQIP) database and that what you are calling "procedureno" is not actually a distinct identifier for procedures but is just a CPT code (Current Procedural Terminology) designating the type of procedure. Your other data base is a data set of hospitalizations, which also has CPT codes, and "uniqueno" which is a unique patient identifier, as well as a date (which might be admission date or discharge date). By contrast the date in the procedures data set is going to be the actual date of the procedure, which will, in general, be close to, but not the same as, the date in the hospitalizations database. And you are trying to pair up each observation in the procedures database to the corresponding hospitalization. That's going to be very difficult to do, especially since the procedure data set does not identify the patient.

If I'm on the right track here, then the most effective way to do this will actually be with Robert Picard's -rangejoin- command, available from SSC. Let me assume that the hospitalizations data set contains the admission date and the discharge date, so that you want to match hospitalizations with procedures whose dates fall between admission and discharge. Let's call those two date variable admissiondate, and dischargedate, respectively. Now, there is potential for spurious matches to be found. Two people in your hospital might have had the same procedure done, even on the same date, and these variables alone will not distinguish which is which. You can narrow it down a bit because your procedure database has a gender variable, which would have to match the genera variable in the hospitalizations data base. And I'll bet if you redid your data pull you could also get age in both data sets. Then you would do something like this:

Code:

use hospitalizations_data_set, clear rename gender genderp // TO MATCH THE OTHER DATA SET rangejoin date admissiondate dischargedate, by(procedureno genderp)

This will pair up each hospitalization observation with any and all observations of the procedure data set that have the same procedureno and where the patients are of the same sex. To try to whittle it down further, you can then drop any pairs where the ages differ by more than the duration of the admission + 1 year. You might be able to narrow things down a bit more by using other clues in the data. For example, race and ethnicity are probably available (although they may be ascertained differently in the two data sets, and so not entirely reliable). Also, I notice that there is another variable, pe, that appears in both data sets. If it means the same in both, and if they ought to match, then you could narrow things down even more by including that pe variable in the -by()- option of the -rangejoin- command.

In the end, there will probably be some cases that you cannot disambiguate and you may have to just discard those hospitalizations, or choose from among the eligible matches at random.

If, in fact, discharge dates are not available, it is harder. In that case, all you can really be sure is that the procedure date must be no earlier than the admission date. In that case the command becomes:

Code:

rangejoin date admissiondate ., by(procedureno genderp)

Note: In order to use -rangejoin- you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

Added: If you can get age, there is one more complication to deal with. NSQIP topcodes age at 90. So if you have exact ages in your hospitalization data, you have to topcode them at 90 as well so that comparing ages will work more or less correctly. Of course it means that if you had a 94 year old and a 98 year old who had the same procedure on the same or nearby dates, you wouldn't be able to distinguish them in the data, but I doubt there will be many such instances. And, in any case, you will still be better off with a partial disambiguation by age than with none at all.

Last edited by Clyde Schechter; 15 Dec 2022, 15:59.
2 likes
Comment
Denise Vella

Join Date: Aug 2022

Posts: 187
#7

15 Dec 2022, 20:32

Originally posted by Clyde Schechter View Post

If true, this would suggest that the best approach would be to combine the two observations into a single one in some way, and then you could do a 1:m or m:1 merge. But this doesn't seem to actually be true in the example data you show. For example, in the procedure example data, procedure #999 appears 4 times (so clearly not just an R and L), and these four occasions span a date range of 120 days, not the same sitting.

Yes correct one could be a failed surgery therefore a revision.

re this:

If, in fact, discharge dates are not available, it is harder. In that case, all you can really be sure is that the procedure date must be no earlier than the admission date. In that case the command becomes

what if i want to include any proceduredate that is within 1 day of the admission date. Can I still do this with rangejoin?

for clarification, no I am not working with nsqip, but i actually have a database that has procedurecodes. Every implant inserted into a human has a procedurecode.
Comment
Denise Vella

Join Date: Aug 2022

Posts: 187
#8

15 Dec 2022, 20:33

Btw and its great learning from pros like you all ! Thanks
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#9

15 Dec 2022, 22:14

what if i want to include any proceduredate that is within 1 day of the admission date. Can I still do this with rangejoin?

Yes. It's just a little tricky: you have to reverse the roles of the data sets for that. By within 1 day of the admission date, I assume you mean within 1 day after the admission date, not before.

Code:

use procedure_data_set, clear rename gender_p gender gen lowest = date-1 rangejoin admission_date lowest date using hospitalization_data_set, by(procedureno gender)
Comment
Denise Vella

Join Date: Aug 2022

Posts: 187
#10

16 Dec 2022, 00:43

Originally posted by Clyde Schechter View Post

Code:

use hospitalizations_data_set, clear rename gender genderp // TO MATCH THE OTHER DATA SET rangejoin date admissiondate dischargedate, by(procedureno genderp)

Hi I'm not officially working my way through this. Before using rangejoin I assume I need to use joinby?

As I need to bring the genderp from the proceduredataset....in order to match this up for the above code to work?
Comment

daniel klein

Join Date: Mar 2014
Posts: 3842

#11

16 Dec 2022, 01:14

Disclaimer: I have not followed this thread in detail and I have not followed the non-cited thread either. I will focus on the question:Why critic[ci]se? Why do we not like m:m merges?

Let's start with the resulting dataset implied (but not shown) in #1:

Code:

. list

     +----------------------------------------------------------------------------------------------------------------------+
     | proced~o   ssi   dehisc~e   pe   uniqueno   admissi~e   gender   side   sidep   genderp        date           _merge |
     |----------------------------------------------------------------------------------------------------------------------|
  1. |      888     0          1    1         1A   31dec2021        M      R       R         M   01jan2022      Matched (3) |
  2. |      888     1          1    1         2A   14feb2022        M      R       R         M   01jan2022      Matched (3) |
  3. |      999     1          1    1         3A   01mar2022        F      R       L         F   02mar2022      Matched (3) |
  4. |      999     1          0    1         4A   02apr2022        F      L       L         F   02may2022      Matched (3) |
  5. |      200     .          .    .                      .        .      .       L         M   02jun2022   Using only (2) |
     |----------------------------------------------------------------------------------------------------------------------|
  6. |      999     1          0    1         4A   02apr2022        F      L       L         M   02jun2022      Matched (3) |
  7. |      999     1          0    1         4A   02apr2022        F      L       R         F   02feb2022      Matched (3) |
     +----------------------------------------------------------------------------------------------------------------------+

The claim is that the m:m worked. Did it? Let's focus on the match in observation number 3.

Code:

3. |      999     1          1    1         3A   01mar2022        F      R       L         F   02mar2022      Matched (3) |

Here is female (genderp) who had surgery 999 on March 2nd after being admitted on March 1st.

What happens when the hospital dataset happens to have a different sort order?

Code:

load the hospital dataset
version 17 : set seed 4273
generate random = runiform()
sort random
drop random

merge m:m procedureno using "file location"

. list

     +----------------------------------------------------------------------------------------------------------------------+
     | proced~o   ssi   dehisc~e   pe   uniqueno   admissi~e   gender   side   sidep   genderp        date           _merge |
     |----------------------------------------------------------------------------------------------------------------------|
  1. |      888     1          1    1         2A   14feb2022        M      R       R         M   01jan2022      Matched (3) |
  2. |      888     0          1    1         1A   31dec2021        M      R       R         M   01jan2022      Matched (3) |
  3. |      999     1          0    1         4A   02apr2022        F      L       L         F   02mar2022      Matched (3) |
  4. |      999     1          1    1         3A   01mar2022        F      R       L         F   02may2022      Matched (3) |
  5. |      200     .          .    .                      .        .      .       L         M   02jun2022   Using only (2) |
     |----------------------------------------------------------------------------------------------------------------------|
  6. |      999     1          1    1         3A   01mar2022        F      R       L         M   02jun2022      Matched (3) |
  7. |      999     1          1    1         3A   01mar2022        F      R       R         F   02feb2022      Matched (3) |
     +----------------------------------------------------------------------------------------------------------------------+

Hm, observation 6 implies that our female, admitted March 1st, has now apparently swapped gender and also postponed surgery until June 2nd. Oh, wait! Looking at observation number 7, she is indeed female. But she had surgery before she was admitted.

Well, that is why we do not like m:m merges.

Last edited by daniel klein; 16 Dec 2022, 01:17.

Comment

daniel klein

Join Date: Mar 2014
Posts: 3842

#12

16 Dec 2022, 01:20

Here is the full code to replicate results from #11

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(procedureno indexcode new_dvt new_mi pe sidep genderp date)
200 15 1 0 0 2 1 22798
888  9 . 1 1 1 1 22646
999 13 1 1 . 2 2 22706
999 15 1 0 . 2 2 22767
999 16 1 0 1 2 1 22798
999 10 1 1 1 1 2 22678
end
format %td date
label values sidep side
label def side 1 "R", modify
label def side 2 "L", modify
label values genderp gender
label def gender 1 "M", modify
label def gender 2 "F", modify

/*
    I'll make this smaller as we are only 
    using the following variables, anyway.
*/
keep procedureno genderp sidep date

// save that for -merge- later
tempfile procedure
save "`procedure'"


* Example generated by -dataex-. For more info, type help dataex
clear
input float(procedureno ssi dehiscence pe) str3 uniqueno float(admissiondate gender side)
888 0 1 1 "1A" 22645 1 1
888 1 1 1 "2A" 22690 1 1
999 1 1 1 "3A" 22705 2 1
999 1 0 1 "4A" 22737 2 2
end
format %td admissiondate
label values gender gender
label def gender 1 "M", modify
label def gender 2 "F", modify
label values side side
label def side 1 "R", modify
label def side 2 "L", modify


// also save the original hospital dataset
tempfile hospitals
save "`hospitals'"


// the -merge- that is claimed to work:
merge m:m procedureno using "`procedure'"
list

// let's try again, shall we
use "`hospitals'" , clear
version 17 : set seed 4273
generate random = runiform()
sort random
drop random

// the -merge- that does not work so well
merge m:m procedureno using "`procedure'"

Comment

Denise Vella

Join Date: Aug 2022

Posts: 187
#13

16 Dec 2022, 06:49

Clyde Schechter

So thanks for explaining m:m is bad.

I'm now trying to use rangejoin.

Which on my dummy dataset has worked nicely bearing in mind it only had 6 rows of data.

I'm not using it on the realdataset which has 1mill observations.

I have loaded my proceduredata set -
rename gender_p gender gen lowest = date-1 keep procedureno gender OpDate lowest rangejoin admissiondate lowest OpDate using "hospital dataset.dta", by (Procedureno gender) All the variables joined from the hospitaldataset have all been set to missing... Which is impossible - as I checked there is a procedureno 3002 - 2 records can be found in the hospital dataset - with Admin date 11 Apr 2003, 23 Apr 2003, is a Male In the procedure dataset - the Opdate is 24 Apr and thus lowest is 23 Apr and is a Male Therefore the hospitaldataset 3002 record should be retained for 23 Apr 2003 whilst the hospitaldata for the admissiondate for 11 Apr 2003 should be kept as missing. Are there limits with rangejoin? If i can't sort the problem I'll use joinby and manually use keep and duplicatetag to find those patients with the same gender and whose admissiondate = OpDate or 1 day prior to their Op date
i wonder if you could help me troubleshoot
Comment
Denise Vella

Join Date: Aug 2022

Posts: 187
#14

16 Dec 2022, 09:50

Robert Picard I am generating a full dataset of missing variables when using rangejoin , This is what i've done

I have loaded my proceduredata set -
rename gender_p gender
gen lowest = date-1
keep procedureno gender OpDate lowest
rangejoin admissiondate lowest OpDate using "hospital dataset.dta", by (Procedureno gender)

All the variables joined from the hospitaldataset have all been set to missing... Which is impossible -

EXAMPLE:

I checked there is a procedureno 3002 - 2 records can be found in the hospital dataset - with Admin date 11 Apr 2003, 23 Apr 2003, is a Male

In the procedure dataset - the Opdate is 24 Apr and thus lowest is 23 Apr and is a Male

Therefore the hospitaldataset 3002 record should be retained for 23 Apr 2003 whilst the hospitaldata for the admissiondate for 11 Apr 2003 should be kept as missing.

Is it because the lowest shouldn't be equivalent to the admissiondate? (similar to rangestat?)

If i can't sort the problem I'll use joinby and manually use keep and duplicatetag to find those patients with the same gender and whose admissiondate = OpDate or 1 day prior to their Op date
i wonder if you could help me troubleshoot

Attached Files

Last edited by Denise Vella; 16 Dec 2022, 09:53.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

#15

16 Dec 2022, 10:17

Below I combine the code from daniel klein in post #12 to read your example datasets from post #1 with the code from Clyde Schechter in post #10, changing a few variable names in the latter as necessary.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(procedureno indexcode new_dvt new_mi pe sidep genderp date)
200 15 1 0 0 2 1 22798
888  9 . 1 1 1 1 22646
999 13 1 1 . 2 2 22706
999 15 1 0 . 2 2 22767
999 16 1 0 1 2 1 22798
999 10 1 1 1 1 2 22678
end
format %td date
label values sidep side
label def side 1 "R", modify
label def side 2 "L", modify
label values genderp gender
label def gender 1 "M", modify
label def gender 2 "F", modify

tempfile procedure
save "`procedure'"

* Example generated by -dataex-. For more info, type help dataex
clear
input float(procedureno ssi dehiscence pe) str3 uniqueno float(admissiondate gender side)
888 0 1 1 "1A" 22645 1 1
888 1 1 1 "2A" 22690 1 1
999 1 1 1 "3A" 22705 2 1
999 1 0 1 "4A" 22737 2 2
end
format %td admissiondate
label values gender gender
label def gender 1 "M", modify
label def gender 2 "F", modify
label values side side
label def side 1 "R", modify
label def side 2 "L", modify

tempfile hospitals
save "`hospitals'"

use "`procedure'", clear
rename genderp gender
gen lowest = date-1
format %td lowest
rangejoin admissiondate lowest date using "`hospitals'", by(procedureno gender)
list  procedureno lowest date admissiondate gender ssi dehiscence pe uniqueno , clean noobs

Code:

. use "`procedure'", clear

. rename genderp gender

. gen lowest = date-1

. format %td lowest

. rangejoin admissiondate lowest date using "`hospitals'", by(procedureno gender)
  (using rangestat version 1.1.1)

. list  procedureno lowest date admissiondate gender ssi dehiscence pe uniqueno , clean noobs

    proced~o      lowest        date   admissi~e   gender   ssi   dehisc~e   pe   uniqueno  
         200   01jun2022   02jun2022           .        M     .          .    0             
         888   31dec2021   01jan2022   31dec2021        M     0          1    1         1A  
         999   01mar2022   02mar2022   01mar2022        F     1          1    .         3A  
         999   01may2022   02may2022           .        F     .          .    .             
         999   01jun2022   02jun2022           .        M     .          .    1             
         999   01feb2022   02feb2022           .        F     .          .    1             

.

Announcement