Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging m:m - it seems to work for me. Why critise? Appreciate your feedback

    As you can see in my previous posts. I currently have one huge dataset called procedure.

    This shows each time a patient had an operation indicated with procedureno.
    If the patient had a R or subsequently L operation it is coded with a unique indexcode (but may have the same procedureno as it may have been done in the same sitting.


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(procedureno indexcode new_dvt new_mi pe sidep genderp date)
    200 15 1 0 0 2 1 22798
    888  9 . 1 1 1 1 22646
    999 13 1 1 . 2 2 22706
    999 15 1 0 . 2 2 22767
    999 16 1 0 1 2 1 22798
    999 10 1 1 1 1 2 22678
    end
    format %td date
    label values sidep side
    label def side 1 "R", modify
    label def side 2 "L", modify
    label values genderp gender
    label def gender 1 "M", modify
    label def gender 2 "F", modify

    I then have a hospital dataset. Which shows each time the patient came to hospital for any reason. //The only unique identifier between hospital dataset & procedure dataset is the procedureno //

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(procedureno ssi dehiscence pe) str3 uniqueno float(admissiondate gender side)
    888 0 1 1 "1A" 22645 1 1
    888 1 1 1 "2A" 22690 1 1
    999 1 1 1 "3A" 22705 2 1
    999 1 0 1 "4A" 22737 2 2
    end
    format %td admissiondate
    label values gender gender
    label def gender 1 "M", modify
    label def gender 2 "F", modify
    label values side side
    label def side 1 "R", modify
    label def side 2 "L", modify

    A.
    Please bear in mind I have already screened the hospital dataset and am working with a subset of patients which are duplicates - trying to identify which to keep.


    Now as the proceduredata set is very big I would like to merge using the hospitaldataset as the master dataset & procedure dataset as the using dataset.
    Unique identifier to merge - will be procedureno

    And from the proceduredata I onyl would like to keep genderp (gender) , sidep (side of surgery), date which is the operation date.

    Thus, knowing there are duplicates in my hospitaldataset + duplicates in my proceduredataset (as patients who have the same procedureno may have had a R + L operation) I need to merge using the following code:

    merge procedureno using "file location", keep (genderp sidep date)

    Result:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(procedureno ssi dehiscence pe) str3 uniqueno float(admissiondate gender side sidep genderp date) byte _merge
    888 0 1 1 "1A" 22645 1 1 1 1 22646 3
    888 1 1 1 "2A" 22690 1 1 1 1 22646 3
    999 1 1 1 "3A" 22705 2 1 1 2 22678 3
    999 1 0 1 "4A" 22737 2 2 2 2 22706 3
    200 . . . ""       . . . 2 1 22798 2
    999 1 0 1 "4A" 22737 2 2 2 1 22798 3
    999 1 0 1 "4A" 22737 2 2 2 2 22767 3
    end
    format %td admissiondate
    format %td date
    label values sidep side
    label def side 1 "R", modify
    label def side 2 "L", modify
    label values genderp gender
    label def gender 1 "M", modify
    label def gender 2 "F", modify

    1. The 'extra ones from proceduredata which remains unmatched will appear as missing in the variables ssi, dehiscence, pe, uniqueno etc as seen in procedureno 200. Which I am aware will happen as I have already removed the unique values from the hospitaldata set (as explained in A)

    Therefore:
    //Find the variables that do not have observations within the hospital dataset apart from those variables that you have matched on

    generate miss = missing(genderp, gender, ssi, dehiscence, pe, uniqueno)
    drop if miss == 1

    //fIND the patients that are the same patients - ie THE pt has the procedureno + same gender identified from hospitaldataset + gender from proceduredataset
    duplicates tag genderp gender procedureno, gen(duplicates)

    //Values 0 if gender from hospitaldataset + gender from proceduredataset are different
    drop if duplicates == 0

    //Focus on those that are the same patients (same gender + same side) and pick the ones with the least days from the admissiondate
    gen difference = date - admissiondate
    keep if difference <= 70 & sidep == side

    *If a patient has had a revision operation of the same side, this should not normally be within 70 days hence the reason why I have used 70 days. In fact I will change this to perhaps 5 days.

    This merge to many seems to work for me. Now of course I'm working on a dummy dataset as I want to see how it works. But why so much critism? I don't see another way round it.
    I can't use append, nor can I use 1:m or m:1 at this time as none of unique values, this is the the most consice dataset I have managed to get it too.


    Im working on duplicates with the aim of creating unique values.

    I will then merge again using m: 1


  • #2
    Have a look at https://journals.sagepub.com/doi/ful...6867X211063416.

    Comment


    • #3
      For those who at some later point are directed to this topic by a search engine, and have not seen the earlier topics referred but not linked to in post #1, this earlier post of mine suggests alternatives to merge m:m, of which joinby seems to be the most likely to be helpful here.

      It also explains that the merge command used in the code in post #1 uses old syntax for the merge command that was superseded in Stata 11, and it is effectively identical to merge m:m using the current syntax.
      Last edited by William Lisowski; 15 Dec 2022, 15:09.

      Comment


      • #4
        This merge to many seems to work for me. Now of course I'm working on a dummy dataset as I want to see how it works. But why so much critism? I don't see another way round it.
        It seems to work for you. But if you go to your real data and scrutinize the results you get you will see that -merge m:m- is pairing up things that have nothing to do with each other, and probably also missing pairing up things that should be paired with each other. Yes, you are getting results, but they are wrong results that make arbitrary and meaningless pairings between observations in the two data sets.

        There is a way around it. -joinby- is the appropriate way to do this, as William Lisowski and Andrew Musau have already pointed out.

        If the patient had a R or subsequently L operation it is coded with a unique indexcode (but may have the same procedureno as it may have been done in the same sitting.
        If true, this would suggest that the best approach would be to combine the two observations into a single one in some way, and then you could do a 1:m or m:1 merge. But this doesn't seem to actually be true in the example data you show. For example, in the procedure example data, procedure #999 appears 4 times (so clearly not just an R and L), and these four occasions span a date range of 120 days, not the same sitting.

        Comment


        • #5
          Some have said joinby (and I agree!!!!!) is the real m:m. Anyways, I second the advice above. Oh, and StataCorp literally advises us to not use m:m, to the point they won't even give an example for it. So, I won't argue with StataCorp on that issue!

          Comment


          • #6
            Let me go out on a limb here and speculate about what your data really are. I'm guessing that your procedures data set is taken from the National Surgery Quality Improvement Program (NSQIP) database and that what you are calling "procedureno" is not actually a distinct identifier for procedures but is just a CPT code (Current Procedural Terminology) designating the type of procedure. Your other data base is a data set of hospitalizations, which also has CPT codes, and "uniqueno" which is a unique patient identifier, as well as a date (which might be admission date or discharge date). By contrast the date in the procedures data set is going to be the actual date of the procedure, which will, in general, be close to, but not the same as, the date in the hospitalizations database. And you are trying to pair up each observation in the procedures database to the corresponding hospitalization. That's going to be very difficult to do, especially since the procedure data set does not identify the patient.

            If I'm on the right track here, then the most effective way to do this will actually be with Robert Picard's -rangejoin- command, available from SSC. Let me assume that the hospitalizations data set contains the admission date and the discharge date, so that you want to match hospitalizations with procedures whose dates fall between admission and discharge. Let's call those two date variable admissiondate, and dischargedate, respectively. Now, there is potential for spurious matches to be found. Two people in your hospital might have had the same procedure done, even on the same date, and these variables alone will not distinguish which is which. You can narrow it down a bit because your procedure database has a gender variable, which would have to match the genera variable in the hospitalizations data base. And I'll bet if you redid your data pull you could also get age in both data sets. Then you would do something like this:
            Code:
            use hospitalizations_data_set, clear
            rename gender genderp // TO MATCH THE OTHER DATA SET
            rangejoin date admissiondate dischargedate, by(procedureno genderp)
            This will pair up each hospitalization observation with any and all observations of the procedure data set that have the same procedureno and where the patients are of the same sex. To try to whittle it down further, you can then drop any pairs where the ages differ by more than the duration of the admission + 1 year. You might be able to narrow things down a bit more by using other clues in the data. For example, race and ethnicity are probably available (although they may be ascertained differently in the two data sets, and so not entirely reliable). Also, I notice that there is another variable, pe, that appears in both data sets. If it means the same in both, and if they ought to match, then you could narrow things down even more by including that pe variable in the -by()- option of the -rangejoin- command.

            In the end, there will probably be some cases that you cannot disambiguate and you may have to just discard those hospitalizations, or choose from among the eligible matches at random.

            If, in fact, discharge dates are not available, it is harder. In that case, all you can really be sure is that the procedure date must be no earlier than the admission date. In that case the command becomes:
            Code:
            rangejoin date admissiondate ., by(procedureno genderp)
            Note: In order to use -rangejoin- you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

            Added: If you can get age, there is one more complication to deal with. NSQIP topcodes age at 90. So if you have exact ages in your hospitalization data, you have to topcode them at 90 as well so that comparing ages will work more or less correctly. Of course it means that if you had a 94 year old and a 98 year old who had the same procedure on the same or nearby dates, you wouldn't be able to distinguish them in the data, but I doubt there will be many such instances. And, in any case, you will still be better off with a partial disambiguation by age than with none at all.
            Last edited by Clyde Schechter; 15 Dec 2022, 15:59.

            Comment


            • #7
              Originally posted by Clyde Schechter View Post

              If true, this would suggest that the best approach would be to combine the two observations into a single one in some way, and then you could do a 1:m or m:1 merge. But this doesn't seem to actually be true in the example data you show. For example, in the procedure example data, procedure #999 appears 4 times (so clearly not just an R and L), and these four occasions span a date range of 120 days, not the same sitting.
              Yes correct one could be a failed surgery therefore a revision.

              re this:

              If, in fact, discharge dates are not available, it is harder. In that case, all you can really be sure is that the procedure date must be no earlier than the admission date. In that case the command becomes

              what if i want to include any proceduredate that is within 1 day of the admission date. Can I still do this with rangejoin?

              for clarification, no I am not working with nsqip, but i actually have a database that has procedurecodes. Every implant inserted into a human has a procedurecode.

              Comment


              • #8
                Btw and its great learning from pros like you all ! Thanks

                Comment


                • #9
                  what if i want to include any proceduredate that is within 1 day of the admission date. Can I still do this with rangejoin?
                  Yes. It's just a little tricky: you have to reverse the roles of the data sets for that. By within 1 day of the admission date, I assume you mean within 1 day after the admission date, not before.
                  Code:
                  use procedure_data_set, clear
                  rename gender_p gender
                  gen lowest = date-1
                  rangejoin admission_date lowest date using hospitalization_data_set, by(procedureno gender)

                  Comment


                  • #10
                    Originally posted by Clyde Schechter View Post
                    Code:
                    use hospitalizations_data_set, clear
                    rename gender genderp // TO MATCH THE OTHER DATA SET
                    rangejoin date admissiondate dischargedate, by(procedureno genderp)
                    Hi I'm not officially working my way through this. Before using rangejoin I assume I need to use joinby?

                    As I need to bring the genderp from the proceduredataset....in order to match this up for the above code to work?

                    Comment


                    • #11
                      Disclaimer: I have not followed this thread in detail and I have not followed the non-cited thread either. I will focus on the question:Why critic[ci]se? Why do we not like m:m merges?

                      Let's start with the resulting dataset implied (but not shown) in #1:

                      Code:
                      . list
                      
                           +----------------------------------------------------------------------------------------------------------------------+
                           | proced~o   ssi   dehisc~e   pe   uniqueno   admissi~e   gender   side   sidep   genderp        date           _merge |
                           |----------------------------------------------------------------------------------------------------------------------|
                        1. |      888     0          1    1         1A   31dec2021        M      R       R         M   01jan2022      Matched (3) |
                        2. |      888     1          1    1         2A   14feb2022        M      R       R         M   01jan2022      Matched (3) |
                        3. |      999     1          1    1         3A   01mar2022        F      R       L         F   02mar2022      Matched (3) |
                        4. |      999     1          0    1         4A   02apr2022        F      L       L         F   02may2022      Matched (3) |
                        5. |      200     .          .    .                      .        .      .       L         M   02jun2022   Using only (2) |
                           |----------------------------------------------------------------------------------------------------------------------|
                        6. |      999     1          0    1         4A   02apr2022        F      L       L         M   02jun2022      Matched (3) |
                        7. |      999     1          0    1         4A   02apr2022        F      L       R         F   02feb2022      Matched (3) |
                           +----------------------------------------------------------------------------------------------------------------------+
                      The claim is that the m:m worked. Did it? Let's focus on the match in observation number 3.

                      Code:
                      3. |      999     1          1    1         3A   01mar2022        F      R       L         F   02mar2022      Matched (3) |
                      Here is female (genderp) who had surgery 999 on March 2nd after being admitted on March 1st.

                      What happens when the hospital dataset happens to have a different sort order?

                      Code:
                      load the hospital dataset
                      version 17 : set seed 4273
                      generate random = runiform()
                      sort random
                      drop random
                      
                      merge m:m procedureno using "file location"
                      
                      . list
                      
                           +----------------------------------------------------------------------------------------------------------------------+
                           | proced~o   ssi   dehisc~e   pe   uniqueno   admissi~e   gender   side   sidep   genderp        date           _merge |
                           |----------------------------------------------------------------------------------------------------------------------|
                        1. |      888     1          1    1         2A   14feb2022        M      R       R         M   01jan2022      Matched (3) |
                        2. |      888     0          1    1         1A   31dec2021        M      R       R         M   01jan2022      Matched (3) |
                        3. |      999     1          0    1         4A   02apr2022        F      L       L         F   02mar2022      Matched (3) |
                        4. |      999     1          1    1         3A   01mar2022        F      R       L         F   02may2022      Matched (3) |
                        5. |      200     .          .    .                      .        .      .       L         M   02jun2022   Using only (2) |
                           |----------------------------------------------------------------------------------------------------------------------|
                        6. |      999     1          1    1         3A   01mar2022        F      R       L         M   02jun2022      Matched (3) |
                        7. |      999     1          1    1         3A   01mar2022        F      R       R         F   02feb2022      Matched (3) |
                           +----------------------------------------------------------------------------------------------------------------------+
                      Hm, observation 6 implies that our female, admitted March 1st, has now apparently swapped gender and also postponed surgery until June 2nd. Oh, wait! Looking at observation number 7, she is indeed female. But she had surgery before she was admitted.

                      Well, that is why we do not like m:m merges.
                      Last edited by daniel klein; 16 Dec 2022, 01:17.

                      Comment


                      • #12
                        Here is the full code to replicate results from #11

                        Code:
                        * Example generated by -dataex-. For more info, type help dataex
                        clear
                        input float(procedureno indexcode new_dvt new_mi pe sidep genderp date)
                        200 15 1 0 0 2 1 22798
                        888  9 . 1 1 1 1 22646
                        999 13 1 1 . 2 2 22706
                        999 15 1 0 . 2 2 22767
                        999 16 1 0 1 2 1 22798
                        999 10 1 1 1 1 2 22678
                        end
                        format %td date
                        label values sidep side
                        label def side 1 "R", modify
                        label def side 2 "L", modify
                        label values genderp gender
                        label def gender 1 "M", modify
                        label def gender 2 "F", modify
                        
                        /*
                            I'll make this smaller as we are only 
                            using the following variables, anyway.
                        */
                        keep procedureno genderp sidep date
                        
                        // save that for -merge- later
                        tempfile procedure
                        save "`procedure'"
                        
                        
                        * Example generated by -dataex-. For more info, type help dataex
                        clear
                        input float(procedureno ssi dehiscence pe) str3 uniqueno float(admissiondate gender side)
                        888 0 1 1 "1A" 22645 1 1
                        888 1 1 1 "2A" 22690 1 1
                        999 1 1 1 "3A" 22705 2 1
                        999 1 0 1 "4A" 22737 2 2
                        end
                        format %td admissiondate
                        label values gender gender
                        label def gender 1 "M", modify
                        label def gender 2 "F", modify
                        label values side side
                        label def side 1 "R", modify
                        label def side 2 "L", modify
                        
                        
                        // also save the original hospital dataset
                        tempfile hospitals
                        save "`hospitals'"
                        
                        
                        // the -merge- that is claimed to work:
                        merge m:m procedureno using "`procedure'"
                        list
                        
                        // let's try again, shall we
                        use "`hospitals'" , clear
                        version 17 : set seed 4273
                        generate random = runiform()
                        sort random
                        drop random
                        
                        // the -merge- that does not work so well
                        merge m:m procedureno using "`procedure'"

                        Comment


                        • #13
                          Clyde Schechter

                          So thanks for explaining m:m is bad.

                          I'm now trying to use rangejoin.

                          Which on my dummy dataset has worked nicely bearing in mind it only had 6 rows of data.

                          I'm not using it on the realdataset which has 1mill observations.

                          I have loaded my proceduredata set -
                          rename gender_p gender gen lowest = date-1 keep procedureno gender OpDate lowest rangejoin admissiondate lowest OpDate using "hospital dataset.dta", by (Procedureno gender) All the variables joined from the hospitaldataset have all been set to missing... Which is impossible - as I checked there is a procedureno 3002 - 2 records can be found in the hospital dataset - with Admin date 11 Apr 2003, 23 Apr 2003, is a Male In the procedure dataset - the Opdate is 24 Apr and thus lowest is 23 Apr and is a Male Therefore the hospitaldataset 3002 record should be retained for 23 Apr 2003 whilst the hospitaldata for the admissiondate for 11 Apr 2003 should be kept as missing. Are there limits with rangejoin? If i can't sort the problem I'll use joinby and manually use keep and duplicatetag to find those patients with the same gender and whose admissiondate = OpDate or 1 day prior to their Op date
                          i wonder if you could help me troubleshoot

                          Comment


                          • #14
                            Robert Picard I am generating a full dataset of missing variables when using rangejoin , This is what i've done

                            I have loaded my proceduredata set -
                            rename gender_p gender
                            gen lowest = date-1
                            keep procedureno gender OpDate lowest
                            rangejoin admissiondate lowest OpDate using "hospital dataset.dta", by (Procedureno gender)

                            All the variables joined from the hospitaldataset have all been set to missing... Which is impossible -

                            EXAMPLE:

                            I checked there is a procedureno 3002 - 2 records can be found in the hospital dataset - with Admin date 11 Apr 2003, 23 Apr 2003, is a Male

                            In the procedure dataset - the Opdate is 24 Apr and thus lowest is 23 Apr and is a Male

                            Therefore the hospitaldataset 3002 record should be retained for 23 Apr 2003 whilst the hospitaldata for the admissiondate for 11 Apr 2003 should be kept as missing.

                            Is it because the lowest shouldn't be equivalent to the admissiondate? (similar to rangestat?)

                            If i can't sort the problem I'll use joinby and manually use keep and duplicatetag to find those patients with the same gender and whose admissiondate = OpDate or 1 day prior to their Op date
                            i wonder if you could help me troubleshoot

                            Click image for larger version

Name:	viber_image_2022-12-16_13-45-19-973.jpg
Views:	2
Size:	112.6 KB
ID:	1693856
                            Attached Files
                            Last edited by Denise Vella; 16 Dec 2022, 09:53.

                            Comment


                            • #15
                              Below I combine the code from daniel klein in post #12 to read your example datasets from post #1 with the code from Clyde Schechter in post #10, changing a few variable names in the latter as necessary.
                              Code:
                              * Example generated by -dataex-. For more info, type help dataex
                              clear
                              input float(procedureno indexcode new_dvt new_mi pe sidep genderp date)
                              200 15 1 0 0 2 1 22798
                              888  9 . 1 1 1 1 22646
                              999 13 1 1 . 2 2 22706
                              999 15 1 0 . 2 2 22767
                              999 16 1 0 1 2 1 22798
                              999 10 1 1 1 1 2 22678
                              end
                              format %td date
                              label values sidep side
                              label def side 1 "R", modify
                              label def side 2 "L", modify
                              label values genderp gender
                              label def gender 1 "M", modify
                              label def gender 2 "F", modify
                              
                              tempfile procedure
                              save "`procedure'"
                              
                              * Example generated by -dataex-. For more info, type help dataex
                              clear
                              input float(procedureno ssi dehiscence pe) str3 uniqueno float(admissiondate gender side)
                              888 0 1 1 "1A" 22645 1 1
                              888 1 1 1 "2A" 22690 1 1
                              999 1 1 1 "3A" 22705 2 1
                              999 1 0 1 "4A" 22737 2 2
                              end
                              format %td admissiondate
                              label values gender gender
                              label def gender 1 "M", modify
                              label def gender 2 "F", modify
                              label values side side
                              label def side 1 "R", modify
                              label def side 2 "L", modify
                              
                              tempfile hospitals
                              save "`hospitals'"
                              
                              use "`procedure'", clear
                              rename genderp gender
                              gen lowest = date-1
                              format %td lowest
                              rangejoin admissiondate lowest date using "`hospitals'", by(procedureno gender)
                              list  procedureno lowest date admissiondate gender ssi dehiscence pe uniqueno , clean noobs
                              Code:
                              . use "`procedure'", clear
                              
                              . rename genderp gender
                              
                              . gen lowest = date-1
                              
                              . format %td lowest
                              
                              . rangejoin admissiondate lowest date using "`hospitals'", by(procedureno gender)
                                (using rangestat version 1.1.1)
                              
                              . list  procedureno lowest date admissiondate gender ssi dehiscence pe uniqueno , clean noobs
                              
                                  proced~o      lowest        date   admissi~e   gender   ssi   dehisc~e   pe   uniqueno  
                                       200   01jun2022   02jun2022           .        M     .          .    0             
                                       888   31dec2021   01jan2022   31dec2021        M     0          1    1         1A  
                                       999   01mar2022   02mar2022   01mar2022        F     1          1    .         3A  
                                       999   01may2022   02may2022           .        F     .          .    .             
                                       999   01jun2022   02jun2022           .        M     .          .    1             
                                       999   01feb2022   02feb2022           .        F     .          .    1             
                              
                              .

                              Comment

                              Working...
                              X