Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging

    When I think I'm nearly there - another problem crops up.

    I have two big datasets - here's a sample

    hospital dataset

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(procedureno ssi dehiscence pe) str3 uniqueno float admissiondate
    112 0 1 1 "11A" 22645
    113 1 1 1 "12A" 22690
    113 1 1 1 "13A" 22705
    114 1 0 1 "14A" 22737
    115 1 0 1 "15A" 22738
    end
    format %td admissiondate

    procedure dataset

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(procedureno indexcode new_dvt new_mi pe date)
    112  9 . 1 1 22646
    112 10 1 1 1 22646
    113 13 1 1 . 22706
    113 15 1 0 . 22706
    114 16 1 0 1 22738
    115 17 . . . 22739
    end
    format %td date

    Aim:

    I would like to merge the hospital data set onto the procedure dataset.

    The hospital dataset consists of duplicate patients with the same procedureno but different uniqueno. In order to decide row to keep, I need to merge the hospital dataset onto the proceduredataset.
    Then create a dummy variable whereby Admissiondate - date. The one with the admissiondate closest to the opdate will be kept, the other dropped.

    *date is the day the operation took place

    However:

    A.
    There are duplicates in the procedure dataset due to patient having a Right + Left operation - therefore have a different indexcode but the same procedureno (as the operation was done in the same sitting)
    Both right and left operation need to remain in the data set.
    Only procedureno is the variable that is shared amongst the two datasets


    Intially I was going to merge 1:m - however on realising the problem with A of course I couldn't do it.

    Then I thought of merge m:m --> however there has been a lot of critism about this.

    So then I though of what if I pull the date variable only from the procedure dataset merge this onto the hospital dataset, create the dummy variable Admission - date. Keep the ones I need. Save this as unique dataset.

    use "hospitaldataset.dta"
    merge procedureno using "proceduredataset.dta", keep (date)

    It then tells me i'M USING old syntax - but it still does the job. Any critisims?
    Last edited by Denise Vella; 15 Dec 2022, 06:16.

  • #2
    Would it help to define a new double procedure 2112 that comprises left plus right? Then remove the two 112 procedure observations and add a 2112 record with the same information? I think that would simplify the merge.

    Comment


    • #3
      In the output of help merge there is a link to the help file for the old syntax, it is
      Code:
      help merge_10
      Careful reading of that documentation suggests that the following applies in your case.
      Code:
      If none of the three unique options are specified, observations in
      neither the master nor the using dataset are required to be unique,
      although they could be.  If they are not unique, records that have
      the same values of the match variables are joined by observation
      until all the records on one side or the other are matched; after
      that, the final record on the shorter side is duplicated over and
      over again to match with the remaining records needing to be matched
      on the longer side.
      In other words, you have done the equivalent of a merge m:m in the current syntax. You will note the similarity to the following explanation copied word-for-word from the documentation of the merge command in the Stata Data Management Reference Manual PDF included in the Stata installation and accessible from Stata's Help menu.

      m:m merges

      m:m specifies a many-to-many merge and is a bad idea. In an m:m merge, observations are matched within equal values of the key variable(s), with the first observation being matched to the first; the second, to the second; and so on. If the master and using have an unequal number of observations within the group, then the last observation of the shorter group is used repeatedly to match with subsequent observations of the longer group. Thus m:m merges are dependent on the current sort order—something which should never happen.

      Because m:m merges are such a bad idea, we are not going to show you an example. If you think that you need an m:m merge, then you probably need to work with your data so that you can use a 1:m or m:1 merge. Tips for this are given in Troubleshooting m:m merges below.
      With that said, if you are thinking about using merge m:m, it is a near certainty that at least one of the following is true:

      1. Your merge key actually does uniquely identify the observations in one of your data sets, so you can use merge 1:m or merge m:1 or maybe even merge 1:1.

      2. You are failing to take account of one or more additional variables in your data set that, combined with the variables you are trying to use as the merge key, uniquely identify the observations in one or both of your data sets, so you can use merge 1:m or merge m:1 or merge 1:1 with the expanded merge key.

      3. You are really trying to accomplish what joinby, a different command, does: create in the output dataset every possible combination of an observation from the first dataset and an observation from the second dataset, both having the same key. (SQL users take note! I fell for this myself: to Stata m-to-m does not mean m-by-m.)

      4. You actually need to append your datasets rather than merge them.

      5. The data sets you are trying to merge are incorrectly configured or contain data errors that need to be fixed.

      Comment

      Working...
      X