Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge update, replace

    Hi,
    I'm performing a merge of two datasets using the update option. I get the warning that there are 100 subjects who have a non missing conflict as described in the manual (https://www.stata.com/manuals13/dmerge.pdf).

    merge 1:1 subject_Id using StudyB, update

    I would like to somehow know which variables and subjects have the non missing conflicts. How can I perform this?

  • #2
    when Stata gives you back the result of the merge it is also giving you a new variable (by default name "_merge") and a table; values of this variable equal to 5 the values you want to investigate

    Comment


    • #3
      That is true..
      However, my master dataset has close to 100 variables and the using dataset has another 100.
      So, you are correct that I will know which subjects fall under this category, but I don't know which variables of that subject cause it.

      If it were a small amount of variables, I would easily be able to eyeball it and find the differences. However there are too many variables for me to try to eyeball.
      That is what I'm trying to figure out.

      Comment


      • #4
        This kind of problem often arises in my workflow. The way I usually handle it is to go back and -append- the two data sets instead of -merge-ing them. Then I do -duplicates drop-, which eliminates anything that would have been left completely unchanged in the merge. Then I would sort the data on the merge key (subject_id in your case). I would order the variables common to the two data sets first. This way each subject's records in both data sets are lined up right below each other and I can get a grasp of what is going on. To drill down, I then loop over the common variables and check for conflicts. So something like this:

        Code:
        use dataset1, clear
        ds
        local vars1 `r(varlist)'
        des using dataset2, varlist
        local vars2 `r(varlist)'
        local common: list vars1 & vars2
        local exclude subject_id
        local common: list common - exclude
        
        gen byte source = 1
        
        append using dataset2
        replace source = 2 if missing(source)
        order subject_id source `common', first
        
        foreach c of local common {
            by subject_id (`c'), sort: egen byte conflict_`c' = max(`c' != `c'[1] & !missing(`c'))
        }
        You can then identify records with conflict on any particular variable v by running -browse if conflict_v-. Note that the code above is tailored to numeric variables. For a string variable conflict_`c' would be defined as max(`c' != `c'[_N] & !missing(`c')).

        Comment


        • #5
          Brilliant! Thank you, Clyde!

          Comment

          Working...
          X