Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How do I determine which are the problem variables after a merge with several "nonmissing conflict"s.

    Hi Everyone,

    I just merged two files (using Stata 11 on a mac) which resulted 23 instances of "nonmissing conflict" out of 500, where _merge = 5. There were a handful of overlapping variables in both datasets.

    I would like to find which were the variables that conflicted. I can weed out which observations using _merge = 5, but is there a way to determine which variables were the issue? I am assuming that the resultant file includes only values from the master dataset (I used update, but NOT replace), but I want to double check all of these values in both files since there is a conflict.

    Thanks for any help!
    Cara


  • #2
    So, you are correct that the merged data set will contain the values from the master data set only. Had you used the -replace- option you would have the values from the using data set only. To get them both, you need to rename them in one of the two data sets. So starting from your merged data, I would do the following steps. You don't say what the merge key variables were. For the sake of concreteness, I'll assume there was just one and I'm just going to call it merge_key. Mark up the code according to the actual variable(s) used for merging.
    Code:
    keep if _merge == 5
    
    // GET THE NAMES OF VARIALBES IN THE USING DATA SET
    // OTHER THAN THE MERGE KEY
    preserve
    use using_data, clear
    ds merge_key, not
    local using_vars `r(varlist)'
    
    // NOW FIND VARIABLES COMMON WITH THE MASTER DATA SET
    restore
    ds merge_key, not
    local master_vars `r(varlist)'
    local common: list master_vars & using_vars
    
    // NOW RENAME THE COMMON VARIABLES IN THE MASTER DATA SET
    foreach v of varlist `common' {
        rename `v' `v'_master
    }
    
    // NOW DO THE MERGE AGAIN
    merge 1:1 /* OR m:1 OR 1:m, WHICHEVER WAS DONE ORIGINALLY */ merge_key using using_data, ///
        keep(match) assert(match using)
    And from here you can do whatever you want to do (graphs, summary statistics, browsing, whatever) to compare each the variables listed in `common' (which now come from the using data set) with the corresponding values in the master data set bearing the same name with _master at the end.

    Comment


    • #3
      So, you are correct that the merged data set will contain the values from the master data set only. Had you used the -replace- option you would have the values from the using data set only. To get them both, you need to rename them in one of the two data sets. So starting from your merged data, I would do the following steps. You don't say what the merge key variables were. For the sake of concreteness, I'll assume there was just one and I'm just going to call it merge_key. Mark up the code according to the actual variable(s) used for merging.
      Code:
      keep if _merge == 5
      
      // GET THE NAMES OF VARIALBES IN THE USING DATA SET
      // OTHER THAN THE MERGE KEY
      preserve
      use using_data, clear
      ds merge_key, not
      local using_vars `r(varlist)'
      
      // NOW FIND VARIABLES COMMON WITH THE MASTER DATA SET
      restore
      ds merge_key, not
      local master_vars `r(varlist)'
      local common: list master_vars & using_vars
      
      // NOW RENAME THE COMMON VARIABLES IN THE MASTER DATA SET
      foreach v of varlist `common' {
          rename `v' `v'_master
      }
      
      // NOW DO THE MERGE AGAIN
      merge 1:1 /* OR m:1 OR 1:m, WHICHEVER WAS DONE ORIGINALLY */ merge_key using using_data, ///
          keep(match) assert(match using)
      And from here you can do whatever you want to do (graphs, summary statistics, browsing, whatever) to compare each the variables listed in `common' (which now come from the using data set) with the corresponding values in the master data set bearing the same name with _master at the end.

      Comment


      • #4
        So, you are correct that the merged data set will contain the values from the master data set only. Had you used the -replace- option you would have the values from the using data set only. To get them both, you need to rename them in one of the two data sets. So starting from your merged data, I would do the following steps. You don't say what the merge key variables were. For the sake of concreteness, I'll assume there was just one and I'm just going to call it merge_key. Mark up the code according to the actual variable(s) used for merging.
        Code:
        keep if _merge == 5
        
        // GET THE NAMES OF VARIALBES IN THE USING DATA SET
        // OTHER THAN THE MERGE KEY
        preserve
        use using_data, clear
        ds merge_key, not
        local using_vars `r(varlist)'
        
        // NOW FIND VARIABLES COMMON WITH THE MASTER DATA SET
        restore
        ds merge_key, not
        local master_vars `r(varlist)'
        local common: list master_vars & using_vars
        
        // NOW RENAME THE COMMON VARIABLES IN THE MASTER DATA SET
        foreach v of varlist `common' {
            rename `v' `v'_master
        }
        
        // NOW DO THE MERGE AGAIN
        merge 1:1 /* OR m:1 OR 1:m, WHICHEVER WAS DONE ORIGINALLY */ merge_key using using_data, ///
            keep(match) assert(match using)
        And from here you can do whatever you want to do (graphs, summary statistics, browsing, whatever) to compare each the variables listed in `common' (which now come from the using data set) with the corresponding values in the master data set bearing the same name with _master at the end.

        Comment


        • #5
          Oops! Apologies for the multiple posts! The forum server was even less responsive than usual yesterday. Normally when I get hung on the "working" message I open a new browser window with the forum. That window usually shows that my post is there. So then I just close the original window. In the past that has prevented multiple posting. But yesterday, when I opened the new windows, there was no evidence that my post had gotten through, so I tried again. Sorry, sorry, sorry. It wastes everybody's time when posts are multiplied like that.

          Comment


          • #6
            You may also find the SSC program cfout helpful, which compares two datasets and outputs a dataset of differences.

            Comment


            • #7
              Thanks Clyde!

              You weren't the only one who sensed the server was less responsive than usual, hence my late reply.

              I'm not sure if your suggestion 100% clears things up for me - although it did clear something else up for me, unrelated, so thanks also for that. I did merge using only 1 key variable, as you correctly guessed - what I had was:

              merge 1:1 mykeyvar using anotherfile, update

              What I was hoping to find was some kind of detailed output as a result of the merge saying for variable A there were conflicts at observation numbers 3,200, 450 and for variable G there were conflicts at observations 67,68, 300, etc.

              I think I could re-merge the files only for the common variables and force the using value as you suggested, and then compare the two merged files - that should give me the answer I am looking for. If anyone reading this happens to know of something already written that does that please let me know because I am still curious. If not maybe I'll try to write it myself.

              By the way, I also have been getting lists of the common variables between files using something I installed from ssc called cfvars -- super, super useful and it has saved me a ton of time, so thanks a ton to whoever wrote that.

              Comment


              • #8
                Sorry to re-open an old topic, however I've had the same problem of Cara and the solution I found was based on this post.

                Originally posted by Clyde Schechter View Post
                So, you are correct that the merged data set will contain the values from the master data set only. Had you used the -replace- option you would have the values from the using data set only. To get them both, you need to rename them in one of the two data sets. So starting from your merged data, I would do the following steps. You don't say what the merge key variables were. For the sake of concreteness, I'll assume there was just one and I'm just going to call it merge_key. Mark up the code according to the actual variable(s) used for merging.
                Code:
                keep if _merge == 5
                
                // GET THE NAMES OF VARIALBES IN THE USING DATA SET
                // OTHER THAN THE MERGE KEY
                preserve
                use using_data, clear
                ds merge_key, not
                local using_vars `r(varlist)'
                
                // NOW FIND VARIABLES COMMON WITH THE MASTER DATA SET
                restore
                ds merge_key, not
                local master_vars `r(varlist)'
                local common: list master_vars & using_vars
                
                // NOW RENAME THE COMMON VARIABLES IN THE MASTER DATA SET
                foreach v of varlist `common' {
                rename `v' `v'_master
                }
                
                // NOW DO THE MERGE AGAIN
                merge 1:1 /* OR m:1 OR 1:m, WHICHEVER WAS DONE ORIGINALLY */ merge_key using using_data, ///
                keep(match) assert(match using)
                And from here you can do whatever you want to do (graphs, summary statistics, browsing, whatever) to compare each the variables listed in `common' (which now come from the using data set) with the corresponding values in the master data set bearing the same name with _master at the end.
                However I had to make an amendment to this procedure: it seems to me that, in this way, the new master dataset also include the using one, thus the list of common variables coincide with the ones of the using dataset. I think the right procedure is this one:

                use master_data
                merge 1:1 /* OR m:1 OR 1:m, WHICHEVER WAS DONE ORIGINALLY */ merge_key using using_data
                keep if _merge == 5
                keep identifier /* in the case of 1:1 or 1:m matches, the identifier is the merge_key */
                save conflicting_data

                use master_data, clear
                merge 1:1 identifier using conflicting_data
                keep if _merge==3
                drop _merge
                then continuing as in the example above:

                Code:
                // GET THE NAMES OF VARIABLES IN THE USING DATA SET
                // OTHER THAN THE MERGE KEY
                preserve
                use using_data, clear
                ds merge_key, not
                local using_vars `r(varlist)' // NOW FIND VARIABLES COMMON WITH THE MASTER DATA SET
                restore
                ds merge_key, not
                local master_vars `r(varlist)'
                local common: list master_vars & using_vars // NOW RENAME THE COMMON VARIABLES IN THE MASTER DATA SET
                foreach v of varlist `common' { rename `v' `v'_master } // NOW DO THE MERGE AGAIN
                merge 1:1 /* OR m:1 OR 1:m, WHICHEVER WAS DONE ORIGINALLY */ merge_key using using_data, /// keep(match) assert(match using)
                Finally, to list the variables in common (both with their original name and with "_master" added):

                ds`common'
                However, when common variables are many, this check can be very time-consuming.

                You may also find the SSC program cfout helpful, which compares two datasets and outputs a dataset of differences.
                Another interesting command is "cf3" (also meant to compare two datasets). However, when we use a m:1 or a 1:m matching, we must replicate each observation i on the smaller dataset m_i times, with m_i being the number of observations in the larger dataset corresponding to observation i in the smaller one. For example:

                use master_data
                merge m:1 merge_key using using.data, update replace
                save using_data_large

                set more off

                use master_data, clear
                cf3 _all using using_data_large, id (identifier)

                Comment

                Working...
                X