How do I determine which are the problem variables after a merge with several "nonmissing conflict"s.

Cara Nichole

Join Date: Mar 2015

Posts: 14
#1

How do I determine which are the problem variables after a merge with several "nonmissing conflict"s.

09 Apr 2015, 04:49

Hi Everyone,

I just merged two files (using Stata 11 on a mac) which resulted 23 instances of "nonmissing conflict" out of 500, where _merge = 5. There were a handful of overlapping variables in both datasets.

I would like to find which were the variables that conflicted. I can weed out which observations using _merge = 5, but is there a way to determine which variables were the issue? I am assuming that the resultant file includes only values from the master dataset (I used update, but NOT replace), but I want to double check all of these values in both files since there is a conflict.

Thanks for any help!
Cara
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#2

09 Apr 2015, 10:14

So, you are correct that the merged data set will contain the values from the master data set only. Had you used the -replace- option you would have the values from the using data set only. To get them both, you need to rename them in one of the two data sets. So starting from your merged data, I would do the following steps. You don't say what the merge key variables were. For the sake of concreteness, I'll assume there was just one and I'm just going to call it merge_key. Mark up the code according to the actual variable(s) used for merging.

Code:

keep if _merge == 5 // GET THE NAMES OF VARIALBES IN THE USING DATA SET // OTHER THAN THE MERGE KEY preserve use using_data, clear ds merge_key, not local using_vars `r(varlist)' // NOW FIND VARIABLES COMMON WITH THE MASTER DATA SET restore ds merge_key, not local master_vars `r(varlist)' local common: list master_vars & using_vars // NOW RENAME THE COMMON VARIABLES IN THE MASTER DATA SET foreach v of varlist `common' { rename `v' `v'_master } // NOW DO THE MERGE AGAIN merge 1:1 /* OR m:1 OR 1:m, WHICHEVER WAS DONE ORIGINALLY */ merge_key using using_data, /// keep(match) assert(match using)

And from here you can do whatever you want to do (graphs, summary statistics, browsing, whatever) to compare each the variables listed in `common' (which now come from the using data set) with the corresponding values in the master data set bearing the same name with _master at the end.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#3

09 Apr 2015, 10:39

So, you are correct that the merged data set will contain the values from the master data set only. Had you used the -replace- option you would have the values from the using data set only. To get them both, you need to rename them in one of the two data sets. So starting from your merged data, I would do the following steps. You don't say what the merge key variables were. For the sake of concreteness, I'll assume there was just one and I'm just going to call it merge_key. Mark up the code according to the actual variable(s) used for merging.

Code:

keep if _merge == 5 // GET THE NAMES OF VARIALBES IN THE USING DATA SET // OTHER THAN THE MERGE KEY preserve use using_data, clear ds merge_key, not local using_vars `r(varlist)' // NOW FIND VARIABLES COMMON WITH THE MASTER DATA SET restore ds merge_key, not local master_vars `r(varlist)' local common: list master_vars & using_vars // NOW RENAME THE COMMON VARIABLES IN THE MASTER DATA SET foreach v of varlist `common' { rename `v' `v'_master } // NOW DO THE MERGE AGAIN merge 1:1 /* OR m:1 OR 1:m, WHICHEVER WAS DONE ORIGINALLY */ merge_key using using_data, /// keep(match) assert(match using)

And from here you can do whatever you want to do (graphs, summary statistics, browsing, whatever) to compare each the variables listed in `common' (which now come from the using data set) with the corresponding values in the master data set bearing the same name with _master at the end.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#4

09 Apr 2015, 10:50

So, you are correct that the merged data set will contain the values from the master data set only. Had you used the -replace- option you would have the values from the using data set only. To get them both, you need to rename them in one of the two data sets. So starting from your merged data, I would do the following steps. You don't say what the merge key variables were. For the sake of concreteness, I'll assume there was just one and I'm just going to call it merge_key. Mark up the code according to the actual variable(s) used for merging.

Code:

keep if _merge == 5 // GET THE NAMES OF VARIALBES IN THE USING DATA SET // OTHER THAN THE MERGE KEY preserve use using_data, clear ds merge_key, not local using_vars `r(varlist)' // NOW FIND VARIABLES COMMON WITH THE MASTER DATA SET restore ds merge_key, not local master_vars `r(varlist)' local common: list master_vars & using_vars // NOW RENAME THE COMMON VARIABLES IN THE MASTER DATA SET foreach v of varlist `common' { rename `v' `v'_master } // NOW DO THE MERGE AGAIN merge 1:1 /* OR m:1 OR 1:m, WHICHEVER WAS DONE ORIGINALLY */ merge_key using using_data, /// keep(match) assert(match using)

And from here you can do whatever you want to do (graphs, summary statistics, browsing, whatever) to compare each the variables listed in `common' (which now come from the using data set) with the corresponding values in the master data set bearing the same name with _master at the end.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#5

10 Apr 2015, 08:27

Oops! Apologies for the multiple posts! The forum server was even less responsive than usual yesterday. Normally when I get hung on the "working" message I open a new browser window with the forum. That window usually shows that my post is there. So then I just close the original window. In the past that has prevented multiple posting. But yesterday, when I opened the new windows, there was no evidence that my post had gotten through, so I tried again. Sorry, sorry, sorry. It wastes everybody's time when posts are multiplied like that.
Comment
Matthew White

Join Date: Apr 2014

Posts: 29
#6

13 Apr 2015, 07:30

You may also find the SSC program cfout helpful, which compares two datasets and outputs a dataset of differences.
Comment
Cara Nichole

Join Date: Mar 2015

Posts: 14
#7

13 Apr 2015, 07:36

Thanks Clyde!

You weren't the only one who sensed the server was less responsive than usual, hence my late reply.

I'm not sure if your suggestion 100% clears things up for me - although it did clear something else up for me, unrelated, so thanks also for that. I did merge using only 1 key variable, as you correctly guessed - what I had was:

merge 1:1 mykeyvar using anotherfile, update

What I was hoping to find was some kind of detailed output as a result of the merge saying for variable A there were conflicts at observation numbers 3,200, 450 and for variable G there were conflicts at observations 67,68, 300, etc.

I think I could re-merge the files only for the common variables and force the using value as you suggested, and then compare the two merged files - that should give me the answer I am looking for. If anyone reading this happens to know of something already written that does that please let me know because I am still curious. If not maybe I'll try to write it myself.

By the way, I also have been getting lists of the common variables between files using something I installed from ssc called cfvars -- super, super useful and it has saved me a ton of time, so thanks a ton to whoever wrote that.
Comment
Federico Tedeschi

Join Date: Mar 2015

Posts: 137
#8

16 Nov 2017, 08:55

Sorry to re-open an old topic, however I've had the same problem of Cara and the solution I found was based on this post.

Originally posted by Clyde Schechter View Post

So, you are correct that the merged data set will contain the values from the master data set only. Had you used the -replace- option you would have the values from the using data set only. To get them both, you need to rename them in one of the two data sets. So starting from your merged data, I would do the following steps. You don't say what the merge key variables were. For the sake of concreteness, I'll assume there was just one and I'm just going to call it merge_key. Mark up the code according to the actual variable(s) used for merging.

Code:

keep if _merge == 5 // GET THE NAMES OF VARIALBES IN THE USING DATA SET // OTHER THAN THE MERGE KEY preserve use using_data, clear ds merge_key, not local using_vars `r(varlist)' // NOW FIND VARIABLES COMMON WITH THE MASTER DATA SET restore ds merge_key, not local master_vars `r(varlist)' local common: list master_vars & using_vars // NOW RENAME THE COMMON VARIABLES IN THE MASTER DATA SET foreach v of varlist `common' { rename `v' `v'_master } // NOW DO THE MERGE AGAIN merge 1:1 /* OR m:1 OR 1:m, WHICHEVER WAS DONE ORIGINALLY */ merge_key using using_data, /// keep(match) assert(match using)

And from here you can do whatever you want to do (graphs, summary statistics, browsing, whatever) to compare each the variables listed in `common' (which now come from the using data set) with the corresponding values in the master data set bearing the same name with _master at the end.

However I had to make an amendment to this procedure: it seems to me that, in this way, the new master dataset also include the using one, thus the list of common variables coincide with the ones of the using dataset. I think the right procedure is this one:

use master_data
merge 1:1 /* OR m:1 OR 1:m, WHICHEVER WAS DONE ORIGINALLY */ merge_key using using_data
keep if _merge == 5
keep identifier /* in the case of 1:1 or 1:m matches, the identifier is the merge_key */
save conflicting_data

use master_data, clear
merge 1:1 identifier using conflicting_data
keep if _merge==3
drop _merge

then continuing as in the example above:

Code:

// GET THE NAMES OF VARIABLES IN THE USING DATA SET // OTHER THAN THE MERGE KEY preserve use using_data, clear ds merge_key, not local using_vars `r(varlist)' // NOW FIND VARIABLES COMMON WITH THE MASTER DATA SET restore ds merge_key, not local master_vars `r(varlist)' local common: list master_vars & using_vars // NOW RENAME THE COMMON VARIABLES IN THE MASTER DATA SET foreach v of varlist `common' { rename `v' `v'_master } // NOW DO THE MERGE AGAIN merge 1:1 /* OR m:1 OR 1:m, WHICHEVER WAS DONE ORIGINALLY */ merge_key using using_data, /// keep(match) assert(match using)

Finally, to list the variables in common (both with their original name and with "_master" added):

ds`common'

However, when common variables are many, this check can be very time-consuming.

You may also find the SSC program cfout helpful, which compares two datasets and outputs a dataset of differences.

Another interesting command is "cf3" (also meant to compare two datasets). However, when we use a m:1 or a 1:m matching, we must replicate each observation i on the smaller dataset m_i times, with m_i being the number of observations in the larger dataset corresponding to observation i in the smaller one. For example:

use master_data
merge m:1 merge_key using using.data, update replace
save using_data_large

set more off

use master_data, clear
cf3 _all using using_data_large, id (identifier)
Comment

Announcement

How do I determine which are the problem variables after a merge with several "nonmissing conflict"s.

Comment

Comment

Comment

Comment

Comment

Comment

Comment