Merging datasets, cannot find proper key varible

Santiago Molano

Join Date: Jun 2023

Posts: 25
#1

Merging datasets, cannot find proper key varible

04 Jun 2023, 10:17

Hello! I have been attempting to merge two datasets for a while now, namely the NAVCO2.1 dataset with Historical Index of Ethnic Fractionalization dataset. My goal here is to add the efindex variable to the NAVCO2.1 dataset. The problem is that the former dataset only has two other variables which are country and year, neither of which uniquely identifies the efindex observations. In the case of my other dataset, country and year also do not uniquely identify observations so m:1 1:1 and 1:m merges havent worked thus far. For these reasons I would like to ask for some help on how to add the efindex variable to the NAVCO 2.1 dataset. Here are some screenshots for more detail:

Attached Files
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

04 Jun 2023, 10:33

In the screenshots of the two data sets you show, the variables country and year do uniquely identify observations in both data sets. So, for what you have shown, I would expect -merge 1:1 country year - to work. Please post back with new examples where country and year do not uniquely identify observations, and, of course, include as many variables as possible so we can hunt for possible solutions to your problem.

Code:

duplicates tag country year, gen(flag) browse if flag

will enable you to rapidly identify these.

In finding such examples, you may discover that they are errors in data management such as the inappropriate repetition of the same observation multiple times, or worse, the inclusion of multiple observations with the same country and year but contradictory information on other variables. Or sometimes they turn out to be a batch of empty observations that have missing values in all variables, including country and year. In that case, the solution to your problem will consist of correcting those data errors.

When doing that, please DO NOT use screenshots. Screenshots cannot be imported into Stata for development and testing of code. The helpful way to show example data is with the -dataex- command. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment
Santiago Molano

Join Date: Jun 2023

Posts: 25
#3

04 Jun 2023, 11:08

Originally posted by Clyde Schechter View Post

In the screenshots of the two data sets you show, the variables country and year do uniquely identify observations in both data sets. So, for what you have shown, I would expect -merge 1:1 country year - to work. Please post back with new examples where country and year do not uniquely identify observations, and, of course, include as many variables as possible so we can hunt for possible solutions to your problem.

Code:

duplicates tag country year, gen(flag) browse if flag

will enable you to rapidly identify these.

In finding such examples, you may discover that they are errors in data management such as the inappropriate repetition of the same observation multiple times, or worse, the inclusion of multiple observations with the same country and year but contradictory information on other variables. Or sometimes they turn out to be a batch of empty observations that have missing values in all variables, including country and year. In that case, the solution to your problem will consist of correcting those data errors.

When doing that, please DO NOT use screenshots. Screenshots cannot be imported into Stata for development and testing of code. The helpful way to show example data is with the -dataex- command. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Thank you for your reply! I provide an example from the Historical Index of Ethnic Fractionalization here including the flag variable:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str37 country int year float efindex "United Kingdom" 1945 .309 "United Kingdom" 1946 .309 "United Kingdom" 1947 .309 "United Kingdom" 1948 .309 "United Kingdom" 1949 .309 "United Kingdom" 1950 .309 "United Kingdom" 1951 .309 "United Kingdom" 1952 .309 "United Kingdom" 1953 .309 "United Kingdom" 1954 .309 end

This is what I obtain when I attempt to merge the datasets
merge 1:1 year country using "C:\Users\smola\OneDrive\Desktop\Stata\Dataset s Auth Con\HIEF.dta"
variables year country do not uniquely identify observations in the master data
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#4

04 Jun 2023, 12:01

This is what I obtain when I attempt to merge the datasets
merge 1:1 year country using "C:\Users\smola\OneDrive\Desktop\Stata\Dataset s Auth Con\HIEF.dta"
variables year country do not uniquely identify observations in the master data

OK, so the problem lies somewhere in HIEF.dta. But the example you show does not illustrate the problem--in the example it is definitely the case that country and year uniquely identify observations.

Run this:

Code:

use "C:\Users\smola\OneDrive\Desktop\Stata\Dataset s Auth Con\HIEF.dta", clear duplicates tag year country, gen(flag) browse if flag

and see what that turns up. Since this data set contains only country, year, and efindex, any occurrence of two or more observations having the same values for country and year must necessarily be a data error. A country can only have one efindex value in any given year, right? So if you have two different observations of country and year they are either redundant (if the observations show the same value of efindex), empty (a block of observations with country or year missing), or contradictory (if they give different values of efindex).

For the redundant ones, it is simple enough to get rid of them with -duplicates drop- (N.B. Do NOT use the -force option on this command.) But a better idea is to review how this data set was created: clearly there was some error in the data management that let duplicates slip in. (Perhaps they were there in the original source data you started from, in which case the error was somebody else's and you should let them know about it.) And where there is one error discovered, others may lurk unseen. So do a thorough review and fix whatever errors you turn up.

A block of completely empty (both country and year and all other variables are missing) observations is easy enough to just -drop-, but, again, it is better to trace the data management back to find out how they got there in the first place, and fix whatever errors you find. Partially empty observations are more problematic: they indicate a more serious problem, because you need to find out what the appropriate replacement values for the missing variables would be, and, again, find out how the problem arose in the first place.

For contradictory pairs, matters are more complicated as you have to find out which of the contradictory values of efindex, if any, is correct, and then fix up one of the observations and eliminate the others. Again, though, before just patching the data, it is best to trace back the data management that created it and fix the errors that led to the problem in the first place. And again, if the errors trace back to the original source data, then you should notify the provider of that data and ask them to fix things.
Comment
Santiago Molano

Join Date: Jun 2023

Posts: 25
#5

04 Jun 2023, 12:30

Originally posted by Clyde Schechter View Post

OK, so the problem lies somewhere in HIEF.dta. But the example you show does not illustrate the problem--in the example it is definitely the case that country and year uniquely identify observations.

Run this:

Code:

use "C:\Users\smola\OneDrive\Desktop\Stata\Dataset s Auth Con\HIEF.dta", clear duplicates tag year country, gen(flag) browse if flag

and see what that turns up. Since this data set contains only country, year, and efindex, any occurrence of two or more observations having the same values for country and year must necessarily be a data error. A country can only have one efindex value in any given year, right? So if you have two different observations of country and year they are either redundant (if the observations show the same value of efindex), empty (a block of observations with country or year missing), or contradictory (if they give different values of efindex).

For the redundant ones, it is simple enough to get rid of them with -duplicates drop- (N.B. Do NOT use the -force option on this command.) But a better idea is to review how this data set was created: clearly there was some error in the data management that let duplicates slip in. (Perhaps they were there in the original source data you started from, in which case the error was somebody else's and you should let them know about it.) And where there is one error discovered, others may lurk unseen. So do a thorough review and fix whatever errors you turn up.

A block of completely empty (both country and year and all other variables are missing) observations is easy enough to just -drop-, but, again, it is better to trace the data management back to find out how they got there in the first place, and fix whatever errors you find. Partially empty observations are more problematic: they indicate a more serious problem, because you need to find out what the appropriate replacement values for the missing variables would be, and, again, find out how the problem arose in the first place.

For contradictory pairs, matters are more complicated as you have to find out which of the contradictory values of efindex, if any, is correct, and then fix up one of the observations and eliminate the others. Again, though, before just patching the data, it is best to trace back the data management that created it and fix the errors that led to the problem in the first place. And again, if the errors trace back to the original source data, then you should notify the provider of that data and ask them to fix things.

I ran the recommended commands, here is what I obtained:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str37 country int year float efindex "Rwanda" 1991 .228 "Rwanda" 1991 .228 "Lesotho" 1978 .274 "Lesotho" 1978 .274 "Haiti" 1978 .107 "Haiti" 1978 .107 "Honduras" 1978 .175 "Honduras" 1978 .175 "Honduras" 1981 .179 "Honduras" 1981 .179 "Paraguay" 2000 .158 "Paraguay" 2000 .158 "Argentina" 1983 .097 "Argentina" 1983 .097 "Argentina" 2000 .129 "Argentina" 2000 .129 "Argentina" 2000 .129 "Uruguay" 2006 .18 "Uruguay" 2006 .18 "Turkey" 1965 .151 "Turkey" 1965 .151 "Lebanon" 1993 .13 "Lebanon" 1993 .13 "Jordan" 1982 .035 "Jordan" 1982 .035 "Jordan" 2000 .039 "Jordan" 2000 .039 "Jordan" 2000 .039 "Israel" 1983 .28 "Israel" 1983 .28 "Israel" 1986 .289 "Israel" 1986 .289 "Republic of Korea" 1970 0 "Republic of Korea" 1970 0 "Republic of Korea" 2000 .017 "Republic of Korea" 2000 .017 "Republic of Korea" 2000 .017 "Japan" 1984 .013 "Japan" 1984 .013 "Sri Lanka" 1953 .467 "Sri Lanka" 1953 .467 "Sri Lanka" 1971 .423 "Sri Lanka" 1971 .423 "Sri Lanka" 1971 .423 end

Seems like the problem consists of repeated observations. If I understand correctly, what I should do now is just dropping the repeated obs.
Comment
Santiago Molano

Join Date: Jun 2023

Posts: 25
#6

04 Jun 2023, 12:45

Originally posted by Clyde Schechter View Post

OK, so the problem lies somewhere in HIEF.dta. But the example you show does not illustrate the problem--in the example it is definitely the case that country and year uniquely identify observations.

Run this:

Code:

use "C:\Users\smola\OneDrive\Desktop\Stata\Dataset s Auth Con\HIEF.dta", clear duplicates tag year country, gen(flag) browse if flag

and see what that turns up. Since this data set contains only country, year, and efindex, any occurrence of two or more observations having the same values for country and year must necessarily be a data error. A country can only have one efindex value in any given year, right? So if you have two different observations of country and year they are either redundant (if the observations show the same value of efindex), empty (a block of observations with country or year missing), or contradictory (if they give different values of efindex).

For the redundant ones, it is simple enough to get rid of them with -duplicates drop- (N.B. Do NOT use the -force option on this command.) But a better idea is to review how this data set was created: clearly there was some error in the data management that let duplicates slip in. (Perhaps they were there in the original source data you started from, in which case the error was somebody else's and you should let them know about it.) And where there is one error discovered, others may lurk unseen. So do a thorough review and fix whatever errors you turn up.

A block of completely empty (both country and year and all other variables are missing) observations is easy enough to just -drop-, but, again, it is better to trace the data management back to find out how they got there in the first place, and fix whatever errors you find. Partially empty observations are more problematic: they indicate a more serious problem, because you need to find out what the appropriate replacement values for the missing variables would be, and, again, find out how the problem arose in the first place.

For contradictory pairs, matters are more complicated as you have to find out which of the contradictory values of efindex, if any, is correct, and then fix up one of the observations and eliminate the others. Again, though, before just patching the data, it is best to trace back the data management that created it and fix the errors that led to the problem in the first place. And again, if the errors trace back to the original source data, then you should notify the provider of that data and ask them to fix things.

I tried merging them again but obtained the same error message. Couldnt the problem be that for example, the efindex value .309 is repeated for the country United Kingdom across various years? In this case country and year would be having the efindex value of .309 repeated times. Perhaps this could be the cause?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#7

04 Jun 2023, 13:19

Well, yes, the output you show in #5 consists entirely of redundant observations. While I can't emphasize enough that I recommend you go back and review the data management that created this set, the quick fix is, indeed, -duplicates drop-.

If what you show in #5 is the complete output of the code in #4, I can see no reason that

Code:

use "C:\Users\smola\OneDrive\Desktop\Stata\Dataset s Auth Con\HIEF.dta", clear duplicates drop merge 1:1 country year using other_data_set

would give you the same error message. Perhaps the error message now says that country and year do not uniquely identify observations in the other data set?

I can see where that would arise by looking at your screenshot of the other data set in #1. Each observation there has, in addition to country and year, variables id and target which change within the series of observations for Western Sahara. In what you show, the years do not overlap for the different values of id or target, but perhaps in the full set they do, and perhaps in this data set, the unit of observation is not the country-year, but rather some smaller subunit within the country (perhaps that is what id or target is indicating). That would seem to be legitimate duplication of country-year observations. So if that is, in fact, what is going on:

Code:

use "C:\Users\smola\OneDrive\Desktop\Stata\Dataset s Auth Con\HIEF.dta", clear duplicates drop merge 1:m country year using other_data_set

will give you a correct merge.

If, however, you are certain that the other data set should also be uniquely identified by country and year, then you have to again search for the country-year duplicates and deal with them. So, in this case, just apply the same approach I recommended in #4 to the other data set and fix up whatever is causing the problem.

Added:

Couldnt the problem be that for example, the efindex value .309 is repeated for the country United Kingdom across various years? In this case country and year would be having the efindex value of .309 repeated times. Perhaps this could be the cause?

No. As long as each observation for United Kingdom has a different year, it doesn't matter whether the value of efindex is the same in different years. -merge- only cares about unique values for the combinations of the key variables you specify in the command: everything else is unconstrained.

Last edited by Clyde Schechter; 04 Jun 2023, 13:22.
Comment
Santiago Molano

Join Date: Jun 2023

Posts: 25
#8

05 Jun 2023, 02:45

Originally posted by Clyde Schechter View Post

Well, yes, the output you show in #5 consists entirely of redundant observations. While I can't emphasize enough that I recommend you go back and review the data management that created this set, the quick fix is, indeed, -duplicates drop-.

If what you show in #5 is the complete output of the code in #4, I can see no reason that

Code:

use "C:\Users\smola\OneDrive\Desktop\Stata\Dataset s Auth Con\HIEF.dta", clear duplicates drop merge 1:1 country year using other_data_set

would give you the same error message. Perhaps the error message now says that country and year do not uniquely identify observations in the other data set?

I can see where that would arise by looking at your screenshot of the other data set in #1. Each observation there has, in addition to country and year, variables id and target which change within the series of observations for Western Sahara. In what you show, the years do not overlap for the different values of id or target, but perhaps in the full set they do, and perhaps in this data set, the unit of observation is not the country-year, but rather some smaller subunit within the country (perhaps that is what id or target is indicating). That would seem to be legitimate duplication of country-year observations. So if that is, in fact, what is going on:

Code:

use "C:\Users\smola\OneDrive\Desktop\Stata\Dataset s Auth Con\HIEF.dta", clear duplicates drop merge 1:m country year using other_data_set

will give you a correct merge.

If, however, you are certain that the other data set should also be uniquely identified by country and year, then you have to again search for the country-year duplicates and deal with them. So, in this case, just apply the same approach I recommended in #4 to the other data set and fix up whatever is causing the problem.

Added:

No. As long as each observation for United Kingdom has a different year, it doesn't matter whether the value of efindex is the same in different years. -merge- only cares about unique values for the combinations of the key variables you specify in the command: everything else is unconstrained.

It worked! Thank you for the help and have a good day!
Comment

Announcement

Merging datasets, cannot find proper key varible

Comment

Comment

Comment

Comment

Comment

Comment

Comment