Problem with merge

Eugenio Cassese

Join Date: Jun 2023
Posts: 11

Problem with merge

10 Jul 2023, 16:02

I have 2 datasets, which I want to merge according to the variables codistat_municip and year

the first one is :

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long codistat_municip int year double pop_born_tot_f
1001 1991  3
1001 1992 10
1001 1993  9
1001 1994  7
1001 1995 11
1001 1996  7
1001 1997  8
1001 1998  9
1001 1999  7
1001 2000  9
1001 2001  6
1002 1991  1
1002 1992 23
1002 1993 12
1002 1994 15
1002 1995 21
1002 1996 10
1002 1997 23
1002 1998 15
1002 1999 11
1002 2000 25
1002 2001 14
1003 1991  0
1003 1992  1
1003 1993  3
1003 1994  1
1003 1995  1
1003 1996  4
1003 1997  2
1003 1998  2
1003 1999  6
1003 2000  2
1003 2001  1
1004 1991  1
1004 1992  7
1004 1993  4
1004 1994  4
1004 1995  7
1004 1996  8
1004 1997  8
1004 1998  4
1004 1999  7
1004 2000  5
1004 2001  4
1005 1991  1
1005 1992  1
1005 1993  2
1005 1994  2
1005 1995  3
1005 1996  4
1005 1997  2
1005 1998  1
1005 1999  1
1005 2000  2
1005 2001  2
1006 1991  3
1006 1992 17
1006 1993 18
1006 1994 15
1006 1995 24
1006 1996 16
1006 1997 24
1006 1998 17
1006 1999 30
1006 2000  9
1006 2001 23
1007 1991  0
1007 1992  1
1007 1993  2
1007 1994  2
1007 1995  2
1007 1996  0
1007 1997  0
1007 1998  3
1007 1999  0
1007 2000  0
1007 2001  0
1008 1991  7
1008 1992 73
1008 1993 92
1008 1994 72
1008 1995 82
1008 1996 75
1008 1997 69
1008 1998 60
1008 1999 62
1008 2000 56
1008 2001 52
1009 1991  1
1009 1992  6
1009 1993  7
1009 1994  5
1009 1995 10
1009 1996  7
1009 1997 10
1009 1998  6
1009 1999  8
1009 2000 13
1009 2001  2
1010 1991  0
end

the second one is:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long codistat_municip float year double pop_tot_0_f
1001 1982  5
1001 1983  8
1001 1984 10
1001 1985  4
1001 1986  6
1001 1987 14
1001 1988  6
1001 1989 14
1001 1990 10
1001 1991 11
1001 1992  7
1001 1993  9
1001 1994 11
1001 1995  8
1001 1996 12
1001 1997  8
1001 1998  6
1001 1999 10
1001 2000  7
1001 2001  7
1001 2002  8
1001 2003 10
1001 2004 13
1001 2005  6
1001 2006  5
1001 2007  7
1001 2008 10
1001 2009 16
1001 2010  9
1001 2011 10
1001 2012 11
1001 2013 10
1001 2014 15
1001 2015  9
1001 2016 11
1001 2017  9
1001 2018  7
1001 2019 15
1001 2020 11
1001 2021  7
1001 2022  7
1002 1982 25
1002 1983 27
1002 1984 26
1002 1985 20
1002 1986 27
1002 1987 24
1002 1988 15
1002 1989 23
1002 1990 18
1002 1991 22
1002 1992 19
1002 1993 20
1002 1994 14
1002 1995 14
1002 1996 23
1002 1997 12
1002 1998 22
1002 1999 14
1002 2000 10
1002 2001 23
1002 2002 19
1002 2003 21
1002 2004 14
1002 2005 21
1002 2006 19
1002 2007 13
1002 2008 23
1002 2009 20
1002 2010 22
1002 2011 23
1002 2012 22
1002 2013 10
1002 2014 11
1002 2015 21
1002 2016 15
1002 2017 18
1002 2018 12
1002 2019  7
1002 2020 11
1002 2021  9
1002 2022  6
1003 1982  2
1003 1983  0
1003 1984  0
1003 1985  1
1003 1986  3
1003 1987  0
1003 1988  1
1003 1989  0
1003 1990  3
1003 1991  2
1003 1992  2
1003 1993  2
1003 1994  3
1003 1995  1
1003 1996  0
1003 1997  4
1003 1998  2
1003 1999  1
end

I want to perform a merge 1:1 through the code

merge 1:1 codistat_municip year etc...

However, STATA returns the error "variables codistat_municip year do not uniquely identify observations in the
using data"

What should I do?

Tags: None

Eugenio Cassese

Join Date: Jun 2023

Posts: 11
#2

10 Jul 2023, 16:22

I actually solved the problem of unique identification (there were duplicates), but still the fact that in a dataset there are years that are not contained in the another datasets reults in the loss of a lot of information. How can I solve it?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30165
#3

10 Jul 2023, 16:32

Well, your example data do not exhibit this problem, and they -merge 1:1- quite nicely.

Presumably, however, in the full data sets, you have encountered this problem. Stata has pointed out that the data in the using data set is not what you think it is. So your first task is to find the offending observations that have the same values of codistat_municip and year. To do that:

Code:

duplicates codistat_municip year, gen(flag) browse if flag

and you will see them. Then you must inspect them and decide how to resolve the problem. There are several possibilities:
The observations are pure duplicates. That is, in addition to agreeing on codistat_municip and year, they also agree on all the other variables. This type of redundancy in a data set usually is indicative of an error made in the creation of this data set. So you should review the data management that led to the creation of this data set and fix the error that caused the erroneous inclusion of the surplus observations. You may encounter other errors in the course of doing that, and, needless to say, you should fix those as well. If you do not have access to the code that created this data set, then since the observations agree on everything, you can just eliminate them by running -duplicates drop-. (You should, however, notify whoever did create the data set that there is this problem.)

The observations disagree on other variables. This is a more serious problem because you have contradictory data on the same cities at the same time. You need to figure out how to resolve these discrepancies. It may be possible to identify one among the surplus observations as correct and remove the others. Or it may be necessary to somehow combine the data from the different observations into a single observation for the city in that year. Or it may be impossible to know what the correct values for the other variables are, in which case you really have to delete all of these observations (or replace the variables with missing values). Another possibility is that all of the apparently conflicting observations are all correct, and the observations can be distinguished by some other variable(s). So they are not really in conflict after all. For example, the different observations may refer to different parts of the city, or different months or quarters of the year, or something like that. In that case, you can do a 1:m merge and that will give you meaningful data to work with.
Comment
Eugenio Cassese

Join Date: Jun 2023

Posts: 11
#4

10 Jul 2023, 16:43

Originally posted by Clyde Schechter View Post

Well, your example data do not exhibit this problem, and they -merge 1:1- quite nicely.

Presumably, however, in the full data sets, you have encountered this problem. Stata has pointed out that the data in the using data set is not what you think it is. So your first task is to find the offending observations that have the same values of codistat_municip and year. To do that:

Code:

duplicates codistat_municip year, gen(flag) browse if flag

and you will see them. Then you must inspect them and decide how to resolve the problem. There are several possibilities:
The observations are pure duplicates. That is, in addition to agreeing on codistat_municip and year, they also agree on all the other variables. This type of redundancy in a data set usually is indicative of an error made in the creation of this data set. So you should review the data management that led to the creation of this data set and fix the error that caused the erroneous inclusion of the surplus observations. You may encounter other errors in the course of doing that, and, needless to say, you should fix those as well. If you do not have access to the code that created this data set, then since the observations agree on everything, you can just eliminate them by running -duplicates drop-. (You should, however, notify whoever did create the data set that there is this problem.)

The observations disagree on other variables. This is a more serious problem because you have contradictory data on the same cities at the same time. You need to figure out how to resolve these discrepancies. It may be possible to identify one among the surplus observations as correct and remove the others. Or it may be necessary to somehow combine the data from the different observations into a single observation for the city in that year. Or it may be impossible to know what the correct values for the other variables are, in which case you really have to delete all of these observations (or replace the variables with missing values). Another possibility is that all of the apparently conflicting observations are all correct, and the observations can be distinguished by some other variable(s). So they are not really in conflict after all. For example, the different observations may refer to different parts of the city, or different months or quarters of the year, or something like that. In that case, you can do a 1:m merge and that will give you meaningful data to work with.

I actually solved the problem of unique identification (there were duplicates), but still the fact that in a dataset there are years that are not contained in the another datasets reults in the loss of a lot of information
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30165
#5

10 Jul 2023, 17:09

I actually solved the problem of unique identification (there were duplicates), but still the fact that in a dataset there are years that are not contained in the another datasets reults in the loss of a lot of information

That is not true, unless you have asked Stata to discard information. When you -merge- two data sets in Stata, all of the information in both data sets is retained in the final result unless you specifically ask Stata to discard unmatched observations using the -keep()- option. So no data is lost by doing the -merge-. You end up with less information, perhaps, than you hoped to have, and these unmatched observations may be unusable in estimation commands you will do with the merged data, and that, is, indeed a loss of information. But -merge- itself never loses information unless you tell it to.

What can you do about the missing information in the unmatched observations? Well, depending on what these data are, there may be acceptable ways to impute the missing values. If you are planning on fitting regression models, then you might look into doing multiple imputation of the missing data, if it is reasonable to assume that the data are missing at random. See -help mi-. If you are going to do other analyses that are not supported with multiple imputation, if there are only short spells of unmatched years within the cities, and if it is reasonable to assume linear trends in the pop* variables over time, you might consider interpolation. See -help ipolate-.

The ideal solution, unfortunately, is seldom feasible, but worth mentioning anyway: get better, more complete data sets!
Comment

Announcement

Problem with merge

Comment

Comment

Comment

Comment