Problem with Merging Panel Data (Country Year)

Roger More

Join Date: Jul 2017
Posts: 59

Problem with Merging Panel Data (Country Year)

27 Jul 2017, 17:22

Dear Stata List Members,

I wanted to combine variables of two dataset in one big dataset. Upon researching, I think we have to use the merge command. I have implemented it with sample data in help files but I am having difficulties merge two panel datasets both in long format in my context.

I get the error "variables CountryCode Year do not uniquely identify observations in the using data". I have read the help documentation of Stata on merge and I think I just need a 1:1 merge with CountryCode and Year being the common identifiers. I also have a hunch that the problem is because the countries are repeated and that is causing the problem. However, I am unable over come the problem to obtain what I want.

My "using" dataset sample is as follows:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str52 CountryName str3 CountryCode int Year double(AG_AGR_TRAC_NO_wb AG_CON_FERT_PT_ZS_wb)
"Ecuador" "ECU" 1969 2900 .
"Ecuador" "ECU" 1970 3100 .
"Ecuador" "ECU" 1971 3400 .
"Ecuador" "ECU" 1972 3700 .
"Ecuador" "ECU" 1973 4200 .
"Ecuador" "ECU" 1974 4945 .
end

Sample "Master" Dataset sample is as follows:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str3 CountryCode str34 CountryName str29 currency_unit int Year float(rgdpe_pwt rgdpo_pwt) double pop_pwt
"ECU" "Ecuador" "US Dollar" 1964 14246.126 14603.285 5.079177674303292
"ECU" "Ecuador" "US Dollar" 1965 14654.083  15095.12  5.23261926779025
"ECU" "Ecuador" "US Dollar" 1966 15186.637  15549.03 5.390954271694549
"ECU" "Ecuador" "US Dollar" 1967 16306.564 16674.877 5.554325046865233
"ECU" "Ecuador" "US Dollar" 1968 17071.074 17443.906 5.722517076954425
"ECU" "Ecuador" "US Dollar" 1969 17508.463 17877.787 5.895319745911485
"ECU" "Ecuador" "US Dollar" 1970 18884.623 19135.959           6.07252
"ECU" "Ecuador" "US Dollar" 1971 19630.855 20283.936          6.248831
"ECU" "Ecuador" "US Dollar" 1972 20757.283  22304.63 6.428706999999999
end

I have tried the following code but get the error ""variables CountryCode Year do not uniquely identify observations in the using data":

Code:

use ".\pwt90.dta", clear
merge m:1 CountryCode Year using ".\FullWorldBankinStata.dta"  and merge 1:1 CountryCode Year using ".\FullWorldBankinStata.dta"

So, how might I merge the variables of two datasets in one dataset according to country year in this case? Thank you in advance for you help.

Cheers,
Roger

Last edited by Roger More; 27 Jul 2017, 17:27.

Tags: None

Friedrich Huebler

Join Date: Apr 2014
Posts: 1053

27 Jul 2017, 19:52

Thank you for using dataex. The excerpts that you shared are apparently not representative of your full datasets because in the excerpts the variables CountryCode and Year do uniquely identify all observations.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str52 CountryName str3 CountryCode int Year double(AG_AGR_TRAC_NO_wb AG_CON_FERT_PT_ZS_wb)
"Ecuador" "ECU" 1969 2900 .
"Ecuador" "ECU" 1970 3100 .
"Ecuador" "ECU" 1971 3400 .
"Ecuador" "ECU" 1972 3700 .
"Ecuador" "ECU" 1973 4200 .
"Ecuador" "ECU" 1974 4945 .
end

save "FullWorldBankinStata.dta", replace

* Example generated by -dataex-. To install: ssc install dataex
clear
input str3 CountryCode str34 CountryName str29 currency_unit int Year float(rgdpe_pwt rgdpo_pwt) double pop_pwt
"ECU" "Ecuador" "US Dollar" 1964 14246.126 14603.285 5.079177674303292
"ECU" "Ecuador" "US Dollar" 1965 14654.083  15095.12  5.23261926779025
"ECU" "Ecuador" "US Dollar" 1966 15186.637  15549.03 5.390954271694549
"ECU" "Ecuador" "US Dollar" 1967 16306.564 16674.877 5.554325046865233
"ECU" "Ecuador" "US Dollar" 1968 17071.074 17443.906 5.722517076954425
"ECU" "Ecuador" "US Dollar" 1969 17508.463 17877.787 5.895319745911485
"ECU" "Ecuador" "US Dollar" 1970 18884.623 19135.959           6.07252
"ECU" "Ecuador" "US Dollar" 1971 19630.855 20283.936          6.248831
"ECU" "Ecuador" "US Dollar" 1972 20757.283  22304.63 6.428706999999999
end

save "pwt90.dta", replace

The two datasets can be merged without an error message.

Code:

. use "pwt90.dta"
. merge 1:1 CountryCode Year using "FullWorldBankinStata.dta"
(note: variable CountryName was str34, now str52 to accommodate using data's values)

    Result                           # of obs.
    -----------------------------------------
    not matched                             7
        from master                         5  (_merge==1)
        from using                          2  (_merge==2)

    matched                                 4  (_merge==3)
    -----------------------------------------

Try this command to identify problematic observations:

Code:

duplicates list CountryCode Year

Comment

Roger More

Join Date: Jul 2017
Posts: 59

28 Jul 2017, 06:09

Dear Fredrich,

Thanks for your reply. In the master data I get that there are no duplicates.

Code:

. merge 1:1 CountryCode Year using ".\FullWorldBankinStata.dta"

variables CountryCode Year do not uniquely identify observations in the using data
r(459);

end of do-file

. duplicates list CountryCode Year

Duplicates in terms of CountryCode Year

(0 observations are duplicates)

Now, below I try to post larger sample of data or rather more representative data of what I have:

Master:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str3 CountryCode str52 CountryName str29 currency_unit int Year float(rgdpe_pwt rgdpo_pwt) double pop_pwt
"ARM" "Armenia"             "Armenian Dram"         2011  23376.14  25124.58           2.967984
"ARM" "Armenia"             "Armenian Dram"         2012  24321.04  26265.66           2.978339
"ARM" "Armenia"             "Armenian Dram"         2013 25079.193   26804.6 2.9921919999999997
"ARM" "Armenia"             "Armenian Dram"         2014  25810.71 27573.115           3.006154
"ATG" "Antigua and Barbuda" "East Caribbean Dollar" 1950         .         .                  .
"ATG" "Antigua and Barbuda" "East Caribbean Dollar" 1951         .         .                  .
"ATG" "Antigua and Barbuda" "East Caribbean Dollar" 1952         .         .                  .
end

Using Data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str52 CountryName str3 CountryCode int Year double(AG_AGR_TRAC_NO_wb AG_CON_FERT_PT_ZS_wb AG_CON_FERT_ZS_wb)
"Armenia"        "ARM" 2010 . . 24.202898550724637
"Armenia"        "ARM" 2011 . . 27.368655387355297
"Armenia"        "ARM" 2012 . .   33.7042818911686
"Armenia"        "ARM" 2013 . .  45.19634091923248
"Armenia"        "ARM" 2014 . .  42.26815642458101
"Armenia"        "ARM" 2015 . .                  .
"Armenia"        "ARM" 2016 . .                  .
"American Samoa" "ASM" 1960 . .                  .
"American Samoa" "ASM" 1961 3 .                  .
"American Samoa" "ASM" 1962 3 .                  .
"Cayman Islands" "CYM" 2015    . . .
"Cayman Islands" "CYM" 2016    . . .
"Cyprus"         "CYP" 1960    . . .
"Cyprus"         "CYP" 1961 2880 . .
"Cyprus"         "CYP" 1962 3600 . .
"Cyprus"         "CYP" 1963 3990 . .
"Cyprus"         "CYP" 1964 4574 . .
"Cyprus"         "CYP" 1965 4718 . .
end

However, I do get duplicates in using data:

Code:

Output
Duplicates in terms of CountryCode Year

  +----------------------------------+
  | group:    obs:   Count~de   Year |
  |----------------------------------|
  |      1    2794        CYP   1960 |
  |      1    5644        CYP   1960 |
  |      2    2795        CYP   1961 |
  |      2    5645        CYP   1961 |
  |      3    2796        CYP   1962 |
.
.
.
  |----------------------------------|
  |    113   11855        PAN   2015 |
  |    114   11058        PAN   2016 |
  |    114   11856        PAN   2016 |
  +----------------------------------+

If I try dropping duplicates, I get no obseration dropped:

Code:

. duplicates drop

Duplicates in terms of all variables

(0 observations are duplicates)

I am not sure how to proceed from here.

Thank you again for your help so far.

Cheers,
Roger

Last edited by Roger More; 28 Jul 2017, 06:22.

Comment

Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#4

28 Jul 2017, 07:43

If in theory there should only be one observation per country and year you have to drop the additional observations. You have to decide which data you want to keep. For example, you could look at the observations reported by duplicates list and see if only one of them has complete data. You could also ask the data provider about the duplicate observations.
Comment
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#5

28 Jul 2017, 11:38

This isn't just a few duplicates in a country-year dataset. You posted:
Duplicates in terms of CountryCode Year +----------------------------------+ | group: obs: Count~de Year | |----------------------------------| | 1 2794 CYP 1960 | | 1 5644 CYP 1960 | | 2 2795 CYP 1961 | says that you have thousands of duplicates for a great many countrycode-year pairs. This suggests that the data are not what you intended at all. You need to look carefully at the using data set. It may be that you need to collapse the data by country-year or something else all together. When you have 5600 "duplicate" observations, it is likely that you don't want to just pick one.
Comment

Friedrich Huebler

Join Date: Apr 2014
Posts: 1053

28 Jul 2017, 12:45

Originally posted by Phil Bromiley View Post

...you have thousands of duplicates for a great many countrycode-year pairs.

Based on the output below there are 114 observations with one or more copies.

Code:

Duplicates in terms of CountryCode Year

  +----------------------------------+
  | group:    obs:   Count~de   Year |
  |----------------------------------|
  |      1    2794        CYP   1960 |
  |      1    5644        CYP   1960 |
  |      2    2795        CYP   1961 |
  |      2    5645        CYP   1961 |
  |      3    2796        CYP   1962 |
.
.
.
  |----------------------------------|
  |    113   11855        PAN   2015 |
  |    114   11058        PAN   2016 |
  |    114   11856        PAN   2016 |
  +----------------------------------+

Comment

Roger More

Join Date: Jul 2017
Posts: 59

28 Jul 2017, 13:28

Dear Professor Fredrich and Phil.

Thanks a lot! I indeed have 114 observation duplicates.

Upon looking at these observations carefully, I find that the higher numbered observation i.e. _n == 5644 (missing) vis-a-vis 2794 (non-missing what I want to keep) 5645 (missing that I want to drop) vis-a-vis 2796 (non-missing, I want to keep) etc

Code:

Stata Output and Code

. list CountryName CountryCode Year ER_FSH_CAPT_MT_wb if _n == 2794  , clean

         Count~me   Count~de   Year   E~CAPT~b  
 2794.     Cyprus        CYP   1960        500  

. list CountryName CountryCode Year ER_FSH_CAPT_MT_wb if _n == 5644, clean

         Count~me   Count~de   Year   E~CAPT~b  
 5644.     Cyprus        CYP   1960          .  

.
. list CountryName CountryCode Year ER_FSH_CAPT_MT_wb if _n == 5645  , clean

         Count~me   Count~de   Year   E~CAPT~b  
 5645.     Cyprus        CYP   1961          .  

. list CountryName CountryCode Year ER_FSH_CAPT_MT_wb if _n == 2796, clean

         Count~me   Count~de   Year   E~CAPT~b  
 2796.     Cyprus        CYP   1962        600

How, can I drop these missing observations i.e. all observations that are missing e.g. 5644, 5645, etc? Is there a way not to do this manually in data editor? Thank you.

Kind Regards,
Roger

Last edited by Roger More; 28 Jul 2017, 13:33.

Comment

Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#8

28 Jul 2017, 14:39

You can take advantage of the fact that Stata considers missing values greater than any number. Try the commands below to see if this drops only the 114 observations with missing data for ER_FSH_CAPT_MT_wb.

Code:

sort CountryCode Year ER_FSH_CAPT_MT_wb by CountryCode Year: drop if _n==2
Comment
Roger More

Join Date: Jul 2017

Posts: 59
#9

29 Jul 2017, 05:27

Dear Professor Friedrich,

. It dose work!! Thank you!! However, I would like to understand the following line more what happened exactly. So first you sorted the data to make missing always the second observation since missing is always large than any other number and then you dropped every second observation through "drop if _n==2".
Is this logic correct?

Secondly, why do we need to do "by CountryCode Year:" exactly?
Thank you again.

Cheers,
Roger
Comment
Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#10

29 Jul 2017, 09:15

Your understanding of the first command is correct. The by prefix is needed to group the observations.

Code:

help by

I am not a professor.
Comment
Roger More

Join Date: Jul 2017

Posts: 59
#11

29 Jul 2017, 12:51

Thank you very much Mr Friedrich!
Comment

Announcement

Problem with Merging Panel Data (Country Year)

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment