Merging (panel) datasets

Sophie Cronk

Join Date: Jun 2022

Posts: 5
#1

Merging (panel) datasets

01 Jun 2022, 05:05

Hi,
I am currently attempting to merge two datasets - panel data - both containing a unique ID (which occurs in both datasets) and 5 years of observations for each ID. For all possible merge options such ass 1:1, m:1 and 1:m I repeatedly get back that the "variables X Y do not uniquely identify observations in the master data". How can I successfully merge these two datasets? Thanks in advance!
Tags: None
Rich Goldstein

Join Date: Mar 2014

Posts: 4464
#2

01 Jun 2022, 05:37

without data examples (use -dataex- and post in CODE blocks - see the FAQ) it is not possible to give an exact answer; one guess is the you want -joinby- rather than -merge-; see

Code:

h joinby
Comment
Ken Chui

Join Date: Aug 2014

Posts: 1058
#3

01 Jun 2022, 05:45

Welcome to Statalist.

First, I'd suggest don't gamble with 1:1, m:1, and 1:m and see which one sticks. Instead, try to understand the data structure so that you have enough confidence to say that "yes, this data is supposed to contain one unique ID per row," or "yes, this data is supposed to have multiple of the same ID across rows." Without that knowledge, it's very hard to help you.

With that said, you can examine where the duplicates are and flag those duplicates for examination. To check the extend of duplication:

Code:

duplicates report X Y

To flag them for examination:

Code:

duplicates tag X Y, gen(my_flag_variable) * my_flag_variable is the name of the flag, you can rename whatever fits

There are also other types of duplicates command as well (including one that allows you to delete duplicates). To learn more, use:

Code:

help duplicates

Lastly, a more obscure reason is having multiple rows of missing values. This often happens when importing data from Excel which might come with some empty rows. Because missing "." is considered a value, having multiple of them in your matching variables will also cause the ID to be not unique.
1 like
Comment
Sophie Cronk

Join Date: Jun 2022

Posts: 5
#4

01 Jun 2022, 06:21

Thank you very much for the quick responses! I am fairly new to Statalist and thus still figuring out how it all exactly works. Regarding the two datasets in Stata that I am trying to merge:

The first dataset contains average residential property values for a given municipality (based on their municipality code) for the years 2015 - 2020.
The second dataset contains the number of jobs for a given municipality (based on the municipality code) for the years 2015 - 2020.

Both datasets therefore have the following 'look':
Municipality code | Year | Number
X | 2015 | Number
X | 2016 | Number
X | 2017 | Number, etc.
Y | 2015 | Number
Y | 2016 | Number
Y | 2017 | Number, etc.

I am aiming to have the datasets merged in such a way that I will obtain the following:
Municipality code | Year | Average residential property value | Jobs
X | 2015 | Average residential property value | Jobs
X | 2016 | Average residential property value | Jobs
X | 2017 | Average residential property value | Jobs, etc.
Y | 2015 | Average residential property value | Jobs
Y | 2016 | Average residential property value | Jobs
Y | 2017 | Average residential property value | Jobs, etc.

I have managed to combine the two datasets with the following commands, however, the dataset I then obtain lists all the years vertically below each other which causes many duplicates.

Code:

use "file1", clear sort municipality_code year save "file1", replace use "file2", clear sort municipality_code year merge municipality_code year using "file1" drop _merge

Am hoping that this might provide you with some more insights on the data. Thanks a lot for the help!
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

01 Jun 2022, 06:48

The following example code may start you in a useful direction.

Code:

clear all
cls
// create example datasets

* Example generated by -dataex-. For more info, type help dataex
clear
input str2 code int year str13 x
"X" 2015 "Number"      
"X" 2016 "Number"      
"X" 2017 "Number, etc."
"Y" 2015 "Number"      
"Y" 2016 "Number"      
"Y" 2017 "Number, etc."
end
save "file1", replace

* Example generated by -dataex-. For more info, type help dataex
clear
input str2 code int year str36 y str11 z
"X" 2015 "Average value" "Jobs"      
"X" 2016 "Average value" "Jobs"      
"X" 2017 "Average value" "Jobs, etc."
"Y" 2015 "Average value" "Jobs"      
"Y" 2016 "Average value" "Jobs"      
"Y" 2017 "Average value" "Jobs, etc."
end
save "file2", replace

// check that identifiers are distinct in each dataset

use file1, clear
duplicates report code year
use file2, clear
duplicates report code year

// merge two datasets

use file1, clear
merge 1:1 code year using "file2"
sort code year
list, sepby(code)

Code:

. // check that identifiers are distinct in each dataset
. 
. use file1, clear

. duplicates report code year

Duplicates in terms of code year

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |            6             0
--------------------------------------

. use file2, clear

. duplicates report code year

Duplicates in terms of code year

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |            6             0
--------------------------------------

. 
. // merge two datasets
. 
. use file1, clear

. merge 1:1 code year using "file2"

    Result                      Number of obs
    -----------------------------------------
    Not matched                             0
    Matched                                 6  (_merge==3)
    -----------------------------------------

. sort code year

. list, sepby(code)

     +-----------------------------------------------------------------------+
     | code   year              x               y            z        _merge |
     |-----------------------------------------------------------------------|
  1. |    X   2015         Number   Average value         Jobs   Matched (3) |
  2. |    X   2016         Number   Average value         Jobs   Matched (3) |
  3. |    X   2017   Number, etc.   Average value   Jobs, etc.   Matched (3) |
     |-----------------------------------------------------------------------|
  4. |    Y   2015         Number   Average value         Jobs   Matched (3) |
  5. |    Y   2016         Number   Average value         Jobs   Matched (3) |
  6. |    Y   2017   Number, etc.   Average value   Jobs, etc.   Matched (3) |
     +-----------------------------------------------------------------------+

.

Announcement

Merging (panel) datasets

Comment

Comment

Comment

Comment