Merging two panel datasets with uncommon IDs

Beri Parfait

Join Date: Jun 2019

Posts: 54
#1

Merging two panel datasets with uncommon IDs

18 Dec 2023, 07:19

Dear Stata users.

Please, I need your help with this question. I have two panel datasets, A &B.

Panel A is the master dataset and contain all variables and ID's that I need in the study.

Panel B is the secondary dataset, that contains all of the ID's and variables that I need in the study, but they are not sequential. Panel B also contains many IDs that I do not need to have in my master file. Since the dataset is so, large, I cannot sort them one after another before merging the two datasets.

I want to match these IDs in panel A with B to obtain a dataset that retains every information in panel A, but only the required IDs from B in the master dataset.

I tried

merge 1:m ID time using "C:\Users\Beri Parfait\Desktop\Data\HDI_v2.dta"

. The data merges, but mismatches the IDs as it does so sequentially.
I tried

joinby ID time using "C:\Users\Beri Parfait\Desktop\Data\HDI_v2.dta", unmatched(master)

, but there are similar mismatches with IDs. Someone (William Lisowski ) suggested this

HTML Code:

Merging two sets of panel data - Statalist

but it didn't work for me.

Please, can you advise on how I can proceed?
Thank you in advance.
Tags: data, merge, panel data
George Ford

Join Date: Aug 2014

Posts: 3177
#2

18 Dec 2023, 07:42

Are the IDs the same across the datasets? Are there variables in B that you need to match to A? Or are these two seperate panels measuring the same sort of thing?
Comment
Beri Parfait

Join Date: Jun 2019

Posts: 54
#3

18 Dec 2023, 10:52

Originally posted by George Ford View Post

Are the IDs the same across the datasets? Are there variables in B that you need to match to A? Or are these two seperate panels measuring the same sort of thing?

Thank you for replying, George. No, the IDs are not the same. These are two separate panels measuring the same thing. Is it possible for the merge command to search and match based on ID names? If so, hope you can advise.
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10242
#4

18 Dec 2023, 11:24

Your question is unclear. If you merge dataset A and dataset B, the idea is that there is some overlap in the IDs in A and B, but B contains additional variables not present in A. Or you want to update some variable values in A with those from B. On the other hand, if A and B contain the same set of variables and there is an overlap in the IDs, but these are distinct datasets (say A refers to time periods 1-2 and B refers to time periods 3-4), then you do not need merge. You need append. You can mark observations in A and keep observations in B if there is a corresponding observation in A.

Code:

use master, clear gen inA=1 append using "C:\Users\Beri Parfait\Desktop\Data\HDI_v2.dta" bys ID (inA): replace inA= inA[1] keep if inA==1

Last edited by Andrew Musau; 18 Dec 2023, 11:30.
Comment
George Ford

Join Date: Aug 2014

Posts: 3177
#5

18 Dec 2023, 13:36

That what I was thinking. What Andrew said -- append.
Comment
Beri Parfait

Join Date: Jun 2019

Posts: 54
#6

18 Dec 2023, 16:46

Originally posted by Andrew Musau View Post

Your question is unclear. If you merge dataset A and dataset B, the idea is that there is some overlap in the IDs in A and B, but B contains additional variables not present in A. Or you want to update some variable values in A with those from B. On the other hand, if A and B contain the same set of variables and there is an overlap in the IDs, but these are distinct datasets (say A refers to time periods 1-2 and B refers to time periods 3-4), then you do not need merge. You need append. You can mark observations in A and keep observations in B if there is a corresponding observation in A.

Code:

use master, clear gen inA=1 append using "C:\Users\Beri Parfait\Desktop\Data\HDI_v2.dta" bys ID (inA): replace inA= inA[1] keep if inA==1

Thank you, Andrew and George, for your help, but the code did not work for me.

Regarding the clarity of the question, Panel B contains additional IDs (there are more than 80 countries in panel B) that are not needed in the study. Once I declare the two datasets as panels, A and B obviously have different identity numbers. The idea was to extract those countries that are need from Panel B and match them with those in Panel A. I still had a situation where Zimbabwe was being matched with Hongkong, and many more. Because of time, constraints, I had to return to excel.
Comment
Devra Golbe

Join Date: Apr 2014

Posts: 170
#7

18 Dec 2023, 17:22

Given your additional information, it sounds like you want to merge rather than append. Declaring your datasets as panel should not change any identifiers. If you give us small samples of your datasets (using the dataex command) so we could see the nature of the identifying variables and the structure of the datasets it would be much easier to advise. I've never done matching in Excel, but I imagine the procedure is much the same as it is in Stata: you must identify the variables which can be used to match observations across datasets.

Devra Golbe
Professor Emerita, Dept. of Economics
Hunter College, CUNY
Comment
George Ford

Join Date: Aug 2014

Posts: 3177
#8

18 Dec 2023, 18:07

It’s an append. You’ve got to figure out how to do it cleanly. All variables must have the same name across datasets. The panel nature is irrelevant prior to append, as you’re just adding observations. Open B and drop countries you don’t want and use that.

create new ID. egen group will do. Add 10000 or some other constant to the egen group ID to B so you don’t have same iD across the 2.

is there any chance you have the same respondents across the two? If si, then duplicates tag/drop to clear that.

Last edited by George Ford; 18 Dec 2023, 18:13.
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10242
#9

18 Dec 2023, 22:19

A possibility is that you encoded the identifiers prior to merging/ appending. You should have string identifiers.

Code:

decode ID, gen(id) drop ID save master, replace

Then use "id" as an identifier across the datasets. At the end, the best way to understand your problem is to provide samples of your datasets as Devra suggests. You do this e.g. by copying and pasting the results of

Code:

dataex in 1/20

for the master and using files.
Comment

Beri Parfait

Join Date: Jun 2019
Posts: 54

#10

19 Dec 2023, 02:05

Thank you for your feedback. while I decode the identifiers, these are the samples of my datasets:
Panel A:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long ID int time float(adjustedsavingseducationexpendit schoolenrollmentprimarygrosssepr schoolenrollmentprimaryfemalegro schoolenrollmentprimarymalegross)
1 1990 1.9534476  34.57724 23.97939  44.77149
1 1991 1.9976286  36.11692 24.93005  46.87694
1 1992 2.0418096         .        .         .
1 1993 2.0859904  36.47658 18.24781  54.01172
1 1994 2.1301715   47.7861 30.87191  64.05412
1 1995 2.1743524  49.69833 32.46923  66.26698
1 1996 2.2185333         .        .         .
1 1997 2.2627144         .        .         .
1 1998 2.3068953  34.29688        .         .
1 1999 2.3510761  27.29885  4.08203  49.57469
1 2000  2.395257  22.16299        0  43.41481
1 2001  2.439438  22.90859        0  44.86144
1 2002  2.483619  75.95975 46.85711 103.83576
1 2003    2.5278  96.55368 68.62765 123.28975
1 2004  2.571981 106.25266 63.29473 147.35191
1 2005  2.616162  100.2297 73.18865 126.07778
1 2006  2.660343 103.93224 79.21624 127.53352
1 2007  2.704524 101.08646 76.40813  124.6185
1 2008  2.748705 103.41396 80.11948 125.58845
1 2009  2.792886  99.36972  78.6424 119.07595
end
format %ty time
label values ID ID
label def ID 1 "Afghanistan", modify

Panel B

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str38 country int time float humandevelopmentindex long ID
"Afghanistan" 1990 .273 1
"Afghanistan" 1991 .279 1
"Afghanistan" 1992 .287 1
"Afghanistan" 1993 .297 1
"Afghanistan" 1994 .292 1
"Afghanistan" 1995  .31 1
"Afghanistan" 1996 .319 1
"Afghanistan" 1997 .323 1
"Afghanistan" 1998 .324 1
"Afghanistan" 1999 .332 1
"Afghanistan" 2000 .335 1
"Afghanistan" 2001 .337 1
"Afghanistan" 2002 .362 1
"Afghanistan" 2003 .376 1
"Afghanistan" 2004 .392 1
"Afghanistan" 2005   .4 1
"Afghanistan" 2006 .409 1
"Afghanistan" 2007 .424 1
"Afghanistan" 2008  .43 1
"Afghanistan" 2009  .44 1
end
format %ty time
label values ID ID
label def ID 1 "Afghanistan", modify

Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10242

#11

19 Dec 2023, 02:17

Always apply encode to the combined dataset, not prior to merging.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str38 country int time float humandevelopmentindex long ID
"Afghanistan" 1990 .273 1
"Afghanistan" 1991 .279 1
"Afghanistan" 1992 .287 1
"Afghanistan" 1993 .297 1
"Afghanistan" 1994 .292 1
"Afghanistan" 1995  .31 1
"Afghanistan" 1996 .319 1
"Afghanistan" 1997 .323 1
"Afghanistan" 1998 .324 1
"Afghanistan" 1999 .332 1
"Afghanistan" 2000 .335 1
"Afghanistan" 2001 .337 1
"Afghanistan" 2002 .362 1
"Afghanistan" 2003 .376 1
"Afghanistan" 2004 .392 1
"Afghanistan" 2005   .4 1
"Afghanistan" 2006 .409 1
"Afghanistan" 2007 .424 1
"Afghanistan" 2008  .43 1
"Afghanistan" 2009  .44 1
end
format %ty time
label values ID ID
label def ID 1 "Afghanistan", modify

drop ID
tempfile using
save `using', replace


* Example generated by -dataex-. For more info, type help dataex
clear
input long ID int time float(adjustedsavingseducationexpendit schoolenrollmentprimarygrosssepr schoolenrollmentprimaryfemalegro schoolenrollmentprimarymalegross)
1 1990 1.9534476  34.57724 23.97939  44.77149
1 1991 1.9976286  36.11692 24.93005  46.87694
1 1992 2.0418096         .        .         .
1 1993 2.0859904  36.47658 18.24781  54.01172
1 1994 2.1301715   47.7861 30.87191  64.05412
1 1995 2.1743524  49.69833 32.46923  66.26698
1 1996 2.2185333         .        .         .
1 1997 2.2627144         .        .         .
1 1998 2.3068953  34.29688        .         .
1 1999 2.3510761  27.29885  4.08203  49.57469
1 2000  2.395257  22.16299        0  43.41481
1 2001  2.439438  22.90859        0  44.86144
1 2002  2.483619  75.95975 46.85711 103.83576
1 2003    2.5278  96.55368 68.62765 123.28975
1 2004  2.571981 106.25266 63.29473 147.35191
1 2005  2.616162  100.2297 73.18865 126.07778
1 2006  2.660343 103.93224 79.21624 127.53352
1 2007  2.704524 101.08646 76.40813  124.6185
1 2008  2.748705 103.41396 80.11948 125.58845
1 2009  2.792886  99.36972  78.6424 119.07595
end
format %ty time
label values ID ID
label def ID 1 "Afghanistan", modify

decode ID, gen(country)
drop ID
merge 1:1 country time using `using', keep(master match) nogen

Res.:

Code:

. order country time

. l, sepby(country)

     +---------------------------------------------------------------------------+
     |     country   time   adjust~t   school~r   school~o   school~s   humand~x |
     |---------------------------------------------------------------------------|
  1. | Afghanistan   1990   1.953448   34.57724   23.97939   44.77149       .273 |
  2. | Afghanistan   1991   1.997629   36.11692   24.93005   46.87694       .279 |
  3. | Afghanistan   1992    2.04181          .          .          .       .287 |
  4. | Afghanistan   1993    2.08599   36.47658   18.24781   54.01172       .297 |
  5. | Afghanistan   1994   2.130172    47.7861   30.87191   64.05412       .292 |
  6. | Afghanistan   1995   2.174352   49.69833   32.46923   66.26698        .31 |
  7. | Afghanistan   1996   2.218533          .          .          .       .319 |
  8. | Afghanistan   1997   2.262714          .          .          .       .323 |
  9. | Afghanistan   1998   2.306895   34.29688          .          .       .324 |
 10. | Afghanistan   1999   2.351076   27.29885    4.08203   49.57469       .332 |
 11. | Afghanistan   2000   2.395257   22.16299          0   43.41481       .335 |
 12. | Afghanistan   2001   2.439438   22.90859          0   44.86144       .337 |
 13. | Afghanistan   2002   2.483619   75.95975   46.85711   103.8358       .362 |
 14. | Afghanistan   2003     2.5278   96.55368   68.62765   123.2897       .376 |
 15. | Afghanistan   2004   2.571981   106.2527   63.29473   147.3519       .392 |
 16. | Afghanistan   2005   2.616162   100.2297   73.18865   126.0778         .4 |
 17. | Afghanistan   2006   2.660343   103.9322   79.21624   127.5335       .409 |
 18. | Afghanistan   2007   2.704524   101.0865   76.40813   124.6185       .424 |
 19. | Afghanistan   2008   2.748705    103.414   80.11948   125.5884        .43 |
 20. | Afghanistan   2009   2.792886   99.36972    78.6424    119.076        .44 |
     +---------------------------------------------------------------------------+

Comment

George Ford

Join Date: Aug 2014

Posts: 3177
#12

19 Dec 2023, 08:09

Ah, so it was a merge.
Comment
Beri Parfait

Join Date: Jun 2019

Posts: 54
#13

20 Dec 2023, 05:47

Thank you very much, Andrew. I finally got it right with your code. I appreciate.
Comment

Announcement

Merging two panel datasets with uncommon IDs

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment