Merge observations by pairs

Gin Ayra

Join Date: Dec 2021
Posts: 6

Merge observations by pairs

06 May 2022, 05:35

Hello,

I'm cleaning a huge dataset and I'm stucked in one point. I've find similar problems and nice answers in here but the structures are slightly different from mine and I can not apply this to my issue.

I have data on trade for lots of countries and years. My data for now looks like this (numbers are fake):

	country_1	country_2	year	imp	exp	"other variables"
1	greece	united_kingdom	1857	124353.66	.	XX
2	greece	united_kingdom	1857	.	4659756.67	XX
3	greece	united_kingdom	1858	574974.86	.	XX
4	greece	united_kingdom	1858	.	658796.34	XX
......	..............	.....................	.....	.................	...............	.............
99	greece	belgium	1857	56746.67	.	XX
100	greece	belgium	1857	.	45675.67	XX

And I want to merge/combine all the pairs of variables that have the same contry 1, country 2 and year, which have the same values for all the variables but "imp" and "exp", which each ones take . (missing value) in one case. So, I want to have something like this:

	country_1	country_2	year	imp	exp	"other variables"
1	greece	united_kingdom	1857	124353.66	4659756.67	XX
2	greece	united_kingdom	1858	574974.86	658796.34	XX
.....	..............	.....................	.....	.................	...............	XX
50	greece	belgium	1857	56746.67	45675.67	XX

And therefore, have the half of the observations than before.

Many thanks for your help!!

Last edited by Gin Ayra; 06 May 2022, 05:49.

Tags: data

Andrew Musau

Join Date: Oct 2014

Posts: 10482
#2

06 May 2022, 05:43

Code:

collapse (firstnm) imp exp other_variables, by(country_1 country_2 year)
Comment
Gin Ayra

Join Date: Dec 2021

Posts: 6
#3

06 May 2022, 05:55

It worked! Many thanks
Comment

Nick Cox

Join Date: Mar 2014
Posts: 36058

06 May 2022, 05:58

This isn't a merge in the sense of Stata's command merge. It could be a collapse. But for data as you describe. there is a simple trick. For paired observations under by: observation 3 - _n is observation 2 if we are looking at observation 1, and vice versa. So that's generic syntax for "look in the twin observation".

Thus consider this example:

Code:

clear
input str9 country_1 str9 country_2 year import export x y 
"Freedonia" "Sylvania"  1856    1  .   42  666
"Freedonia" "Sylvania"  1856    .  2   42  666
"Sylvania" "Freedonia"  1857    3  .    7   11
"Sylvania" "Freedonia"  1857    .  4    7   11
end 

bysort country_1 country_2 year : replace import = import[3 - _n] if missing(import)
bysort country_1 country_2 year : replace export = export[3 - _n] if missing(export)

list 

duplicates drop 

list 


    +-----------------------------------------------------------+
     | country_1   country_2   year   import   export    x     y |
     |-----------------------------------------------------------|
  1. | Freedonia    Sylvania   1856        1        2   42   666 |
  2. |  Sylvania   Freedonia   1857        3        4    7    11 |
     +-----------------------------------------------------------+

I didn't use your data example -- which is clear but still needs too much editing as you didn't use dataex to show an example. See https://www.statalist.org/forums/help#stata for the longstanding request to use that command.

See https://www.stata-journal.com/articl...article=dm0043 for the main idea here.

For a paranoid check run this first

Code:

bysort country_1 country_2 year : assert _N == 2

If that fails, you should do this

Code:

 
bysort country_1 country_2 year : gen nobs = _N 
edit if nobs != 2

Announcement

Merge observations by pairs

Comment

Comment

Comment