Merging datasets

Tobias Scherl

Join Date: Jul 2015
Posts: 29

15 Jul 2015, 14:41

Dear all,

I am computing some diversity indices and for that issue I need to match two datasets. The first one contains all the different nationalities of people which are living in different counties (dataset1.dta). It looks like the following example:

County	Origin_1	Origin_2	Share_1	Share_2	d
1	A	B	0.2	0.6	AB
1	A	C	0.2	0.1	AC
1	A	D	0.2	0.1	AD
1	B	C	0.6	0.1	BC
1	B	D	0.6	0.1	BD
1	C	D	0.1	0.1	CD
2	A	B	0.4	0.3	AB
2	A	C	0.4	0.2	AC
2	A	D	0.4	0.1	AD
2	B	C	0.3	0.2	BC
2	B	D	0.3	0.1	BD
2	C	D	0.2	0.1	CD

The variable "County" is the county every person is living at. For each county all the permutations of the different populations are matched. So, origin_1 is the nationality of one part and origin_2 of the matched part to compute the Herfindahl-Hirtschmann-Index. Share_1 and Share_2 are the different shares regarding the origin of the population. d is a string variable, which contains all the permutations of the different origins of the people.

Now I would like to match it to a distance dataset (dataset2.dta), which looks like the following example:

Origin_1	Origin_2	d	distance
A	B	AB	0.3
A	C	AC	0.2
A	D	AD	0.7
B	C	BC	0.9
B	D	BD	0.1
C	D	CD	0.2

Origin_1 and Origin_2 are again the countries of origin where people were born in. d is the same like above and distance is a distance measure between two persons from different countries.

Now, I need to get a dataset, which looks like the following example (I need to compute: bysort county: Herfindahl-Hirschmann-Index= 1-total(share_1*share_2*distance)):

County	Origin_1	Origin_2	Share_1	Share_2	d	distance
1	A	B	0.1	...	AB	0.3
1	A	C	0.1	...	AC	0.2
1	A	D	...	...	AD	0.7
1	B	C	...	...	BC	0.9
1	B	D	...	...	BD	0.1
1	C	D	...	...	CD	0.2
2	A	B	...	...	AB	0.3
2	A	C	...	...	AC	0.2
2	A	D	...	...	AD	0.7
2	B	C	...	...	BC	0.9
2	B	D	...	...	BD	0.1
2	C	D	...	...	CD	0.2

I tried to match dataset1 and dataset2 by using the following command:

use "dataset1.dta", clear

sort d

merge n:n d using "dataset2.dta"

drop _merge

But using this command I do not get the dataset, I want. I've got something like that:

County	Origin_1	Origin_2	Share_1	Share_2	d	distance
1	A	B	...	...	AB	0.3
2	A	C	...	...	AC	0.2
2	B	C	...	...	BC	0.9
1	D	A	...	...	AD	0.7
2	B	D	...	...	BD	0.1
1	D	C	...	...	CD	0.2

And getting this dataset I cannot compute the Herfindahl-Hirschmann-Index, because I do not get a merged dataset for every county. Can someone help me?

Thanks in advance!

Tags: None

Robert Picard

Join Date: Mar 2014
Posts: 1536

15 Jul 2015, 16:08

Here's how you can merge the data you show. Note that using dataex from SSC to present data examples helps in both understanding the question and presenting a possible solution.

Code:

clear
input byte County str1(Origin_1 Origin_2) float(Share_1 Share_2) str2 d
1 "A" "B" .2 .60000002 "AB" 
1 "A" "C" .2 .1 "AC" 
1 "A" "D" .2 .1 "AD" 
1 "B" "C" .60000002 .1 "BC" 
1 "B" "D" .60000002 .1 "BD" 
1 "C" "D" .1 .1 "CD" 
2 "A" "B" .40000001 .30000001 "AB" 
2 "A" "C" .40000001 .2 "AC" 
2 "A" "D" .40000001 .1 "AD" 
2 "B" "C" .30000001 .2 "BC" 
2 "B" "D" .30000001 .1 "BD" 
2 "C" "D" .2 .1 "CD" 
end
tempfile f
save "`f'"

clear
input str1(Origin_1 Origin_2) str2 d float distance
"A" "B" "AB" .30000001 
"A" "C" "AC" .2 
"A" "D" "AD" .69999999 
"B" "C" "BC" .89999998 
"B" "D" "BD" .1 
"C" "D" "CD" .2 
end

* check that d uniquely identifies observations
isid d, sort

merge 1:m d using "`f'", keep(match using) nogen
sort County d
list, sepby(County) noobs

Comment

Tobias Scherl

Join Date: Jul 2015

Posts: 29
#3

17 Jul 2015, 08:41

Thanks again, Robert!
Ok, next time I try to use dataex!

Have a nice weekend,

Hans
Comment

Announcement