How do you merge the unique ids of one dataset into another dataset?

Xiao Roland Tan

Join Date: Sep 2017

Posts: 9
#1

How do you merge the unique ids of one dataset into another dataset?

09 Sep 2017, 11:15

Hey good people,

I have two datasets. A has 2000 data points, while B has 10000 data points.

A dataset is uniquely identified with three vars (region, commune, and village), and B dataset is uniquely identified with four vars (region, commune, village, and polling station names). B cannot be uniquely identified with the three vars that A has.

Now, I'd like to create a concordance list: region/commune/village in A, and region/commune/village in B.

I am thinking about creating unique ids for A dataset first, and then apply (merge) those ids to B.
For example,

In A:
Region A, commune A, village A, unique id: 1
Region A, commune A, village B, unique id: 2

In B:

Region A, commune A, village A, polling station A, matching id: 1
Region A, commune A, village A, polling station B, matching id: 1
Region A, commune A, village B, polling station A, matching id: 2
Region A, commune A, village B, polling station B, matching id: 2

Do you guys know how to create a concordance list like that and how do you use A's unique ids to code B?

Anyone?

Thanks!

Last edited by Xiao Roland Tan; 09 Sep 2017, 11:18.
Tags: None
Rich Goldstein

Join Date: Mar 2014

Posts: 4466
#2

09 Sep 2017, 11:54

your subject line says merge but that doesn't appear, in any obvious sense, in your text; however, I do think that a 1:m or m:1 (depending on which dataset is the master) is what you want; read

Code:

help merge

my guess is that you want something like:

Code:

use B sort region commune village poll_station save B2 us A sort region commune village merge 1:m region commune village using B2

you should, of course, use your real variable names
Comment
Xiao Roland Tan

Join Date: Sep 2017

Posts: 9
#3

09 Sep 2017, 12:33

Originally posted by Rich Goldstein View Post

your subject line says merge but that doesn't appear, in any obvious sense, in your text; however, I do think that a 1:m or m:1 (depending on which dataset is the master) is what you want; read

Code:

help merge

my guess is that you want something like:

Code:

use B sort region commune village poll_station save B2 us A sort region commune village merge 1:m region commune village using B2

you should, of course, use your real variable names

Dear Rich,

Thank you for your suggestion. I also agree 1:m is a good fit, but I don't really need the final merge product.

I want the two files sharing the same ids in two separate files. Basically, I want to translate file A's unique id into code file B even though A is uniquely defined by 3 vars while B is uniquely defined by 4 vars. The ids in file B will not uniquely identify B, so we can call them "matching ids." I only need to create a concordance list between A and B. Do you think it is possible?

Thanks!

Last edited by Xiao Roland Tan; 09 Sep 2017, 12:37.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

09 Sep 2017, 12:56

If some combinations of region commune village occur in one dataset but not the other, you need to combine data from both datasets, reduce to unique combinations of region commune village, create the identifiers and then merge these back with the original datasets. Here's the general approach using made-up data:

Code:

clear
set seed 132
set obs 5
gen region = _n
expand 5
bysort region: gen commune = _n
expand 3
bysort region commune: gen village = _n
drop if runiform() < .1
save "dataset_A.dta", replace

clear
set obs 5
gen region = _n
expand 5
bysort region: gen commune = _n
expand 3
bysort region commune: gen village = _n
drop if runiform() < .2
expand 7
bysort region commune village: gen poll_id = _n
save "dataset_B.dta", replace

* reduce dataset A and B to one observation per group and append
use region commune village using "dataset_A.dta", clear
bysort region commune village: keep if _n == 1
save "dataset_A1.dta", replace
use region commune village using "dataset_B.dta", clear
bysort region commune village: keep if _n == 1

* combine observations that identify groups from both A and B
append using "dataset_A1.dta"

* remove duplicates and assign a final identifier
bysort region commune village: keep if _n == 1
gen long group_id = _n
save "region_commune_village_id.dta", replace

* return to original data and merge with new group identifiers
use "dataset_A.dta", clear
merge 1:1 region commune village using "region_commune_village_id.dta", ///
    assert(match using) keep(match) nogen
    
use "dataset_B.dta", clear
merge m:1 region commune village using "region_commune_village_id.dta", ///
    assert(match using) keep(match) nogen

Comment

Xiao Roland Tan

Join Date: Sep 2017

Posts: 9
#5

09 Sep 2017, 13:15

Originally posted by Robert Picard View Post

If some combinations of region commune village occur in one dataset but not the other, you need to combine data from both datasets, reduce to unique combinations of region commune village, create the identifiers and then merge these back with the original datasets. Here's the general approach using made-up data:

Code:

clear set seed 132 set obs 5 gen region = _n expand 5 bysort region: gen commune = _n expand 3 bysort region commune: gen village = _n drop if runiform() < .1 save "dataset_A.dta", replace clear set obs 5 gen region = _n expand 5 bysort region: gen commune = _n expand 3 bysort region commune: gen village = _n drop if runiform() < .2 expand 7 bysort region commune village: gen poll_id = _n save "dataset_B.dta", replace * reduce dataset A and B to one observation per group and append use region commune village using "dataset_A.dta", clear bysort region commune village: keep if _n == 1 save "dataset_A1.dta", replace use region commune village using "dataset_B.dta", clear bysort region commune village: keep if _n == 1 * combine observations that identify groups from both A and B append using "dataset_A1.dta" * remove duplicates and assign a final identifier bysort region commune village: keep if _n == 1 gen long group_id = _n save "region_commune_village_id.dta", replace * return to original data and merge with new group identifiers use "dataset_A.dta", clear merge 1:1 region commune village using "region_commune_village_id.dta", /// assert(match using) keep(match) nogen use "dataset_B.dta", clear merge m:1 region commune village using "region_commune_village_id.dta", /// assert(match using) keep(match) nogen

Thank you Robert and I will definitely save your code for future reference! But, in the mean time, I am afraid that I should not reduce the data points in dataset B to unique combinations of 3 vars. Otherwise, I would lose 70% of the data points in B as B is uniquely identified by 4 vars. I need that 70% of data points for analysis purposes.

I wondered whether I could do something like this? First, uniquely identify A by 3 vars. Then, create a column in B, and name it "matching ids." Then, match this column to the unique ids in A. If A and B share the same region/commune/village, then B will have the same id as A. Note this id is not unique in B as B, in most cases, has multiple data points sharing the same 3 vars.

Then, I will have a list of a concordance list: region/commune/village in A, and region/commune/village in B.

In A:
Region A, commune A, village A, unique id: 1
Region A, commune A, village B, unique id: 2

In B:

Region A, commune A, village A, polling station A, matching id: 1
Region A, commune A, village A, polling station B, matching id: 1
Region A, commune A, village A, polling station C, matching id: 1
...
Region A, commune A, village B, polling station A, matching id: 2
Region A, commune A, village B, polling station B, matching id: 2
...

Is this possible? Thanks!

Last edited by Xiao Roland Tan; 09 Sep 2017, 13:20.
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#6

09 Sep 2017, 13:21

You are not loosing anything in either dataset, the example creates a concordance table that you then merge back with each original dataset.
1 like
Comment
Xiao Roland Tan

Join Date: Sep 2017

Posts: 9
#7

09 Sep 2017, 13:25

Originally posted by Robert Picard View Post

You are not loosing anything in either dataset, the example creates a concordance table that you then merge back with each original dataset.

Oh okay Robert. Thank you very much!!!
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

09 Sep 2017, 13:41

I fail to understand what purpose is served by assigning an ID variable to region/commune/village combinations and producing the concordance. What advantage does your concordance list in post #5 provide that isn't provided by the following summary? What problem is the concordance list supposed to solve?

Code:

observations in B                                matching observations in A
region A commune A village A polling station A   region A commune A village A
region A commune A village A polling station B   region A commune A village A
region A commune A village A polling station C   region A commune A village A
...
region A commune A village B polling station A   region A commune A village B
region A commune A village B polling station B   region A commune A village B
...

With that said, what you describe in post #5 is possible. You can generate the group ID in dataset A as Robert suggested

Code:

use A
sort region commune village
generate ID = _n
save A2

and then

Code:

use B
merge m:1 region commune village using A2, keepusing(ID) keep(master match)
list region commune village pollingstation if _merge != 3 // find region/commune/village in B that were not in A
drop _merge
save B2

Comment

Xiao Roland Tan

Join Date: Sep 2017
Posts: 9

09 Sep 2017, 13:56

Originally posted by Robert Picard View Post

Code:

clear
set seed 132
set obs 5
gen region = _n
expand 5
bysort region: gen commune = _n
expand 3
bysort region commune: gen village = _n
drop if runiform() < .1
save "dataset_A.dta", replace

clear
set obs 5
gen region = _n
expand 5
bysort region: gen commune = _n
expand 3
bysort region commune: gen village = _n
drop if runiform() < .2
expand 7
bysort region commune village: gen poll_id = _n
save "dataset_B.dta", replace

* reduce dataset A and B to one observation per group and append
use region commune village using "dataset_A.dta", clear
bysort region commune village: keep if _n == 1
save "dataset_A1.dta", replace
use region commune village using "dataset_B.dta", clear
bysort region commune village: keep if _n == 1

* combine observations that identify groups from both A and B
append using "dataset_A1.dta"

* remove duplicates and assign a final identifier
bysort region commune village: keep if _n == 1
gen long group_id = _n
save "region_commune_village_id.dta", replace

* return to original data and merge with new group identifiers
use "dataset_A.dta", clear
merge 1:1 region commune village using "region_commune_village_id.dta", ///
assert(match using) keep(match) nogen

use "dataset_B.dta", clear
merge m:1 region commune village using "region_commune_village_id.dta", ///
assert(match using) keep(match) nogen

Hey Robert, I understand your approach now, and I think it is brilliant.

I have a follow up question:

What if I only want groups that are included in A have IDs?

I don't want to assign identifiers to groups that are only included in B.

Can I assign unique ids to A based on 3 vars, and than merge this list (name egion_commune_village_id.dta) of identifiers to both files, using your code:

"use "dataset_A.dta", clear
merge 1:1 region commune village using "region_commune_village_id.dta", ///
assert(match using) keep(match) nogen

use "dataset_B.dta", clear
merge m:1 region commune village using "region_commune_village_id.dta", ///
assert(match using) keep(match) nogen"

Thanks!

Last edited by Xiao Roland Tan; 09 Sep 2017, 13:58.

Comment

Robert Picard

Join Date: Mar 2014

Posts: 1536
#10

10 Sep 2017, 08:50

I've skimmed over this thread again and I'm confused. If you only want combinations that are in A to have IDs, then create the IDs in A and merge them with B:

Code:

* verify that dataset A has one observation per group use "dataset_A.dta", clear isid region commune village, sort * generate an identifier gen long group_id = _n merge 1:m region commune village using "dataset_B.dta" tab _merge

The _merge variable will show which observations comes from A only, B only, or from both.
1 like
Comment
Xiao Roland Tan

Join Date: Sep 2017

Posts: 9
#11

10 Sep 2017, 09:02

Originally posted by Robert Picard View Post

I've skimmed over this thread again and I'm confused. If you only want combinations that are in A to have IDs, then create the IDs in A and merge them with B:

Code:

* verify that dataset A has one observation per group use "dataset_A.dta", clear isid region commune village, sort * generate an identifier gen long group_id = _n merge 1:m region commune village using "dataset_B.dta" tab _merge

The _merge variable will show which observations comes from A only, B only, or from both.

Thank you Robert!!!
Comment

Announcement