  • Help required with identifying observations between two datasets

    Hello,

    I am working with a dataset in which I have identified 7,286 observations. These observations belong to different households. Now, for these observations, using the Household they belong to, I want to identify the Household Head from another dataset. The other dataset has 42,151 observations, all of whom are Household Heads. I only want to keep the observations in the second dataset that belong to the same Household as the observations from the first dataset.

    I have tried using the merge command, but I keep receiving an error about observations not being uniquely identified. I am guessing this is because the first dataset has only 7,286 observations, far fewer than the 42,151 observations in the second dataset into which I want to merge the variable.

    I will put an example of the first dataset below:

    The first column/variable 'id' is the unique identification number for the individuals.
    The second column/variable "HHBASE" represents the unique Household Base ID which remains unchanged for the two datasets.

    Now, in my other dataset, I want to keep only observations with the following HHBASE values:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str12 id double HHBASE
    "101020102010" 1010201020
    "10102010205"  1010201020
    "10102010206"  1010201020
    "10102010207"  1010201020
    "10102010304"  1010201030
    "10102010305"  1010201030
    "10102010306"  1010201030
    "10102010307"  1010201030
    "10102011304"  1010201130
    "10102011305"  1010201130
    "10102011306"  1010201130
    "10102011307"  1010201130
    "10102011702"  1010201170
    "10102011703"  1010201170
    "10102011704"  1010201170
    "10102011705"  1010201170
    "10102012007"  1010201200
    "10102020103"  1010202010
    "10102020104"  1010202010
    "10102020105"  1010202010
    "10102020403"  1010202040
    "10102020405"  1010202040
    "10102020505"  1010202050
    "10102021205"  1010202120
    "10102021206"  1010202120
    "10102030105"  1010203010
    "10102030106"  1010203010
    "10102030107"  1010203010
    "10102030207"  1010203020
    "10102030208"  1010203020
    "10102030209"  1010203020
    "10102030406"  1010203040
    "10102030505"  1010203050
    "10102030506"  1010203050
    "101020306010" 1010203060
    "101020306013" 1010203060
    "101020306014" 1010203060
    "10102030605"  1010203060
    "10102030606"  1010203060
    "10102030706"  1010203070
    "10102031305"  1010203130
    "10102031306"  1010203130
    "10102031307"  1010203130
    "10102031309"  1010203130
    "101020403011" 1010204030
    "101020403012" 1010204030
    "10102040305"  1010204030
    "10102040306"  1010204030
    "10102040604"  1010204060
    "10102040605"  1010204060
    "10102040606"  1010204060
    "10102040607"  1010204060
    "10102040608"  1010204060
    "10102040904"  1010204090
    "10102040906"  1010204090
    "10102040907"  1010204090
    "101020413018" 1010204130
    "101020413019" 1010204130
    "10102041406"  1010204140
    "10102041407"  1010204140
    "10102041506"  1010204150
    "10102042003"  1010204200
    "10102051204"  1010205120
    "10102051205"  1010205120
    "10102051206"  1010205120
    "10102051207"  1010205120
    "10102051503"  1010205150
    "10102051505"  1010205150
    "10102051506"  1010205150
    "10102051507"  1010205150
    "10102051906"  1010205190
    "10102051907"  1010205190
    "10102051908"  1010205190
    "10102060603"  1010206060
    "10102060604"  1010206060
    "10102060703"  1010206070
    "10102060704"  1010206070
    "10102060803"  1010206080
    "10102060804"  1010206080
    "10102060805"  1010206080
    "10102060806"  1010206080
    "10102061005"  1010206100
    "10102061006"  1010206100
    "10102070606"  1010207060
    "10102070607"  1010207060
    "10102070608"  1010207060
    "10102071003"  1010207100
    "10102071004"  1010207100
    "10102071104"  1010207110
    "10102071105"  1010207110
    "10102071106"  1010207110
    "10102071107"  1010207110
    "10102080603"  1010208060
    "10102080604"  1010208060
    "10102080605"  1010208060
    "10102080703"  1010208070
    "10102080704"  1010208070
    "10102080705"  1010208070
    "101020808010" 1010208080
    "101020808011" 1010208080
    end

    Your guidance will be of great help. Thanks!

  • #2
    I am confused; the different number of distinct observations in each data set should not be an issue at all. Please show exactly what you typed and exactly what Stata responded with. Also, please show a -dataex- example of the other data set so people can try out their own ideas for code.



    • #3
      Hi Rich,

      Here's the dataex example for the other dataset.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str12 id double HHBASE
      "10102010201"  1010201020
      "10102010301"  1010201030
      "10102010401"  1010201040
      "10102010502"  1010201050
      "10102010601"  1010201060
      "10102010701"  1010201070
      "10102010801"  1010201080
      "10102010902"  1010201090
      "10102010903"  1010201090
      "10102011001"  1010201100
      "10102011101"  1010201110
      "10102011201"  1010201120
      "10102011301"  1010201130
      "10102011401"  1010201140
      "10102011601"  1010201160
      "10102011707"  1010201170
      "10102011803"  1010201180
      "10102011901"  1010201190
      "10102012001"  1010201200
      "10102020101"  1010202010
      "10102020201"  1010202020
      "10102020301"  1010202030
      "10102020401"  1010202040
      "10102020502"  1010202050
      "10102020603"  1010202060
      "10102020701"  1010202070
      "10102020801"  1010202080
      "10102020901"  1010202090
      "10102021001"  1010202100
      "10102021101"  1010202110
      "10102021203"  1010202120
      "10102021208"  1010202120
      "101020212015" 1010202120
      "10102021301"  1010202130
      "10102021401"  1010202140
      "10102021501"  1010202150
      "10102021601"  1010202160
      "10102021701"  1010202170
      "10102021801"  1010202180
      "10102021901"  1010202190
      "10102022001"  1010202200
      "10102030101"  1010203010
      "10102030201"  1010203020
      "10102030401"  1010203040
      "10102030501"  1010203050
      "10102030601"  1010203060
      "10102030701"  1010203070
      "10102030803"  1010203080
      "10102030805"  1010203080
      "10102030901"  1010203090
      "10102031001"  1010203100
      "10102031003"  1010203100
      "10102031101"  1010203110
      "10102031201"  1010203120
      "10102031304"  1010203130
      "10102031401"  1010203140
      "10102031501"  1010203150
      "10102031601"  1010203160
      "10102031701"  1010203170
      "10102031807"  1010203180
      "10102031903"  1010203190
      "10102032008"  1010203200
      "10102040101"  1010204010
      "10102040201"  1010204020
      "10102040301"  1010204030
      "10102040401"  1010204040
      "10102040501"  1010204050
      "10102040601"  1010204060
      "10102040701"  1010204070
      "10102040801"  1010204080
      "10102040803"  1010204080
      "10102040901"  1010204090
      "10102041001"  1010204100
      "10102041105"  1010204110
      "10102041201"  1010204120
      "10102041303"  1010204130
      "101020413011" 1010204130
      "101020413016" 1010204130
      "10102041403"  1010204140
      "10102041501"  1010204150
      "10102041601"  1010204160
      "10102041701"  1010204170
      "101020417010" 1010204170
      "10102041801"  1010204180
      "10102041901"  1010204190
      "10102042001"  1010204200
      "10102050101"  1010205010
      "10102050201"  1010205020
      "10102050301"  1010205030
      "10102050401"  1010205040
      "10102050501"  1010205050
      "10102050503"  1010205050
      "10102050601"  1010205060
      "10102050701"  1010205070
      "10102050802"  1010205080
      "10102050901"  1010205090
      "10102051001"  1010205100
      "10102051101"  1010205110
      "10102051201"  1010205120
      "10102051301"  1010205130
      end

      The dataex example you see above is from the second dataset, and all observations there are Household Heads. I only want the Household Heads that belong to the HHBASE from the first dataset.

      What I tried doing was to merge the key variable HHBASE from my first dataset (with 7,268 obs.) into the second dataset (with 42,151 obs.), and then I planned on dropping the observations that do not match. But in the first place the merge itself fails, and secondly, I am no longer sure how I would drop the non-matching HHBASE observations.

      Here's the merge code I attempted, along with Stata's response:


      Code:
      merge m:1 HHBASE using "2011.dta"
      variable HHBASE does not uniquely identify observations in the using data
      r(459)
      Hope this will provide a bit more clarity.



      • #4
        It (still) does not make sense to me either:
        • The first data set does not look like household-level data (7,268) because there are repeated HHBASE id.
        • The second data set (42,151) may look like a person-level data from the household heads because both HHBASE and id seem to have no repetitions.
        • However, how can 7,268 households have 42,151 household heads? That's, on average, more than 6 heads in a household.



        • #5
          if the above were the only variables in these data sets, you could just use the -duplicates- command to get rid of the extras and all would be fine - but presumably there are other variables of interest in each of the data sets, correct? (if there are other variables of interest in only one data set, just make a new file with only HHBASE and use -duplicates- and then merge); so, I'm still a bit confused - what do you want from each data set and how do you want to combine them? my guess is that you want -joinby- but -cross- is also possible; see
          Code:
          help duplicates
          help joinby
          help cross
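
          As a rough illustration of the -joinby- route (the file names dataset1.dta and dataset2.dta here are placeholders, not the poster's actual files); by default -joinby- keeps only the observations that match in both files:
          Code:
          * form all pairwise combinations of observations that share HHBASE;
          * observations with no match in the other file are dropped by default
          use "dataset1.dta", clear
          joinby HHBASE using "dataset2.dta"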



          • #6
            Originally posted by Ken Chui View Post
            It (still) does not make sense to me either:
            • The first data set does not look like household-level data (7,268) because there are repeated HHBASE id.
            • The second data set (42,151) may look like a person-level data from the household heads because both HHBASE and id seem to have no repetitions.
            • However, how can 7,268 households have 42,151 household heads? That's, on average, more than 6 heads in a household.

            Hi Ken,

            Yes, I understand that it is not very clear from what I have written. Also, a lot of work has already been done on the data, so some of the "backstory" may be missing.
            I will try to explain that as briefly as possible:

            - Both datasets are individual-level datasets
            - Variable "id" assigns them unique id's based on their geographical location (STATEID + DISTRICTID + NEIGHBOURHOODID) combined with (HHID + PERSONID)
            - The first dataset (7,268 obs.) is a subsample of individuals who have fulfilled certain conditions that were set; I am investigating their income.
            - The reason behind the repeated HHBASE id's is because there is a minimum of 2 siblings from each HH.
            - The second data set does not seem to have any repetitions, but there are actually a few cases that are repeated. Since most Households have 1 Head, it just appears that way.
            - You're somewhat right about the number of HHs and Heads. There aren't 7,268 distinct Households (due to repetitions), but you are completely right that they cannot have such a great number of Heads. That is exactly what I am trying to take care of.
            - The second dataset currently consists of too many Heads, including ones I am not interested in.
            - So for that reason, I want to identify the HHBASE in dataset 1 and on its basis find the respective HH Head from dataset 2.



            • #7
              Originally posted by Rich Goldstein View Post
              if the above were the only variables in these data sets, you could just use the -duplicates- command to get rid of the extras and all would be fine - but presumably there are other variables of interest in each of the data sets, correct? (if there are other variables of interest in only one data set, just make a new file with only HHBASE and use -duplicates- and then merge); so, I'm still a bit confused - what do you want from each data set and how do you want to combine them? my guess is that you want -joinby- but -cross- is also possible; see
              Code:
              help duplicates
              help joinby
              help cross
              Hi Rich,

              Yes, they are not the only variables and so just dropping the duplicates might not work, unfortunately.

              Regarding the confusion, I hope my reply above might also help provide a slightly better understanding.

              In my first dataset (7,268) I have the "Children" in the HH. But now, I would like to add some information about the head of the household they live in (income and education primarily), and the Household Heads are listed in the second dataset.

              The first dataset is significantly smaller in the number of observations because I have imposed a number of conditions to be left with the sample I am actually interested in investigating. The second dataset has not been "restricted" in any way. Since I am just interested in the HH Head for the "Children" in DS1, I wanted to drop the observations whose HHBASE (DS2) does not equal an HHBASE in DS1, thus leaving me with only the corresponding Heads.

              Then I wish to add the variables for HH Head's income and education to dataset 1. So then I have information on the children and the Head's income and education.



              • #8
                Thanks for the explanation. To avoid further confusion, I am going to call the data that is supposed to have unique HHBASE "Data_A" and the one that has repeated HHBASE containing household heads "Data_B".

                First, with Data_A opened, use -duplicates report- to ascertain the HHBASE uniqueness:
                Code:
                duplicates report HHBASE
                If there are no duplicates, the resultant table should only show one line, with N people having only 1 copy. If there are duplicates, you'd see more lines, and they'd need to be reconciled.

                Second, with Data_A opened, try:
                Code:
                merge 1:m HHBASE using Data_B
                Once you have merged, there would be a new variable called "_merge" added to the data. Stata will also provide a table of that variable by default. It has three levels indicating how many got a merge (3), and how many were only in the master file with no merge (1) or in the using file with no merge (2). Based on your need, you can then use keep or drop on this _merge variable to trim it down to the desirable sample size.
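
                As a minimal sketch of that last step (assuming the files have been saved as Data_A.dta and Data_B.dta):
                Code:
                * with Data_A in memory, attach every matching head from Data_B
                merge 1:m HHBASE using "Data_B.dta"
                * keep only observations present in both files
                keep if _merge == 3
                drop _merge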



                • #9
                  the response in #6 makes it clear that there are duplicates in each file, but it does not make clear, at least to me, what you want to do with those (it appears you want to keep the duplicates in the "first" file, but what you want to do with the duplicates in the "second" file is not clear to me). If Ken Chui is correct in #8, fine; otherwise please clarify (though I still guess that -joinby- is what you probably want in that case)



                  • #10
                    Ken Chui I think you mean the other way round with the data you name A and B? The data with repeated HHBASE does not contain Household Heads.



                    • #11
                      Rich Goldstein
                      Yes, in the first dataset I would like to keep the duplicate households, since the individuals with the same/duplicate household are still distinct (they are siblings), and I will be investigating the difference in their incomes.

                      For the second dataset, I am not so worried about the duplicates since there are very few.

                      But I tried the suggestion you made regarding saving HHBASE as a new file (named: HHBASE1.dta).
                      Then I dropped the duplicates, thus leaving me 3,070 obs.

                      Could you please guide me with the merge process?




                      • #12
                        first, with dataset1 in memory and the new version of dataset2 (no duplicated HHBASEs) saved, you can basically use the code you show in #3 above (just be sure that the filename is correct); note that the generated "_merge" variable will tell you about mis-matches, and you can decide whether to keep or drop them
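
                        Putting the pieces together, the whole process might be sketched as follows (HHBASE1.dta is the key file named in #11; dataset1.dta and dataset2.dta are placeholder names for the poster's actual files):
                        Code:
                        * build a key file with one row per household from dataset1
                        use "dataset1.dta", clear
                        keep HHBASE
                        duplicates drop
                        save "HHBASE1.dta", replace

                        * keep only the heads in dataset2 whose household appears in dataset1
                        use "dataset2.dta", clear
                        merge m:1 HHBASE using "HHBASE1.dta"
                        keep if _merge == 3
                        drop _merge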



                        • #13
                          Ken Chui Rich Goldstein
                          Thank you so much for your help!

                          I have managed to end up with a dataset with just Heads for the Households that I was interested in investigating. So the problem I needed help with has been taken care of now.

                          I could however use maybe some more help now that you are somewhat aware of what I am trying to do.

                          Now, I would like to find the income and education variables for these Household Heads, and then add/merge the information of these two variables to my first dataset (7,286 obs.).

                          Dataset1 includes children (7,286) and dataset2 now includes the information for their HH head (3,070 obs.)

                          How can I go about adding the income and education variable from dataset2 to dataset1 in a way that the children belonging to the same HH (siblings) show a repeated value for their HH Head's income and education?

                          I hope I have explained in detail enough.



                          • #14
                            if you left the income and education variables in dataset2, they would now be in the merged data; if you did not leave them in, you can either go back or make a new subset (doing exactly what you did before) but leaving in those variables; your m:1 merge will take care of putting them in the right places; if you need to merge again, either drop or rename the _merge variable that is left from the previous merge (or rename the new one in the merge); see
                            Code:
                            help merge
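
                            For instance, with dataset1 (the children) in memory, that final step might look like this; "heads.dta", income, and education are placeholder names for the poster's actual file and variables:
                            Code:
                            * attach each Head's income and education to every child in the household
                            merge m:1 HHBASE using "heads.dta", keepusing(income education)
                            * siblings share an HHBASE, so they receive the same Head's values
                            drop _merge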

