How to create unique identifier for two datasets for merging

Chei Bukari

Join Date: Sep 2017

Posts: 14
#1

How to create unique identifier for two datasets for merging

14 Sep 2017, 07:45

Dear all,
I have two datasets (A &B) from two sections (A &B) of the same survey (Ghana Living Standard Survey, GLSS 6). Dataset A is on FINANCIAL SERVICES of households while dataset B is on AS SETS AND DURABLE CONSUMER GOODS of these households
But each of this datasets has no unique identifier. I want to know the proportion of households owing various assets and consumer durables by locality.
1. Please How do I create ID that uniquely identify observations for the these two datasets so as merge them?

2. Please help me understand what could possibly went wrong if you do crosstabs and some of your observations dropped, is it from dataset or stata command issues?
for example

tab loc5 (where var loc5 means various levels of locality)

loc5 Freq. Percent
---------------+-----------------------------------
Accra (GAMA) | 1,697 10.12
Other Urban | 5,748 34.27
Rural Coastal | 1,156 6.89
Rural Forest | 3,863 23.03
Rural Savannah | 4,308 25.69
---------------+-----------------------------------
Total | 16,772 100.00

tab s12aq16 (where var s12aq16 means do you have an insurance policy?

s12aq16 Freq. Percent
------------+-----------------------------------
yes | 13,822 22.54
no | 47,500 77.46
------------+-----------------------------------
Total | 61,322 100.00

And then I do crosstabs of loc5 by s12aq16 or the other way round but some of my observations dropped. Such as below.
tab loc5 s12aq16

| s12aq16
Loc5 yes no | Total
---------------+----------------------+----------
Accra (GAMA) | 553 1,141 | 1,694
Other Urban | 2,129 3,608 | 5,737
Rural Coastal | 237 918 | 1,155
Rural Forest | 873 2,989 | 3,862
Rural Savannah | 589 3,708 | 4,297
---------------+----------------------+----------
Total | 4,381 12,364 16,745

Ideally the minimum total should have been 16,772 and not 16,745.
So what went wrong and 27 observations were dropped?
Please sort me out.
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35637
#2

14 Sep 2017, 08:05

#2 Look out for missing values. tabulate has an option to show them as such.
Comment

Chei Bukari

Join Date: Sep 2017
Posts: 14

14 Sep 2017, 08:05

sorry the tables were not properly displayed it should rather look at these:

tab loc5 (where var loc5 means various levels of locality)

loc5	Freq.	. Percent
Accra (GAMA)	1,697	10.12
Other Urban	5,748	34.27
Rural Coastal	1,156	6.89
Rural Forest	3,863	23.03
Rural Savannah	4,308	25.69
Total	16,772	100.00

tab s12aq16 (where var s12aq16 means do you have an insurance policy?

s12aq16	Freq.	. Percent
yes	13,822	22.54
no	47,500	77.46
Total	61,322	100

And then I do crosstabs of loc5 by s12aq16 or the other way round but some of my observations dropped. Such as below.
tab loc5 s12aq16

loc5	yes	no	Total
Accra (GAMA)	1,697	10.12	1,694
Other Urban	5,748	34.27	5,737
Rural Coastal	1,156	6.89	1,155
Rural Forest	3,863	23.03	3,862
Rural Savannah	4,308	25.69	4,297
Total	4,381	12,364	16,745

Ideally the minimum total should have been 16,772 and not 16,745.
So what went wrong and 27 observations are dropped?
Please sort me out

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#4

14 Sep 2017, 08:52

How do I create ID that uniquely identify observations for the these two datasets so as merge them?

This may or may not be possible. One possibility is that the two data sets come with corresponding observations in the same order. This is unusual, but the documentation that accompanied the survey data would explain this if it is so. In that case -merge 1:1 _n- will do the job.

Another possibility is that although there is no single identifying variable in either data set, there is some combination of variables, common to both datasets, that together uniquely identify respondents. There might be, for example, a household identifier variable in combination with another variable that identifies individual people within the household. In a situation like that you could use -merge 1:1 household_id individual_id- to combine the data sets.

If neither of those situations obtains, then you cannot merge them. The best you could hope to do is an approximate merge. Even this would require a set of common variables that almost identify individual respondents. For example, the combination of a household identifier and date of birth would usually, but not always, identify individuals (but would fail to distinguish two household members with the same date of birth.) In that case you could combine the data sets using those variables, but you would not have certainty that all of your pairings are correct.

If you don't even have a set of common variables that come close to identifying the individuals, then I don't think there is anything at all you can do.
1 like
Comment
Daniel Bela

Join Date: Apr 2014

Posts: 246
#5

14 Sep 2017, 13:47

HI,

(2) has been answered already: It is tabulate's option ", missing" that can be used to include all observations in the tables.

Regarding your question (1):
According to the online documentation of the named survey, both the dataset on ASSETS AND DURABLE CONSUMER GOODS and the dataset on FINANCIAL SERVICES contain a variable "HID", the unique household number. This is your primary merging variable, and it will allow you to join your information together.

However, I assume what irritates you is the fact that, inside of each of both data files, this HID variable is not uniquely identifying observations; without having access to example data from both files (by the way: please read the FAQ, especially the section about how to post example data using -dataex-, to enable us giving the best possible answers to your question) it is hard to tell, but my assumption is that each file contains more than one observation per household because it lists several assets (or financial services, respectively) in a long data format.

Thus, the solution would be to reduce each file to the information you're interested in (like, for instance, the total value of all assets per household), so that each datafile is shrunk to one observation per household, before merging them.

Regards
Bela
1 like
Comment
Chei Bukari

Join Date: Sep 2017

Posts: 14
#6

19 Sep 2017, 07:47

Thank you all for your tremendous suggestions on unique ID problem. I has been able to successful merge the datasets using your suggestions.
Kindly permit me to share this with you.
First, I created an ID unique to both datasets and then merged based on this unique ID..
The second option which equally worked perfectly was that, I realized there was a particular ID (i.e., person id) common to both datasets but labeled differently (i.e., one with lower case and the other with upper case). so the simplest thing I did was to change the labeling such that it suits both datasets. After this then I merged the two datasets. To be sure I did the right thing, I replicated a couple of 10 tables of interest in the Main report (GLSS 6 main report). Fortunately, there were consistencies indicating that I am at least save with the merging.
Once again, thank you all.
Comment

Announcement