How to merge multiple sub-datasets of a household survey

Dung Le

Join Date: May 2018

Posts: 120
#1

How to merge multiple sub-datasets of a household survey

13 Jun 2018, 23:16

Dear all,

I know that there are some threats asking about data merging issues and I have read some of them. However, I find my dataset are quite different relative to what I have read in these forums. In addition, I have tried to apply what I have learned in those threats into my datasets but unfortunately it did not work. Therefore, I decided to make a new one and followings are my data and coding information:

In fact, I have three sub-datasets from a household survey, namely: file1 that contains demographic info; file2 contains educational info and the last one contains healthcare info. I realize that there are some common variables existed in the three files, including: province, district, commune, enumeration area, household identification and household member ID. Hence, my strategy is to base on these six variables to merge the three sub-datasets and here are my coding:

I sorted data first

Code:

use "C:\Users\file2.dta", clear count sort pro dist comm enum_area hh_ind hh_mem_id save "C:\Users\file2_sorted.dta", replace use "C:\Users\file3.dta", clear count sort pro dist comm enum_area hh_ind hh_mem_id save "C:\Users\file3_sorted.dta", replace

Merge data - I am using file1 as master and the others are using. I am using m:1 -merge

Code:

use "C:\Users\file1.dta", clear count sort pro dist comm enum_area hh_ind hh_mem_id merge m:1 pro dist comm enum_area hh_ind hh_mem_id using "C:\Users\file2_sorted.dta" tab _merge keep if _merge==3 drop _merge merge m:1 pro dist comm enum_area hh_ind hh_mem_id using "C:\Users\file3_sorted.dta" tab _merge keep if _merge==3

The codes worked for merging file1 and file2 but failed when I tried to merge the last file, it said that "variables pro dist comm enum_area hh_ind hh_mem_id do not uniquely identify observations in the using data". I do not know why so could anyone have any ideas/suggestions to solve the problem?

In my case, I do not know how to create a sample data using dataex so I attach my datasets here (please kindly find the attached).

I hope I can have replies from you.

Thank you.

Attached Files

file2.dta (95.6 KB, 1 view)

file3.dta (106.4 KB, 1 view)

file1.dta (203.0 KB, 1 view)
Tags: None
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

14 Jun 2018, 05:09

Welcome to the Stata Forum / Statalist.

The codes from dataex are available at Stata 15.1 and can be installed - findit dataex - if you have an older version.

Since you mentioned that you "do not know how to create a sample data using dataex", here are a few tips:

You may type - help dataex - in the command window.

That said, let's say you wish the first 10 observations, you will just need to type:

Code:

dataex varlist in 1/10

In case you wish to provide all variables, please exclude "varlist". In case you wish to select 3 variables, you just need to replace "varlist" by the respective names.

Best regards,

Marcos
Comment
Dung Le

Join Date: May 2018

Posts: 120
#3

14 Jun 2018, 06:27

Originally posted by Marcos Almeida View Post

Welcome to the Stata Forum / Statalist.

The codes from dataex are available at Stata 15.1 and can be installed - findit dataex - if you have an older version.

Since you mentioned that you "do not know how to create a sample data using dataex", here are a few tips:

You may type - help dataex - in the command window.

That said, let's say you wish the first 10 observations, you will just need to type:

Code:

dataex varlist in 1/10

In case you wish to provide all variables, please exclude "varlist". In case you wish to select 3 variables, you just need to replace "varlist" by the respective names.

Dear Marcos,

Thank you so much for the dataex information and how to use it, I now can utilize the dataex function properly . By the way, do you have any suggestions dealing with my merging matters?

Best regards,

Dung Le
Comment
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#4

14 Jun 2018, 06:39

If typing - help merge - in Stata's Command Window doesn't provide a solution to you case, I recommend to use - dataex - to provide a short example of the datasets.

Best regards,

Marcos
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2411
#5

14 Jun 2018, 06:56

I'd presume Stata is referring to file 3 when it says "pro dist comm enum_area hh_ind hh_mem_id do not uniquely identify observations." To diagnose the problem, I'd suggest you open file3, and use the -duplicates- command to find out what observations do not have unique identifying information. Something like this should work:

duplicates report pro dist comm enum_area hh_ind hh_mem_id

One potentially tricky possibility that I have seen occurs if a file has several completely blank observations, usually at the end of the file. To Stata, such observations will have the same identifying information, i.e. the "." missing value code.

Last edited by Mike Lacy; 14 Jun 2018, 06:58.
Comment
Dung Le

Join Date: May 2018

Posts: 120
#6

14 Jun 2018, 07:36

Originally posted by Mike Lacy View Post

I'd presume Stata is referring to file 3 when it says "pro dist comm enum_area hh_ind hh_mem_id do not uniquely identify observations." To diagnose the problem, I'd suggest you open file3, and use the -duplicates- command to find out what observations do not have unique identifying information. Something like this should work:

duplicates report pro dist comm enum_area hh_ind hh_mem_id

One potentially tricky possibility that I have seen occurs if a file has several completely blank observations, usually at the end of the file. To Stata, such observations will have the same identifying information, i.e. the "." missing value code.

Dear Mike,

Thank you for your codes and suggestions. I have followed your instructions and seems it worked.

Firstly, I used duplicates command to report duplicated observations and here are the results

Code:

duplicates report pro dist comm enum_area hh_ind hh_mem_id

--------------------------------------
copies | observations surplus
----------+---------------------------
1 | 1778 0
2 | 600 300
3 | 144 96
4 | 28 21
5 | 5 4
--------------------------------------

And then, I looked at a duplicates list by using

Code:

duplicates list pro dist comm enum_area hh_ind hh_mem_id

Finally, I dropped all duplicated observations by using

Code:

duplicates drop pro dist comm enum_area hh_ind hh_mem_id, force

that resulted in 421 observations deleted. Following this, I did the same merging procedures that I mentioned in my first post and it worked. However, I have a minor question that should I use the command "keep if _merge==3" or I should not because I have seen someone advises not to do, therefore, I am quite confused. By the way, I just want to make sure that my merging procedures go to a right direction.

Thank you.

Best regards,

Dung Le
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2411
#7

14 Jun 2018, 10:20

It's likely but not certain that "keep if _merge ==3" is what you want. Look at -help merge-, in particular the description of what the coding of the _merge variable means, and then decide on that basis if that's what you want.

Regarding your decision to drop all duplicated observations: Whether or not that's a good idea would depend on why the observations happened to be duplicated, i.e., was it just an accident, or were there perhaps corrected versions of observations entered in the data set, etc.
Comment

Announcement

How to merge multiple sub-datasets of a household survey

Comment

Comment

Comment

Comment

Comment

Comment