Merging survey data same variables in two separate files with different structure

Mohamud Hussein

Join Date: Jan 2015
Posts: 99

Merging survey data same variables in two separate files with different structure

03 Feb 2019, 02:34

Hello Members,

I would like to seek your help on problem encountered in attempting to merge file containing variables for the same a unit of observation (household).

I am trying to use household ID (hh) and other household identifiers (ae, astrata, type) which together should allow to identifying a household uniquely. As shown below the variables are described the same across the two files but have different no. observations, so cannot be matched one-to-one basis to merge files.

Could you suggest any trick to overcome the problem?

Thanks,
Mohamud

FILE 1

Code:

 strata
byte %8.0g strata_id


strata_id

ea
int %8.0g EA ID

block
byte %8.0g Block ID

hh
byte %8.0g Household ID

type
byte %8.0g type_hh Urban/Rural/IDP or Nomad

astrata
byte %8.0g lastrata Analytical strata

weight
float %9.0g Household weight

hhh_edu
byte %8.0g edu_level_broad


Education Level of Household
Head

hhh_age
byte %8.0g Age of Household Head

hhh_gender
byte %8.0g hhm_gender


Gender of Household Head

. sum strata
ea
block hh type astrata weight hhh_edu hhh_age hhh_gender

Variable
Obs Mean Std. Dev. Min Max



strata
6,092 34.24606 14.62617 1 57

ea
6,092 209.0635 114.3199 1 413

block
6,092 31.29711 8.778698 1 69

hh
6,092 9.810407 8.249635 1 60

type
6,092 1.584865 .9465371 1 4



astrata
6,092 7.245732 4.145188 1 15

weight
6,092 318.3864 656.6932 .1215384 10705.87

hhh_edu
4,279 1.332554 1.987206 0 6

hhh_age
6,091 37.79182 11.93012 16 90

hhh_gender
6,092 .5041037 .5000242 0 1

FILE 2

Code:

 describe strata ea block hh astrata

storage display value

variable name type format label variable label



strata int %16.0g Strata

ea int %16.0g EA ID

block byte %16.0g Block ID

hh byte %16.0g Household ID

astrata byte %16.0g astrata Analytical Strata

. sum strata ea block hh astrata

Variable Obs Mean Std. Dev. Min
Max



strata 474,025 256.171 180.3539 101
1303

ea 474,025 171.3944 97.98881 1
341

block 139,430 19.75147 7.056427 1
44

hh 474,025 6.567827 3.51177 1
19

astrata 474,025 13.83234 5.129155 3
22

Tags: None

William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

03 Feb 2019, 06:11

It appears to me that your first dataset is data on each household, 1 observation per household, while your second dataset contains observations of characteristics within each household, several observations per household. If that is the case, then something like the following will perhaps start you on your way.

Code:

use dataset1 merge 1:m strata ea block hh using dataset2

followed by careful review of the output from the merge command to see if there were mismatches: households with no details, or details whose household was not found.

The key is that "1:m" tells merge that dataset1 will have 1 observation for each household, while dataset2 will have multiple observations for each household. You tried "1:1" which tells merge that both datasets have a single observation for each household.

The documentation for the merge command in the Stata Data Management Reference Manual PDF is useful background to making effective use of the merge command.
Comment
Mohamud Hussein

Join Date: Jan 2015

Posts: 99
#3

03 Feb 2019, 07:01

Thank you William- that's very helpful William.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

03 Feb 2019, 09:25

Mohamud Hussein - One more thought has occurred to me.

My experience with survey data is that I would expect your enumeration area ID (ea), block ID (block), and household ID (hh) together to be sufficient to identify your households. So I suggest you try the following.

Code:

use dataset1 isid ea block hh isid block hh

This will test if ea block and hh together are distinct - only one observation for any combination. Here's sample output on made up data to demonstrate what I expect the output to look like.

Code:

. isid ea block hh . isid block hh variables block hh do not uniquely identify the observations r(459);

From the second isid we are told that block and hh together (without including ea) do not uniquely identify observations. From the first isid we see nothing - and lack of an error message means that ea block and hh together do uniquely identify the observations, and no other variable (like strata or astrata or type) is necessary.

If you see similar results, that means my code in post #2 can instead be

Code:

use dataset1 merge 1:m ea block hh using dataset2

Now, if you do that, you should know that since astrata is in both dataset1 and dataset2, when the merge is run, the observations that matched on ea block and hh will retain the value of astrata that was in dataset1, and will ignore the value from dataset2. Which is what you want, I'm sure. But if you ever want to use merge to update (replace values in) one dataset with values from another, it's back to the merge documentation to understand your options.

I hope you find this more helpful than confusing.
Comment
Mohamud Hussein

Join Date: Jan 2015

Posts: 99
#5

03 Feb 2019, 23:50

Many thanks William. Your suggestions are very useful and clear. I appreciate it.
Comment

Announcement

Merging survey data same variables in two separate files with different structure

Comment

Comment

Comment

Comment