  • Merging Household Survey Data Issues

    Hi!

    I'm working on household and individual survey data. I'm trying to combine two datasets with Stata's "merge" command and I'm having some difficulties.

    I'd like to merge two datasets using the "hhcode" variable which is the household number; however, in the master dataset, there are multiple individuals (id) in the household so the code appears multiple times. In the other dataset, the data is simply about the household (there is no id data).

    Whenever I want to merge the datasets, Stata gives me -justifiably so- the r(459) error message "variable hhcode does not uniquely identify observations in the master data".

    I don't know how I can solve/work around this problem. I've tried switching the master and using datasets and different merging configurations (1:1, 1:m, m:1) but the problem remains.

    The two datasets look like this:

    Code:
    input double(hhcode id) int year
    31010101420086 3101010142008601 2008
    31010101420086 3101010142008602 2008
    31010101420086 3101010142008603 2008
    31010101420299 3101010142029901 2008
    31010101420299 3101010142029902 2008
    31010101420299 3101010142029903 2008
    31010101420354 3101010142035401 2008
    31010101420354 3101010142035402 2008
    31010101420354 3101010142035403 2008
    31010101420762 3101010142076201 2008
    31010101420762 3101010142076202 2008
    31010101420762 3101010142076203 2008
    31010101420866 3101010142086601 2008
    31010101420866 3101010142086602 2008
    31010101420866 3101010142086603 2008
    end
    label values hhcode hhcode
    label values id id

    Code:
    input double hhcode int year
    31010101420086 2008
    31010101420299 2008
    31010101420354 2008
    31010101421029 2008
    31010101500091 2008
    31010101500112 2008
    31010101500212 2008
    31010101500365 2008
    31010101500398 2008
    31010101720165 2008
    31010101720363 2008
    31010101900030 2008
    31010101900030 2008
    31010101900226 2008
    31010300920137 2008
    end
    label values hhcode hhcode
    Thank you!

  • #2
    Well, I have never known Stata to be wrong with these messages. A look at your second dataset reveals that you do, indeed, have duplicate observations:

    Code:
                hhcode   year  
        31010101900030   2008
    appears twice, in observations 12 and 13.

    It is hard to spot these things with the eye. I suggest you run:

    Code:
    duplicates report hhcode
    duplicates tag hhcode, gen(flag)
    sort hhcode
    browse if flag
    to get a sense of how big a duplicate problem you have on your hands and to see the offending observations.

    Then, since you said this data set is supposed to contain only one observation per household, you need to figure out how to fix it. If the observations with duplicate values of hhcode agree on all variables, then a quick fix is to just run -duplicates drop- and you'll be rid of them. The problem with that is that if the data set is expected to have only one observation per hhcode, but doesn't conform to that, then something is wrong with the way the data set was built. So then you need to review the data management steps that led up to this and see whether they might have made other things in the data set incorrect in ways that are not so easily stumbled upon. (Or, if you didn't build this data set, ask whoever did to do this.) If, on the other hand, the observations with duplicate hhcode disagree on some other variables, then you need to figure out how to resolve the conflict: which one to keep, or perhaps to combine them in some way, or perhaps to drop all of them!
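
    For the simple case where the duplicated observations agree on every variable, the whole sequence might look roughly like the sketch below. The file names household, household_clean and individual are hypothetical placeholders, not names from this thread, and the sketch assumes the person-level file is the master and the household-level file is the using data set.

    ```stata
    * Sketch only: file names are hypothetical placeholders.
    use household, clear
    duplicates drop        // drops only observations identical on ALL variables
    isid hhcode            // stops with an error if hhcode is still not unique
    save household_clean, replace

    use individual, clear
    * many persons per household in the master, one row per household in using
    merge m:1 hhcode using household_clean
    tabulate _merge        // see how many observations matched
    ```

    If -isid- fails here, the remaining duplicates disagree on at least one variable, and they need to be resolved by hand before the merge will run.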




    • #3
      Dear Clyde,

      Thank you for your help!

      As you advised, I ran the code you provided. It turns out that there are multiple observations per household because the questions about children, spouses, etc. add one observation for each family member, which repeats the household number. These family members are identified by other variables in the survey questionnaire, which I did not spot initially.

      This is the Chinese Household Income Project data (wave 2007), and I'm not the one who built it. I will keep only one family member per household since I don't need more for my study and will drop the rest, getting rid of the duplicates.
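
      A minimal sketch of what I have in mind, assuming the family members are coded by a single variable. The name member is a hypothetical placeholder, not the actual variable name in the data:

      ```stata
      * Sketch only: "member" is a hypothetical name for the family-member code.
      * Sort within household by member, then keep the first observation,
      * so the same member is kept every time the code is run.
      bysort hhcode (member): keep if _n == 1
      isid hhcode            // confirm hhcode now uniquely identifies observations
      ```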

