
  • Merge: variable does not uniquely identify observations in the master data

    Hi everyone,

    I have a data set in which the household ID variable does not uniquely identify observations, because a single household has multiple members. I want to merge it with another data set that has the same household ID, where it does uniquely identify observations. I merged the using data into the master data with the following command:

    merge 1:m HH_ID using education

    where education is my using data, in which observations are not uniquely identified. It worked, and I now have a merged data set. However, some observations are unmatched and exist only in the master data. Are these duplicates? Should I drop them and keep only the matched observations?

  • #2
    The unmatched observations correspond to values of HH_ID that occur in the master data only: they are not duplicates. They are simply households that, for whatever reason, were not included in the using data.
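
    As a rough illustration, the _merge variable that -merge- creates can be used to look at those households directly. This is only a sketch: it assumes the merged data set is in memory and that _merge has not been dropped or renamed, and tag_hh is a made-up variable name.

    ```stata
    * Count observations by match status
    * (1 = master only, 2 = using only, 3 = matched)
    tabulate _merge

    * Flag one observation per unmatched household in the master data
    * (tag_hh is a hypothetical name)
    egen byte tag_hh = tag(HH_ID) if _merge == 1

    * Number of distinct unmatched households
    count if tag_hh == 1

    * List their IDs for inspection
    list HH_ID if tag_hh == 1, noobs
    ```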

    As for what to do about that, it depends on why this mismatching occurred and what you plan to do with the merged data. Several possibilities:

    1. These unmatched HH_IDs in the master data could be data errors: perhaps the HH_IDs were mistyped when the data were entered, or were somehow changed during data management to values that correspond to no real household.

    2. The unmatched HH_IDs might be real households about which data really was collected but they were for some reason not included in the data gathering for the using data. You would have to consult the documentation accompanying the survey data to understand what the inclusion criteria for each data set were. Perhaps the variables in the using data set are not applicable to those households.

    3. The unmatched HH_IDs might be real households about which data was collected, but they fail to appear in the using data because the using data contains errors in the HH_ID variable, so that those values were mistyped or mistakenly deleted during data management. If the survey documentation implies that the same households should appear in both data sets, then you would need to consult with the people who created these data sets to find out why the data does not work as advertised.

    Once you have ruled out data errors as the source of the problem, whether to drop the unmatched observations depends on what you will be doing with the merged data set. If you plan to do regression analyses that involve variables in the using data, Stata will exclude those observations from the estimation sample anyway, because only complete cases are included in regression calculations. So in this case you could delete them or leave them in and it will make no difference for the regressions. But for analyses involving only variables in the master data set, you need to decide whether the most relevant analysis is one involving all HH_IDs or only those that have complete data for all analyses (including those with using data set variables). Only you know enough about your project and its purposes to decide that.
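
    To see the complete-case behavior for yourself, something along these lines could be run on the merged data. It is a sketch only: y, x_master, x_using and in_sample are placeholder names, not variables from your data.

    ```stata
    * Option A: explicitly keep only matched households
    * (commented out so that nothing is dropped by accident)
    * keep if _merge == 3

    * Option B: keep everything and let the estimation command
    * leave out observations with missing values
    * y, x_master and x_using are placeholder variable names
    regress y x_master x_using

    * e(sample) marks the observations actually used in the regression
    generate byte in_sample = e(sample)

    * Master-only observations (_merge == 1) should all have in_sample == 0,
    * because x_using is missing for them
    tabulate _merge in_sample
    ```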

    Depending on the reasons that those households are missing from the using data, you might try multiple imputation to fill in the missing information and get unbiased regression estimates. In that case, you would need to retain these observations for use in all the analyses.
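
    If you went that route, the general shape of the code would be roughly as follows. This is a minimal sketch with placeholder variable names (y, x_master, x_using), an arbitrary number of imputations, and an arbitrary seed. A real imputation model would need to be chosen to suit your data, including the fact that the missing values occur at the household level.

    ```stata
    * Declare the data as multiple-imputation data
    mi set mlong

    * x_using is a placeholder for a using-data variable with missing values
    mi register imputed x_using

    * Impute x_using from the outcome and a fully observed master-data variable
    * (20 imputations and the seed value are arbitrary choices)
    mi impute regress x_using y x_master, add(20) rseed(12345)

    * Fit the analysis model on each imputed data set and combine the results
    mi estimate: regress y x_master x_using
    ```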



    • #3
      Thank you! Is there a way to check in Stata whether the unmatched HH_IDs are data errors? And is there a way I can compare the household IDs in the master and using data files to see if some only exist in the master data? The 3rd case seems highly unlikely as it is a national level survey.



      • #4
        Is there a way to check in Stata whether the unmatched HH_IDs are data errors? And is there a way I can compare the household IDs in the master and using data files to see if some only exist in the master data?
        Well, you have already stated in #1 that you found unmatched HH_IDs that are in the master data but not the using data. That is not the question. The question is whether it is supposed to be that way or whether this represents some kind of error. There is nothing that Stata can do to distinguish these possibilities. It requires understanding the processes by which the data sets were created. The first question I would ask you is whether the data sets you are working with are the original data sets provided to you, or whether you have modified them in some way. If the latter, then you need to review whatever data management you did to see if it could have either created spurious HH_IDs in the master or removed some HH_IDs from the using.
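
        While reviewing your own data management, a few descriptive checks on each file can at least show how HH_ID is stored and how often each value occurs. They will not tell you whether a mismatch is an error. In this sketch, master is a placeholder file name, since you did not give the name of your master data set.

        ```stata
        * master is a placeholder name for the master data file
        use master, clear

        * How many HH_ID values occur once, twice, and so on
        duplicates report HH_ID

        * Storage type of HH_ID: a numeric ID stored as float
        * can lose digits when the values are large
        describe HH_ID

        * Repeat for the using data
        use education, clear
        duplicates report HH_ID
        describe HH_ID

        * If HH_ID is a string, count values with leading or trailing blanks
        capture confirm string variable HH_ID
        if _rc == 0 {
            count if HH_ID != strtrim(HH_ID)
        }
        ```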

        The documentation that comes with national surveys usually contains some statements about these matters, and I would start by consulting that. If the documentation says nothing about it, you might want to contact the survey administrator and see if they can clarify what is happening.



        • #5
          Okay, thank you.
