  • Data with multiple ID numbers for same observation

    Hi,

    I have a bunch of data with a bunch of problems. Each person in our data sets should have one ID number, but this is not the case: some people have five or more ID numbers. The problem we are running into is that different data sets used different ID numbers. For example, one person might have ID numbers 1, 2, 3, and 4. In one data set they may be listed as "1", in another data set they may be listed as "2", and so on. We are in the process of finding all the duplicate ID numbers and making a "master ID number" for each person, but I don't know how to combine all the data once we get to that point.
    Most of the data we are concerned with capturing is in a "yes/no" format. So, following the example above, if there is a "yes" under ANY of a given person's ID numbers for a certain variable, we want the master ID number to say "yes" for that variable.

    Does anyone have any ideas? Feel free to ask me more questions if this isn't clear.

    Thanks,
    Alyssa

  • #2
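    Based on your description, I would append all of the datasets, code each yes/no variable as 1 for yes and 0 for no, and then run -collapse (max)- on all of those yes/no variables with the by(masterID) option. The maximum within each master ID is 1 whenever any of that person's rows says yes.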

    (but if you think I do not have an accurate understanding of what the data look like, please provide an example!)
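
    A minimal sketch of that suggestion, assuming the yes/no variables are stored as strings. The toy data and the variable names smoker and diabetic are hypothetical, made up only for illustration; masterID is the name used above. If your variables are already numeric 0/1, the recoding loop is unnecessary.

    ```stata
    * Toy data (hypothetical): two people, two rows each, after appending
    clear
    input masterID str3 smoker str3 diabetic
    1 "no"  "no"
    1 "yes" "no"
    2 "no"  "no"
    2 "no"  ""
    end

    * Recode each string yes/no variable to numeric 1/0, keeping blanks missing
    foreach v of varlist smoker diabetic {
        generate byte `v'_num = (lower(strtrim(`v')) == "yes") if !missing(`v')
        drop `v'
        rename `v'_num `v'
    }

    * One row per person: max is 1 if ANY of that person's rows is 1
    collapse (max) smoker diabetic, by(masterID)
    list
    ```

    Here person 1 ends up with smoker equal to 1 because one of their two rows says yes. Note that -collapse (max)- ignores missing values, so a person is left missing on a variable only if every one of their rows is missing on it.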


    • #3
      Hopefully your process of creating a master ID for each person will culminate in a new data set, call it id_crosswalk.dta, that crosswalks the original IDs (let's call that variable original_id) and the filename in which the original ID appears (let's call that variable source) with the master ID (let's call that variable master_id). Once you have that, you can append all your original datasets and run this:

      Code:
      // START WITH APPENDED ORIGINAL DATASETS IN MEMORY, EXPANDED
      // IF NECESSARY TO INCORPORATE THE FILENAME AS A VARIABLE source
      // SHOWING WHICH DATA SET EACH OBSERVATION COMES FROM
      rename id original_id
      merge m:1 original_id source using id_crosswalk, assert(match) nogenerate
      save combined_data_with_master_ids, replace
      Now you have a brand new data set with all observations containing both the original ID and source filename (so you can trace things back later if need be) and a unique master_id for each distinct person. Going forward with analysis, you would use the master_id variable as the person identifier.
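
      The code above assumes the appended data already carry a source variable. One way that starting point might be built is sketched below. The file names survey_a.dta and survey_b.dta are hypothetical placeholders, and the values stored in source must match however source is recorded in id_crosswalk.dta.

      ```stata
      * Hypothetical file names; replace with your own data sets
      clear
      tempfile stacked
      save `stacked', emptyok            // start from an empty data set

      foreach f in survey_a survey_b {
          use `f', clear
          generate source = "`f'"        // record which file each row came from
          append using `stacked'
          save `stacked', replace
      }

      use `stacked', clear               // appended data, ready for the merge above
      ```

      Because this uses a tempfile and local macros, it needs to run in one go from a do-file rather than line by line. It also assumes that the same variable has the same storage type (string or numeric) in every file; otherwise -append- will stop with an error unless you harmonize the types first.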
