Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Remove duplicates

    Hi all,

    I need some help transforming a multiple record dataset to unique person. In my dataset only 80% of persons have id present and even when is present it may have errors in it. Therefore, to account for that I need to use the help of other identifiers such as : first name, last name and birth date which may have errors present . I am illustrating below my dataset and some of issues relating to it to help understanding.

    The way I have done the transformation was using three steps: In the first step, I identified unique persons using duplicates report and tag commands among those with id present by using the id first name, last name and birth date. During the second step, I identified unique persons among those that do not have id present based on the first name, last name and birth date only. In the last I repeat the second step by looking at all records. I am just wondering if there is a more robust method and shorter way to do it ?

    Your help is really appreciated,


    id last_name first_name birth_date
    11 Red Pencil June-01-2004
    11 Red Penicl June-01-2004
    20 Empty Box May-27-1967
    . Empty Box May-27-1967
    30 Beautiful Day April-4-1939
    . Beautiful Day April-3-1939
    14 Red Carpet September-01-2001
    41 Red Carpet September-01-2001
    14 Red Carpet September-01-2001


    Adriana

  • #2
    If you are sure that the combination of first_name last_name and birth_date identifies individually everyone (and have no missing value), thus two possibilities are open to you :

    1) Re-generate a proper individual_id, with the egen group command (it might be useful for anything else)
    Code:
    drop id
    egen id_individual = group(last_name first_name birth_date)
    This will give you an id foreach combination of these three variables, hence for each person, so in your example "empty box" will have the same id_individual, rather than one missing.
    Afterwhile if you want to remove duplicates use the duplicate drop command or the line : bysort id_indiv : keep if _n==1, that will only keep the first observation of each individual.


    2) Directly drop duplicates based on the three identification variables (first_name last_name birth_date)
    Code:
    duplicates drop first_name last_name birth_date, force
    This is simpler, however it solution doesn't guarantee that you'll have a proper individual_id, since it could kept the "empty box" observation where id is missing.

    Hope this helps
    Charlie

    Comment


    • #3
      Careful about using the -force- option with duplicates drop. If there are observations that agree on first_name last_name and birth_date but have differing information about other variables, you will end up with having arbitrarily selected one version of those other variables. If those other variables don't matter, then that's ok. But really it would be safer to check them out first:

      Code:
      unab vbles: _all
      local key first_name last_name birth_date
      local vbles: list vbles - key
      
      gen byte discrepancy = 0
      foreach v of varlist vbles {
          by `key' (`v'), sort: replace discrepancy = 1 if `v'[1] != `v'[_N]
      }
      assert discrepancy == 0
      If the -assert- statement passes, then you can just do the -duplicates drop- command without the -force- option and it will run. (For that matter, with all the variables other than first_name last_name and birth_date concordant, you can just simplify it to -duplicates drop-

      But if the -assert- statement fails, you should look at those observations where discrepancy = 1 and try to decide how to reconcile those cases. Picking one arbitrarily (which is what -duplicates drop first_name last_name birth_date, force- will do) is only occasionally a good way to handle that.

      Comment


      • #4
        Hi both,
        The first suggestion does not account for possible errors in any of the identifiers. I loved the second suggestion but I had to modify a bit since I still want to use the id to identify unique individuals . Id helps me identify reasons for the discrepancy caused by errors in the last_name first_name and date of birth. Thank you so much to both of you.

        Comment

        Working...
        X