Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging Data - Isolate non-uniquely identified observations

    Dear all,

    I am currently working on matching two databases, that entail observations on about 600'000 individuals. Some numbers of individuals are the same accross the two datasets, and the goal is to identify which ones. I use Stata 14.

    There are three common variables that are used to identify the individuals: birthdates, city of birth and first name, which are string variables.
    One individual will for example have the observation "NAME" "xx.yy.zzzz" "CITY" in both datasets.
    Please note no other variable can be used to identify the individuals, due to data limitations.

    In order to conduct the merge, I encoded the three variables to get numerical values (using encode, gen as usual)
    I merged 1:1 the three variables simultaneously.

    Code:
    use "xx1"
    keep firstname1 birthdate city_birthplace_c
    encode firstname1, gen(firstname1_num)
    encode birthdate, gen(birthdate_num)
    encode city_birthplace_c, gen(city_birthplace_c_num)
    keep firstname1_num birthdate_num city_birthplace_c_num
    save
    
    use "xx2"
    keep firstname1_c birthdate city_birthplace_c
    encode birthdate, gen(birthdate_num)
    encode city_birthplace_c, gen(city_birthplace_c_num)
    encode firstname1_c, gen(firstname1_num)
    keep birthdate_num city_birthplace_c_num firstname1_num
    
    
    use "xx1"
    merge 1:1 birthdate_num firstname1_num city_birthplace_c using "xx2"
    I get the r(459) error message: variable birthdate_num firstname1_num city_birthplace_c_num does not uniquely identify observations in the master data
    Which I understand means more than one individual has the same first name, the same birthdate and was born in the same city, which is actually quite likely due to the high number of observations.

    My question if the following: is there any comprehensive way to identify which individuals have those 3 identical observations? I've looked around on Statlist, but cannot find anything directly related to this issue.


    Thank a lot for you help,

    Best regards,

    Simon

  • #2
    Originally posted by Simon Depraz View Post
    is there any comprehensive way to identify which individuals have those 3 identical observations?
    You could use the command duplicates tag.
    Code:
    help duplicates
    for more information.

    Comment


    • #3
      Hello Jospeh,

      Thank you very much for your answer. I've considered duplicate, but cannot find a way to make it work accross the two databases. Maybe using an append command?

      Comment


      • #4
        Try this:
        Code:
        use xx1, clear
        contract firstname1 birthdate city_birthplace_c, freq(db1)
        tempfile tmpfil0
        quietly save `tmpfil0'
        use xx2
        contract firstname1 birthdate city_birthplace_c, freq(db2)
        merge 1:1 firstname1 birthdate city_birthplace_c using `tmpfil0', nogenerate
        list if db1 > 1 | db2 > 1, noobs

        Comment


        • #5
          Using encode in two separate datasets is not guaranteed to yield the same encoding in both datasets.

          There is no reason to encode your variables: key variables can be strings.

          This does not affect your duplicates problem, but once you solve that problem, you almost certainly will not be matching the observations correctly.

          Comment


          • #6
            In #4 above, you'll almost certainly need to insert
            Code:
            mvdecode db?, mv(0)
            penultimately.

            .ÿ
            .ÿversionÿ15.1

            .ÿ
            .ÿclearÿ*

            .ÿ
            .ÿinputÿbyteÿattribute

            ÿÿÿÿÿattrib~e
            ÿÿ1.ÿ1
            ÿÿ2.ÿ1
            ÿÿ3.ÿ2
            ÿÿ4.ÿend

            .ÿquietlyÿsaveÿxx1

            .ÿ
            .ÿdropÿ_all

            .ÿinputÿbyteÿattribute

            ÿÿÿÿÿattrib~e
            ÿÿ1.ÿ1
            ÿÿ2.ÿ3
            ÿÿ3.ÿ3
            ÿÿ4.ÿend

            .ÿquietlyÿsaveÿxx2

            .ÿ
            .ÿ*
            .ÿ*ÿBeginÿhere
            .ÿ*
            .ÿuseÿxx1,ÿclear

            .ÿcontractÿattribute,ÿfreq(db1)

            .ÿtempfileÿtmpfil0

            .ÿquietlyÿsaveÿ`tmpfil0'

            .ÿ
            .ÿuseÿxx2

            .ÿcontractÿattribute,ÿfreq(db2)

            .ÿmergeÿ1:1ÿattributeÿusingÿ`tmpfil0',ÿnogenerateÿnoreport

            .ÿ
            .ÿquietlyÿmvencodeÿdb?,ÿmv(0)ÿ//ÿ<=ÿhere

            .ÿlistÿifÿdb1ÿ>ÿ1ÿ|ÿdb2ÿ>ÿ1,ÿnoobsÿabbreviate(20)

            ÿÿ+-----------------------+
            ÿÿ|ÿattributeÿÿÿdb2ÿÿÿdb1ÿ|
            ÿÿ|-----------------------|
            ÿÿ|ÿÿÿÿÿÿÿÿÿ1ÿÿÿÿÿ1ÿÿÿÿÿ2ÿ|
            ÿÿ|ÿÿÿÿÿÿÿÿÿ3ÿÿÿÿÿ2ÿÿÿÿÿ0ÿ|
            ÿÿ+-----------------------+

            .ÿ
            .ÿeraseÿ`c(pwd)'\xx1.dta

            .ÿeraseÿ`c(pwd)'\xx2.dta

            .ÿ
            .ÿexit

            endÿofÿdo-file


            .

            Comment


            • #7
              Thank you very much for your answers. I'll try them.

              Comment

              Working...
              X