  • Creating a dummy variable to match Twins in repeated cross-section census data

    Good Day,

    I am using Stata 17. I am having trouble matching potential twin pairs in IPUMS USA ACS microdata (dataex sample below). My dataset has 90 million observations; the sample below does not include any obvious twin pairs, but the full dataset will. I want to create a dummy variable called "twin" that takes a value of one if two conditions are met:

    1. serial (family identifier, type = double) is the same.
    2. age (age of person, type = integer) is the same.

    Essentially, if two or more observations have the same "serial" and "age" values, I would like a "twin" dummy variable equal to one, and zero otherwise. I have tried creating a twin variable and replacing it using if conditions (1). The replace method does not work: it gives every observation a value of 1 for "twin". I have created a unique id "id" for each individual and was wondering if there is a Stata command that would let me match using this "id" and these conditions. I have been looking into the command vmatch but cannot get it to do what I want.

    (1) gen twin = 0
    replace twin = 1 if serial == serial | age == age
    year serial stateicp pernum sex age twin id
    2017 4 Alabama 4 Female 7 1 1
    2017 11 Alabama 5 Female 8 1 2
    2017 11 Alabama 4 Male 15 1 3
    2017 13 Alabama 4 Male 12 1 4
    2017 13 Alabama 3 Male 13 1 5
    2017 18 Alabama 3 Female 11 1 6
    2017 21 Alabama 4 Male 13 1 7
    2017 22 Alabama 5 Male 9 1 8
    2017 22 Alabama 4 Female 12 1 9
    2017 22 Alabama 3 Female 13 1 10
    2017 23 Alabama 3 Male 7 1 11
    2017 28 Alabama 4 Female 14 1 12
    2017 28 Alabama 3 Female 15 1 13
    2017 29 Alabama 4 Female 7 1 14
    2017 29 Alabama 3 Female 10 1 15
    2017 39 Alabama 3 Male 8 1 16
    2017 41 Alabama 4 Female 13 1 17
    2017 41 Alabama 3 Male 15 1 18
    2017 46 Alabama 6 Male 8 1 19
    2017 46 Alabama 5 Male 9 1 20
    2017 46 Alabama 4 Male 14 1 21
    2017 46 Alabama 3 Male 15 1 22
    2017 52 Alabama 4 Female 11 1 23
    2017 53 Alabama 3 Female 7 1 24
    2017 55 Alabama 2 Male 10 1 25
    Thank you,

    Michael

  • #2
    Do you have a date of birth? Age can incorrectly mark "Irish twins" as twins: https://www.thebump.com/a/irish-twins.



    • #3
      I don't buy your logic. If you mark twins whenever either the serial values or the age values are the same, then all people in the same household are twins, and all people who are the same age are twins.

      I think you mean to mark twins whenever both serial and age values are the same. That would be two people of the same age in the same household. It isn't perfect: one might have two children born 9 months apart who are the same integer age on the census date, or a household that includes an unrelated person who is the same age as a family member. But there is probably no way around these particular problems in the kind of data you show.
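
      As a side note on why the original -replace- set every observation to 1: Stata evaluates the if condition one observation at a time, so serial == serial compares each value with itself and is always true. The sketch below illustrates this on a few made-up rows (the toy values are assumptions, not taken from the real file):

      ```stata
      clear
      input serial age
      4  7
      11 8
      11 15
      end

      * The original attempt: each observation is compared only with itself
      generate byte twin = 0
      replace twin = 1 if serial == serial | age == age

      * Every row is flagged, although no two rows share both serial and age
      count if twin == 1
      list
      ```

      Comparing an observation with other observations requires a group-wise operation such as -duplicates tag- or a by: prefix.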

      There is an easy way to get your twin variable here:
      Code:
      duplicates tag serial age, gen(twin)
      As you point out, there are no instances of twins in the example data--an example that included some would have been better, so one could truly test that this code works.
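
      One detail worth checking: -duplicates tag- stores the number of other observations that share the same values, so a twin pair gets 1 but a set of triplets gets 2. If a strict 0/1 dummy is wanted, a minimal sketch on made-up data that does contain same-age household members could look like this (the rows and the names n_same and twin_by are hypothetical):

      ```stata
      clear
      input year serial pernum age
      2017 4  3 7
      2017 4  4 7
      2017 11 4 15
      2017 11 5 8
      2017 13 3 12
      2017 13 4 12
      2017 13 5 12
      end

      * n_same = number of OTHER observations with the same serial and age
      duplicates tag serial age, generate(n_same)

      * Collapse the count into a 0/1 indicator
      generate byte twin = n_same > 0

      * The same indicator built from the group size _N
      bysort serial age: generate byte twin_by = _N > 1

      list, sepby(serial)
      ```

      Here household 4 is flagged as a pair, household 13 as a set of three, and household 11 is not flagged.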

      Also, you refer to your display as a -dataex- example, but it clearly is not from -dataex-. In this instance, your display was usable anyway. But in the future, when showing example data, please do use -dataex- to do so. There will be situations where not doing so will make it difficult or impossible to resolve your question.

      Added: Crossed with #2.



      • #4
        Thank you Clyde and Andrew! I was able to get the results I wanted. Yes, I meant that both serial and age must be the same (sorry for the unclear writing). While I don't have date of birth, I do have month of birth, which I added to Clyde's code to rule out Irish twins (hopefully). The code I ended up running is:

        Code:
        duplicates tag cbserial age birthmo, gen(twin)

        cbserial is a version of serial that distinguishes between years (I initially did not realize that serial is reused in different years). There were even a few triplets and quadruplets found. I am new to Statalist and copied my dataex output to Excel to copy and paste here. I will brush up on the proper use of dataex.
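
        For anyone following along without cbserial, here is a rough sketch of the same idea on made-up rows. It assumes that year and serial together identify a household, which should be checked against the IPUMS documentation; the variable hh and the values are hypothetical stand-ins, not my actual data:

        ```stata
        clear
        input year serial age birthmo
        2017 4  7 3
        2017 4  7 3
        2018 4  7 3
        2018 22 9 5
        2018 22 9 5
        2018 22 9 5
        2018 22 9 11
        end

        * Hypothetical household key that keeps years apart (stand-in for cbserial)
        egen long hh = group(year serial)

        * Number of people sharing household, age, and birth month
        bysort hh age birthmo: generate byte n_same = _N

        * 0/1 indicator; n_same of 3 or 4 would be triplets or quadruplets
        generate byte twin = n_same > 1

        tabulate n_same
        ```

        In this toy example the 2018 person with serial 4 is not matched to the 2017 pair, and the 9-year-old born in a different month is not grouped with the three born in month 5.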

