Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Grouping IDs from two variables

    Hello, all, I have a coding problem that I'm hoping smarter people than I can help solve. I am working with tax data that is at the return level. I am trying to group together returns filed by the same people, based on their unique IDs (a masked version of SSN). Each return has a primary filer ID and a spouse ID (if two people filed a joint return; if not, the spouse ID field has a missing value). It is easy to find returns filed by the same people if the same ID is primary on each return. However, I have a large number of instances where they switch places. That is, the records look like this:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float returnID str1(primID spID)
    1 "A" "B"
    2 "B" "A"
    3 "C" "D"
    4 "D" "C"
    end

    I'd like to add a "household ID" to this. At the moment, I have the following solution, which works but is very slow to run over hundreds of thousands of records:

    Code:
    gen id1 = primID
    gen id2 = spID
    reshape long id, i(returnID) j(filernum)
    
    levelsof id, local(idlevels) clean
    
    gen hhid = .
    local i = 1
    
    foreach lvl in `idlevels' {
    
    levelsof primID if id=="`lvl'" , local(prim) clean
    levelsof spID if id=="`lvl'" , local(sp) clean
    local allids `prim' `sp'
    local allids : list uniq allids
    
    tokenize `allids'
    replace hhid = `i' if (id=="`1'" | id=="`2'")  & hhid==.
    
    local i = `i'+1
    }
    
    reshape wide id, i(returnID) j(filernum)
    Also, when I really get large numbers of records, the line
    Code:
    levelsof id, local(idlevels) clean
    won't run because the local idlevels exceeds Stata's macro character limits.

    Any thoughts on a faster and neater way to do this? Thanks very much in advance!!
    Last edited by Julia Brown; 17 Jul 2025, 10:25.

  • #2
    See

    Stata Tip 71: The Problem of Split Identity, or How to Group Dyads

    Code:
    SJ-8-4  dm0043  . Tip 71: The problem of split identity, or how to group dyads
            . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
            Q4/08   SJ 8(4):588--591                                 (no commands)
            tip on how to handle dyadic identifiers
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float returnID str1(primID spID)
    1 "A" "B"
    2 "B" "A"
    3 "C" "D"
    4 "D" "C"
    end
    
    gen newID = trim(cond(primID < spID, primID + " " + spID, spID + " "  + primID)) 
    
    list 
    
         +----------------------------------+
         | returnID   primID   spID   newID |
         |----------------------------------|
      1. |        1        A      B     A B |
      2. |        2        B      A     A B |
      3. |        3        C      D     C D |
      4. |        4        D      C     C D |
         +----------------------------------+

    Comment


    • #3
      Thank you very much, Nick Cox ! I had been thinking along similar lines but your way is neater and I didn't know that cond() could handle strings.

      I should have mentioned in my previous post that I also have records where people file as single sometimes and married other times, and I'd like to group those records, too, which your code doesn't do. But I think your code will reduce my sample size enough that I'll then be able to run my original idea on the remaining observations.

      (actual data is more like this:
      Code:
      clear
      input float returnID str1(primID spID)
      1 "A" "B"
      2 "B" "A"
      3 "C" "D"
      4 "D" "C"
      5 "E" "F"
      6 "F" 
      7 "G" "H" 
      8 "H" "G"
      9 "I" 
      10 "J" "I" 
      11 "K"
      12 "K" "L" 
      13 "M" "N" 
      14 "M" 
      end

      Comment


      • #4
        Well, this could be problematic. If Mr. Smith is initially unmarried, he files as a single head of household. Suppose he then marries and files jointly with his spouse. Is that the same household? Why isn't it the same household as that of the spouse before they married? And suppose that some years later they divorce, and each now file separate returns. I imagine you no longer want to link either of them to that earlier married household, but rather to their earlier separate households. So it seems to me that to do this properly you need more information: you need to know whether a switch between filing statuses is the result of a change in the household structure, or just represents a change in filing status for financial or other reasons.

        To make matters worse, suppose that after a few years of being divorced he remarries another person. Rinse and repeat.
        Last edited by Clyde Schechter; 17 Jul 2025, 12:55.

        Comment

        Working...
        X