Grouping IDs from two variables

Julia Brown

Join Date: Jan 2015

Posts: 12
#1

Grouping IDs from two variables

17 Jul 2025, 10:19

Hello, all, I have a coding problem that I'm hoping smarter people than I can help solve. I am working with tax data that is at the return level. I am trying to group together returns filed by the same people, based on their unique IDs (a masked version of SSN). Each return has a primary filer ID and a spouse ID (if two people filed a joint return; if not, the spouse ID field has a missing value). It is easy to find returns filed by the same people if the same ID is primary on each return. However, I have a large number of instances where they switch places. That is, the records look like this:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float returnID str1(primID spID) 1 "A" "B" 2 "B" "A" 3 "C" "D" 4 "D" "C" end

I'd like to add a "household ID" to this. At the moment, I have the following solution, which works but is very slow to run over hundreds of thousands of records:

Code:

gen id1 = primID gen id2 = spID reshape long id, i(returnID) j(filernum) levelsof id, local(idlevels) clean gen hhid = . local i = 1 foreach lvl in `idlevels' { levelsof primID if id=="`lvl'" , local(prim) clean levelsof spID if id=="`lvl'" , local(sp) clean local allids `prim' `sp' local allids : list uniq allids tokenize `allids' replace hhid = `i' if (id=="`1'" | id=="`2'") & hhid==. local i = `i'+1 } reshape wide id, i(returnID) j(filernum)

Also, when I really get large numbers of records, the line

Code:

levelsof id, local(idlevels) clean

won't run because the local idlevels exceeds Stata's macro character limits.

Any thoughts on a faster and neater way to do this? Thanks very much in advance!!

Last edited by Julia Brown; 17 Jul 2025, 10:25.
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35782

17 Jul 2025, 10:30

See

Stata Tip 71: The Problem of Split Identity, or How to Group Dyads

Code:

SJ-8-4  dm0043  . Tip 71: The problem of split identity, or how to group dyads
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        Q4/08   SJ 8(4):588--591                                 (no commands)
        tip on how to handle dyadic identifiers

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float returnID str1(primID spID)
1 "A" "B"
2 "B" "A"
3 "C" "D"
4 "D" "C"
end

gen newID = trim(cond(primID < spID, primID + " " + spID, spID + " "  + primID)) 

list 

     +----------------------------------+
     | returnID   primID   spID   newID |
     |----------------------------------|
  1. |        1        A      B     A B |
  2. |        2        B      A     A B |
  3. |        3        C      D     C D |
  4. |        4        D      C     C D |
     +----------------------------------+

Comment

Julia Brown

Join Date: Jan 2015

Posts: 12
#3

17 Jul 2025, 11:45

Thank you very much, Nick Cox ! I had been thinking along similar lines but your way is neater and I didn't know that cond() could handle strings.

I should have mentioned in my previous post that I also have records where people file as single sometimes and married other times, and I'd like to group those records, too, which your code doesn't do. But I think your code will reduce my sample size enough that I'll then be able to run my original idea on the remaining observations.

(actual data is more like this:

Code:

clear input float returnID str1(primID spID) 1 "A" "B" 2 "B" "A" 3 "C" "D" 4 "D" "C" 5 "E" "F" 6 "F" 7 "G" "H" 8 "H" "G" 9 "I" 10 "J" "I" 11 "K" 12 "K" "L" 13 "M" "N" 14 "M" end
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30169
#4

17 Jul 2025, 12:51

Well, this could be problematic. If Mr. Smith is initially unmarried, he files as a single head of household. Suppose he then marries and files jointly with his spouse. Is that the same household? Why isn't it the same household as that of the spouse before they married? And suppose that some years later they divorce, and each now file separate returns. I imagine you no longer want to link either of them to that earlier married household, but rather to their earlier separate households. So it seems to me that to do this properly you need more information: you need to know whether a switch between filing statuses is the result of a change in the household structure, or just represents a change in filing status for financial or other reasons.

To make matters worse, suppose that after a few years of being divorced he remarries another person. Rinse and repeat.

Last edited by Clyde Schechter; 17 Jul 2025, 12:55.
1 like
Comment

Announcement

Grouping IDs from two variables

Comment

Comment

Comment