  • Creating a simple ID from data with multiple time-varying aliases per observation

    Hello. Long-time-first-time.

    I'm working with prescription drug dispensing data. Each observation describes a prescription. Every observation contains an ID for an individual prescriber, along with any other IDs that are also associated with that prescriber, and a count of the total IDs. You might think of these as aliases instead of IDs. Over time, prescribers accumulate multiple aliases.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double(id id_count associated_id1 associated_id2 associated_id3 associated_id4 new_id)
      2975 1     .      .     .     . 1
      2975 1     .      .     .     . 1
     51803 2 51803 110193     .     . 2
     51803 2 51803 110193     .     . 2
     44428 2 44428 109939     .     . 3
    109939 2 44428 109939     .     . 3
     44428 2 44428 109939     .     . 3
     44428 1     .      .     .     . 3
    110524 2 59090 110524     .     . 4
     59090 2 59090 110524     .     . 4
     26486 2 26486 109388     .     . 5
     26486 2 26486 109388     .     . 5
    165920 .     .      .     .     . 6
    154303 .     .      .     .     . 7
    204406 .     .      .     .     . 8
     20505 4  1579  20505 30500 45800 9
     30500 4  1579  20505 30500 45800 9
      1579 .     .      .     .     . 9
     30500 3  1579  20505 30500     . 9
     45800 4  1579  20505 30500 45800 9
    end


    Above you can see that some observations don't reflect all possible aliases (these sample data are not sorted). Also noteworthy: the "prime" alias ('id') also appears among the associated IDs.

    My goal is to replace the messy situation in the first 6 variables above with a single cohesive ID that actually identifies individuals (that would resemble the 'new_id' variable above). A way to solve this problem in a computationally efficient manner (the full dataset is ~100 million records, 30-odd variables) would be an added bonus.
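    The grouping I'm after is transitive: if two IDs ever appear together on a row, they belong to the same individual, even when no single row lists every alias. A minimal sketch of that idea (in Python rather than Stata, with a hand-picked subset of the example rows, just to illustrate the union-find/connected-components logic):

    ```python
    # Illustrative sketch (Python, not Stata): treat each row as linking its
    # 'id' to its associated IDs, then collapse the links into connected
    # components with union-find. Row data below is a subset of the example.

    def find(parent, x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(parent, a, b):
        parent[find(parent, a)] = find(parent, b)

    # Each row: (id, [associated ids]).
    rows = [
        (2975, []),
        (51803, [51803, 110193]),
        (44428, [44428, 109939]),
        (44428, []),                        # later row missing the aliases
        (20505, [1579, 20505, 30500, 45800]),
        (1579, []),
    ]

    parent = {}
    for rid, aliases in rows:
        for x in [rid] + aliases:
            parent.setdefault(x, x)
        for x in aliases:
            union(parent, rid, x)

    # Label every alias by its component root; all aliases of one
    # prescriber share a label, even across incomplete rows.
    labels = {x: find(parent, x) for x in parent}
    ```

    Mapping those component labels back onto the rows would then give the single cohesive ID.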

    One possible start would be the following:
    Code:
    drop id_count
    
    foreach var of varlist associated_id* {
        replace `var' = . if `var' == id
    }
    
    duplicates drop
    
    egen num = rownonmiss(id associated_id1-associated_id4)
    la var num "Number of unique IDs per row"
    tab num, m
    
    * alt-id for folks with only 2 IDs
    egen altid = rowtotal(id associated_id1-associated_id4) if num == 2


    Thanks.

  • #2
    That's a lot of observations, but I assume that the number of individual practitioners is much smaller. It makes sense to do this at the practitioner level (with its various combinations of IDs), since a solution will require a fair amount of data-management gymnastics. The first step is to remove duplicates in terms of all identifiers. Then a new identifier is created by grouping on the variable id (the code used is equivalent to egen long newid = group(id), but faster).

    In order for this solution to work, the value of the id variable must appear in the associated id variables. You claim this is true but I check for it nevertheless and include it in associated_id1 if it is missing. I assume that if associated_id1 is missing, all other associated variables are missing.

    The next step is to reshape the associated ids to a long layout. If your data is still very large at this stage, you may find that reshape is slow. If that's the case, you may want to search this forum for user-written faster reshape options.

    With the data in a long layout, you will want to group newid codes whenever the values of associated_id match. This is tricky to do by hand but easily achieved with group_id (from SSC). You can then remove duplicates and be left with one observation per id, each paired with its newid. Finally, merge this pairing back to the original data and voila!

    Note that the final codes for newid are based on the initial grouping of id and are different from the one you put together in your example (and stored in new_id). The grouping is the same however.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double(id id_count associated_id1 associated_id2 associated_id3 associated_id4 new_id)
      2975 1     .      .     .     . 1
      2975 1     .      .     .     . 1
     51803 2 51803 110193     .     . 2
     51803 2 51803 110193     .     . 2
     44428 2 44428 109939     .     . 3
    109939 2 44428 109939     .     . 3
     44428 2 44428 109939     .     . 3
     44428 1     .      .     .     . 3
    110524 2 59090 110524     .     . 4
     59090 2 59090 110524     .     . 4
     26486 2 26486 109388     .     . 5
     26486 2 26486 109388     .     . 5
    165920 .     .      .     .     . 6
    154303 .     .      .     .     . 7
    204406 .     .      .     .     . 8
     20505 4  1579  20505 30500 45800 9
     30500 4  1579  20505 30500 45800 9
      1579 .     .      .     .     . 9
     30500 3  1579  20505 30500     . 9
     45800 4  1579  20505 30500 45800 9
    end
    
    * save initial data example
    gen long obs = _n
    save "dataex.dta", replace
    
    * remove duplicates and reduce to id variables
    bysort id associated_id*: keep if _n == 1
    drop id_count new_id
    
    * create new identifier that groups variable id
    sort id obs
    gen long newid = sum(id != id[_n-1])
    
    * verify that id is duplicated in associated ids
    gen found = 0
    foreach v of varlist associated_id* {
        replace found = 1 if `v' == id
    }
    assert found if !mi(associated_id1)
    
    * add the id if there are no associated id
    replace associated_id1 = id if mi(associated_id1)
    
    * reshape to put associated ids into a long layout
    keep associated_id* newid obs
    reshape long associated_id, i(obs)
    drop _j 
    drop if mi(associated_id)
    sort associated_id obs
    drop obs
    
    * group newid if the associated_id matches
    group_id newid, matchby(associated_id)
    
    * remove duplicates and merge back with initial data
    rename associated_id id
    bysort id: keep if _n == 1
    
    merge 1:m id using "dataex.dta", keep(match using) nogen
    
    * show results
    sort obs
    list, sepby(newid)
    and the results:
    Code:
    . list, sepby(newid)
    
         +--------------------------------------------------------------------------------------+
         |     id   newid   id_count   associ~1   associ~2   associ~3   associ~4   new_id   obs |
         |--------------------------------------------------------------------------------------|
      1. |   2975       2          1          .          .          .          .        1     1 |
      2. |   2975       2          1          .          .          .          .        1     2 |
         |--------------------------------------------------------------------------------------|
      3. |  51803       8          2      51803     110193          .          .        2     3 |
      4. |  51803       8          2      51803     110193          .          .        2     4 |
         |--------------------------------------------------------------------------------------|
      5. |  44428       6          2      44428     109939          .          .        3     5 |
      6. | 109939       6          2      44428     109939          .          .        3     6 |
      7. |  44428       6          2      44428     109939          .          .        3     7 |
      8. |  44428       6          1          .          .          .          .        3     8 |
         |--------------------------------------------------------------------------------------|
      9. | 110524       9          2      59090     110524          .          .        4     9 |
     10. |  59090       9          2      59090     110524          .          .        4    10 |
         |--------------------------------------------------------------------------------------|
     11. |  26486       4          2      26486     109388          .          .        5    11 |
     12. |  26486       4          2      26486     109388          .          .        5    12 |
         |--------------------------------------------------------------------------------------|
     13. | 165920      13          .          .          .          .          .        6    13 |
         |--------------------------------------------------------------------------------------|
     14. | 154303      12          .          .          .          .          .        7    14 |
         |--------------------------------------------------------------------------------------|
     15. | 204406      14          .          .          .          .          .        8    15 |
         |--------------------------------------------------------------------------------------|
     16. |  20505       1          4       1579      20505      30500      45800        9    16 |
     17. |  30500       1          4       1579      20505      30500      45800        9    17 |
     18. |   1579       1          .          .          .          .          .        9    18 |
     19. |  30500       1          3       1579      20505      30500          .        9    19 |
     20. |  45800       1          4       1579      20505      30500      45800        9    20 |
         +--------------------------------------------------------------------------------------+
    
    .


    • #3
      My tentative solution at the moment is rather simple:

      Code:
      egen foo_id = rowmin(id associated_id1-associated_id4)
      This is a bit of a shortcut: it relies on the fact that the IDs/aliases in my data seem to be assigned sequentially, so the lowest alias in a row is the one assigned first.

      EDIT: Thanks for your approach, Robert! I'll need a moment to peruse it.
      Last edited by Apostolos Alexandridis; 30 Jan 2018, 14:00.



      • #4
        Yes, that simple approach will work if the aliases indeed accumulate and nothing ever disconnects a new alias series from the earlier ones. I'm not sure how you would check for that other than running the solution I proposed.
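        One cheap sanity check (sketched here in Python rather than Stata, on a hand-made subset of the example rows) is to build the connected components and verify that the row-wise minimum is constant within each component; any component that yields more than one rowmin label is a case where the shortcut breaks:

        ```python
        # Illustrative check (Python, not Stata): verify that the rowmin
        # shortcut assigns one label per connected component of id/alias links.

        def find(parent, x):
            # Follow parent pointers to the root, compressing the path.
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        # Each row: (id, [associated ids]); a subset of the example data.
        rows = [
            (44428, [44428, 109939]),
            (109939, [44428, 109939]),
            (44428, []),
            (20505, [1579, 20505, 30500, 45800]),
            (1579, []),
        ]

        parent = {}
        for rid, aliases in rows:
            for x in [rid] + aliases:
                parent.setdefault(x, x)
            for x in aliases:
                parent[find(parent, rid)] = find(parent, x)

        # Collect the rowmin labels seen inside each component.
        per_component = {}
        for rid, aliases in rows:
            comp = find(parent, rid)
            rowmin = min([rid] + aliases)      # the rowmin shortcut
            per_component.setdefault(comp, set()).add(rowmin)

        # The shortcut is safe for these rows iff each component has one label.
        ok = all(len(labels) == 1 for labels in per_component.values())
        ```

        For instance, if the data also contained an alias-free row for 109939, its rowmin would be 109939 rather than 44428, and ok would come out False.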
