Hello. Long-time-first-time.
I'm working with prescription drug dispensing data. Each observation describes a prescription. Every observation contains an ID for an individual prescriber, along with any other IDs that are also associated with that prescriber, and a count of the total IDs. You might think of these as aliases instead of IDs. Over time, prescribers accumulate multiple aliases.
Above you can see how some observations don't reflect all possible aliases (these sample data are not sorted). Also noteworthy is that the "prime" alias ('id') is duplicated somewhere in the associated IDs.
My goal is to replace the messy situation in the first 6 variables above with a single cohesive ID that actually identifies individuals (that would resemble the 'new_id' variable above). A way to solve this problem in a computationally efficient manner (the full dataset is ~100 million records, 30-odd variables) would be an added bonus.
One possible start would be the following:
Thanks.
I'm working with prescription drug dispensing data. Each observation describes a prescription. Every observation contains an ID for an individual prescriber, along with any other IDs that are also associated with that prescriber, and a count of the total IDs. You might think of these as aliases instead of IDs. Over time, prescribers accumulate multiple aliases.
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input double(id id_count associated_id1 associated_id2 associated_id3 associated_id4 new_id) 2975 1 . . . . 1 2975 1 . . . . 1 51803 2 51803 110193 . . 2 51803 2 51803 110193 . . 2 44428 2 44428 109939 . . 3 109939 2 44428 109939 . . 3 44428 2 44428 109939 . . 3 44428 1 . . . . 3 110524 2 59090 110524 . . 4 59090 2 59090 110524 . . 4 26486 2 26486 109388 . . 5 26486 2 26486 109388 . . 5 165920 . . . . . 6 154303 . . . . . 7 204406 . . . . . 8 20505 4 1579 20505 30500 45800 9 30500 4 1579 20505 30500 45800 9 1579 . . . . . 9 30500 3 1579 20505 30500 . 9 45800 4 1579 20505 30500 45800 9 end
Above you can see how some observations don't reflect all possible aliases (these sample data are not sorted). Also noteworthy is that the "prime" alias ('id') is duplicated somewhere in the associated IDs.
My goal is to replace the messy situation in the first 6 variables above with a single cohesive ID that actually identifies individuals (that would resemble the 'new_id' variable above). A way to solve this problem in a computationally efficient manner (the full dataset is ~100 million records, 30-odd variables) would be an added bonus.
One possible start would be the following:
Code:
drop id_count foreach var of varlist associated_id* { replace `var'=. if `var'==id } duplicates drop egen num=rownonmiss(id associated_id1-associated_id4) la var num "Number of unique IDs per row" tab num, m * alt-id for folks with only 2 IDs egen altid=rowtotal(id associated_id1-associated_id4) if num==2
Thanks.
Comment