  • Creating a simple ID from data with multiple time-varying aliases per observation

    Hello. Long-time-first-time.

    I'm working with prescription drug dispensing data. Each observation describes a prescription. Every observation contains an ID for an individual prescriber, along with any other IDs that are also associated with that prescriber, and a count of the total IDs. You might think of these as aliases instead of IDs. Over time, prescribers accumulate multiple aliases.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double(id id_count associated_id1 associated_id2 associated_id3 associated_id4 new_id)
      2975 1     .      .     .     . 1
      2975 1     .      .     .     . 1
     51803 2 51803 110193     .     . 2
     51803 2 51803 110193     .     . 2
     44428 2 44428 109939     .     . 3
    109939 2 44428 109939     .     . 3
     44428 2 44428 109939     .     . 3
     44428 1     .      .     .     . 3
    110524 2 59090 110524     .     . 4
     59090 2 59090 110524     .     . 4
     26486 2 26486 109388     .     . 5
     26486 2 26486 109388     .     . 5
    165920 .     .      .     .     . 6
    154303 .     .      .     .     . 7
    204406 .     .      .     .     . 8
     20505 4  1579  20505 30500 45800 9
     30500 4  1579  20505 30500 45800 9
      1579 .     .      .     .     . 9
     30500 3  1579  20505 30500     . 9
     45800 4  1579  20505 30500 45800 9
    end


    Above you can see that some observations don't reflect all possible aliases (these sample data are not sorted). Also noteworthy: the "prime" alias ('id') also appears among the associated IDs.

    My goal is to replace the messy situation in the first 6 variables above with a single cohesive ID that actually identifies individuals (that would resemble the 'new_id' variable above). A way to solve this problem in a computationally efficient manner (the full dataset is ~100 million records, 30-odd variables) would be an added bonus.
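    The grouping I'm after is transitive: if two IDs ever appear together on a row, they belong to the same individual, even when no single row lists every alias. A minimal sketch of that idea (in Python rather than Stata, with a hand-picked subset of the example rows, just to illustrate the union-find/connected-components logic):

    ```python
    # Illustrative sketch (Python, not Stata): treat each row as linking its
    # 'id' to its associated IDs, then collapse the links into connected
    # components with union-find. Row data below is a subset of the example.

    def find(parent, x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(parent, a, b):
        parent[find(parent, a)] = find(parent, b)

    # Each row: (id, [associated ids]).
    rows = [
        (2975, []),
        (51803, [51803, 110193]),
        (44428, [44428, 109939]),
        (44428, []),                        # later row missing the aliases
        (20505, [1579, 20505, 30500, 45800]),
        (1579, []),
    ]

    parent = {}
    for rid, aliases in rows:
        for x in [rid] + aliases:
            parent.setdefault(x, x)
        for x in aliases:
            union(parent, rid, x)

    # Label every alias by its component root; all aliases of one
    # prescriber share a label, even across incomplete rows.
    labels = {x: find(parent, x) for x in parent}
    ```

    Mapping those component labels back onto the rows would then give the single cohesive ID.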

    One possible start would be the following:
    Code:
    drop id_count
    
    foreach var of varlist associated_id* {
        replace `var' = . if `var' == id
    }
    
    duplicates drop
    
    egen num = rownonmiss(id associated_id1-associated_id4)
    la var num "Number of unique IDs per row"
    tab num, m
    
    * alt-id for folks with only 2 IDs
    egen altid = rowtotal(id associated_id1-associated_id4) if num == 2


    Thanks.

  • #2
    That's a lot of observations, but I assume that the number of individual practitioners is much smaller. It makes sense to do this at the practitioner level (with its various combinations of IDs), since a solution will require a fair amount of data-management gymnastics. The first step is to remove duplicates in terms of all identifiers. Then a new identifier is created by grouping on the variable id (the code used is equivalent to egen long newid = group(id), but faster).

    In order for this solution to work, the value of the id variable must appear in the associated id variables. You claim this is true but I check for it nevertheless and include it in associated_id1 if it is missing. I assume that if associated_id1 is missing, all other associated variables are missing.

    The next step is to reshape the associated ids to a long layout. If your data is still very large at this stage, you may find that reshape is slow. If that's the case, you may want to search this forum for user-written faster reshape options.

    With the data in a long layout, you will want to group newid codes whenever the values of associated_id match. This is tricky to do by hand but easily achieved with group_id (from SSC). You can then remove duplicates and be left with one observation per id, each paired with its newid. Finally, merge this pairing back to the original data and voila!

    Note that the final codes for newid are based on the initial grouping of id and are different from the one you put together in your example (and stored in new_id). The grouping is the same however.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double(id id_count associated_id1 associated_id2 associated_id3 associated_id4 new_id)
      2975 1     .      .     .     . 1
      2975 1     .      .     .     . 1
     51803 2 51803 110193     .     . 2
     51803 2 51803 110193     .     . 2
     44428 2 44428 109939     .     . 3
    109939 2 44428 109939     .     . 3
     44428 2 44428 109939     .     . 3
     44428 1     .      .     .     . 3
    110524 2 59090 110524     .     . 4
     59090 2 59090 110524     .     . 4
     26486 2 26486 109388     .     . 5
     26486 2 26486 109388     .     . 5
    165920 .     .      .     .     . 6
    154303 .     .      .     .     . 7
    204406 .     .      .     .     . 8
     20505 4  1579  20505 30500 45800 9
     30500 4  1579  20505 30500 45800 9
      1579 .     .      .     .     . 9
     30500 3  1579  20505 30500     . 9
     45800 4  1579  20505 30500 45800 9
    end
    
    * save initial data example
    gen long obs = _n
    save "dataex.dta", replace
    
    * remove duplicates and reduce to id variables
    bysort id associated_id*: keep if _n == 1
    drop id_count new_id
    
    * create new identifier that groups variable id
    sort id obs
    gen long newid = sum(id != id[_n-1])
    
    * verify that id is duplicated in associated ids
    gen found = 0
    foreach v of varlist associated_id* {
        replace found = 1 if `v' == id
    }
    assert found if !mi(associated_id1)
    
    * add the id if there are no associated id
    replace associated_id1 = id if mi(associated_id1)
    
    * reshape to put associated ids into a long layout
    keep associated_id* newid obs
    reshape long associated_id, i(obs)
    drop _j 
    drop if mi(associated_id)
    sort associated_id obs
    drop obs
    
    * group newid if the associated_id matches
    group_id newid, matchby(associated_id)
    
    * remove duplicates and merge back with initial data
    rename associated_id id
    bysort id: keep if _n == 1
    
    merge 1:m id using "dataex.dta", keep(match using) nogen
    
    * show results
    sort obs
    list, sepby(newid)
    and the results:
    Code:
    . list, sepby(newid)
    
         +--------------------------------------------------------------------------------------+
         |     id   newid   id_count   associ~1   associ~2   associ~3   associ~4   new_id   obs |
         |--------------------------------------------------------------------------------------|
      1. |   2975       2          1          .          .          .          .        1     1 |
      2. |   2975       2          1          .          .          .          .        1     2 |
         |--------------------------------------------------------------------------------------|
      3. |  51803       8          2      51803     110193          .          .        2     3 |
      4. |  51803       8          2      51803     110193          .          .        2     4 |
         |--------------------------------------------------------------------------------------|
      5. |  44428       6          2      44428     109939          .          .        3     5 |
      6. | 109939       6          2      44428     109939          .          .        3     6 |
      7. |  44428       6          2      44428     109939          .          .        3     7 |
      8. |  44428       6          1          .          .          .          .        3     8 |
         |--------------------------------------------------------------------------------------|
      9. | 110524       9          2      59090     110524          .          .        4     9 |
     10. |  59090       9          2      59090     110524          .          .        4    10 |
         |--------------------------------------------------------------------------------------|
     11. |  26486       4          2      26486     109388          .          .        5    11 |
     12. |  26486       4          2      26486     109388          .          .        5    12 |
         |--------------------------------------------------------------------------------------|
     13. | 165920      13          .          .          .          .          .        6    13 |
         |--------------------------------------------------------------------------------------|
     14. | 154303      12          .          .          .          .          .        7    14 |
         |--------------------------------------------------------------------------------------|
     15. | 204406      14          .          .          .          .          .        8    15 |
         |--------------------------------------------------------------------------------------|
     16. |  20505       1          4       1579      20505      30500      45800        9    16 |
     17. |  30500       1          4       1579      20505      30500      45800        9    17 |
     18. |   1579       1          .          .          .          .          .        9    18 |
     19. |  30500       1          3       1579      20505      30500          .        9    19 |
     20. |  45800       1          4       1579      20505      30500      45800        9    20 |
         +--------------------------------------------------------------------------------------+
    
    .


    • #3
      My tentative solution at the moment is rather simple:

      Code:
      egen foo_id = rowmin(id associated_id1-associated_id4)
      This is a bit of a shortcut: it relies on the fact that the IDs/aliases in my data seem to be assigned sequentially, so the lowest alias in a row is the one assigned first.

      EDIT: Thanks for your approach, Robert! I'll need a moment to peruse it.
      Last edited by Apostolos Alexandridis; 30 Jan 2018, 14:00.



      • #4
        Yes, that simple approach will work if the aliases indeed accumulate and nothing ever disconnects a new alias series from the earlier ones. I'm not sure how you would check for that other than running the solution I proposed.
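        One cheap sanity check (sketched here in Python rather than Stata, on a hand-made subset of the example rows) is to build the connected components and verify that the row-wise minimum is constant within each component; any component that yields more than one rowmin label is a case where the shortcut breaks:

        ```python
        # Illustrative check (Python, not Stata): verify that the rowmin
        # shortcut assigns one label per connected component of id/alias links.

        def find(parent, x):
            # Follow parent pointers to the root, compressing the path.
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        # Each row: (id, [associated ids]); a subset of the example data.
        rows = [
            (44428, [44428, 109939]),
            (109939, [44428, 109939]),
            (44428, []),
            (20505, [1579, 20505, 30500, 45800]),
            (1579, []),
        ]

        parent = {}
        for rid, aliases in rows:
            for x in [rid] + aliases:
                parent.setdefault(x, x)
            for x in aliases:
                parent[find(parent, rid)] = find(parent, x)

        # Collect the rowmin labels seen inside each component.
        per_component = {}
        for rid, aliases in rows:
            comp = find(parent, rid)
            rowmin = min([rid] + aliases)      # the rowmin shortcut
            per_component.setdefault(comp, set()).add(rowmin)

        # The shortcut is safe for these rows iff each component has one label.
        ok = all(len(labels) == 1 for labels in per_component.values())
        ```

        For instance, if the data also contained an alias-free row for 109939, its rowmin would be 109939 rather than 44428, and ok would come out False.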
