  • Changing patients IDs from string to numeric

    Hi all,

    I am working on a big dataset that includes more than a million patients with unique fake IDs. The IDs take the form of "map_code_111000" and are stored as strings. I want to convert them to a numeric format. What would be the best approach to tackle this? I was thinking of either removing the "map_code_" part or extracting the number "111000". Is this an acceptable approach? If yes, what function or syntax should I use? If not, what approach do you recommend?

  • #2
    Among the safest ways:
    Code:
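    // Every observation must carry a patient ID before building the crosswalk
    assert !missing(patient_id)
    
    preserve
    
    // Reduce to one row per distinct patient_id; pid initially holds the frequency
    contract patient_id, freq(pid)
    quietly replace pid = _n // Stata will promote to -long- storage type
    assert !missing(pid)
    
    tempfile patient_list
    quietly save `patient_list'
    
    restore
    // Attach the numeric pid to every observation; every row must match
    merge m:1 patient_id using `patient_list', assert(match) nogenerate noreport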
    If you want to keep your patient ID crosswalk table for later reference, then change the temporary file in the code snippet above to a permanent one.
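    A sketch of that change, assuming a hypothetical filename patient_crosswalk.dta in the current working directory; the rest follows the snippet above unchanged.

    ```stata
    preserve

    contract patient_id, freq(pid)
    quietly replace pid = _n

    // Permanent crosswalk instead of a tempfile (filename is illustrative)
    quietly save "patient_crosswalk.dta", replace

    restore
    merge m:1 patient_id using "patient_crosswalk.dta", assert(match) nogenerate noreport
    ```

    The replace option overwrites an existing file of that name, so drop it if you would rather have Stata stop with an error in that case.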
    Last edited by Joseph Coveney; 12 Oct 2023, 00:58. Reason: generate(pid) → freq(pid)

    • #3
      You can also use egen ... tag() after sorting the dataset, and follow up with the sum() function to take a running sum of the tags. You won't automatically get a crosswalk table as an intermediate result. If that's important to you, then you can get it afterward with bysort patient_id: keep if _n == 1.
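      A minimal sketch of that approach, assuming the string variable is named patient_id as in #2. The names tagged and pid and the crosswalk filename are illustrative only. The sort here is done after tagging, so that the tagged observation is guaranteed to lead its group before the running sum is taken.

      ```stata
      // Tag exactly one observation per distinct patient_id
      egen byte tagged = tag(patient_id)

      // Put the tagged observation first within each patient_id group
      gsort patient_id -tagged

      // Running sum of the tags gives 1, 2, 3, ... per distinct ID
      generate long pid = sum(tagged)
      drop tagged
      assert !missing(pid)

      // Optional: build the crosswalk afterward without disturbing the data in memory
      preserve
      bysort patient_id: keep if _n == 1
      keep patient_id pid
      save "patient_crosswalk.dta", replace   // hypothetical filename
      restore
      ```

      Declaring pid as long matters with more than a million patients, since the default float storage type cannot represent every integer beyond about 16.7 million exactly.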