Generating a unique case id

Adriana Peci

Join Date: Nov 2014

Posts: 32
#1

Generating a unique case id

04 Mar 2016, 14:09

I need some help to create a unique case id per patient. My dataset have duplicate records per patient. I am trying to identify unique patients by patient health card number (HC) and if HC number is missing by patient's last name & first_name & birth_date. In the dataset, I may have inconsistent entry of HC numbers meaning that some patients maybe recorded with complete information (HC, first_name, last_name and date of birth) in one case and in another instance without HC information. Similar situations may ocurr with various errors in first name or last name or date of birth. I want to create a unique patient id that accounts for all of these issues. I will provide a dataset to illustrate my situation. hc last_name first_name birth_date case_id
-------------------------------------------------------------------------
1. 333 Smith John Jan 1 2007 1
2. . Smith John Jan 1 2007 1
3. 547 Marek Rob Jun 12 2007 2
4. 487 Red Flower May 2 2002 3
5. 487 Red White(Flower) May 2 2002 3
-------------------------------------------------------------------------
6. 333 Smith Jon Jan 1 2007 1
+------------------------------------------------------------------------+

Thank you in advance for all your time and efforts.

Adriana Peci
Tags: None
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#2

04 Mar 2016, 14:40

Although you have now posted this in the correct forum, you did not follow the instructions I provided yesterday:

You want to post this in the General Forum and please read the FAQ, especially on how to post data samples using dataex (read FAQ #12). This forum is for how to use the forums.

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#3

04 Mar 2016, 14:46

Well, it's easy to get started on this:

Code:

egen unique_id = group(HC last_name first_name birth_date) by last_name first_name birth_date (unique_id), sort: replace unique_id = unique_id[1]

This will spread the same unique ID to all observations with the same last name, first name, and birth date.

From that point on it gets (potentially much) harder. You now need to deal with omissions, incorrect dates, and name spelling errors or variants. Inexact matching is a topic that comes up frequently on the forum. It is a very broad topic, and there is no single simple solution. The data example you provide has only one instance of name variation, which is simply not enough to give us a sense of the types of mismatches you are facing and how numerous they are. Sometimes you are lucky and the number of aberrant cases is small enough that you can just edit the data by hand. Some times you have to resort to trying to match up names using the -soundex()- function or -strdist- command. Since there are also birthdates in the mix, a more general purpose inexact matching tool such as -reclink- (available from SSC).

Finally, for future reference, the preferred method for showing sample data is with the -dataex- command. If you don't have it already, run -ssc install dataex-. -help dataex- provides the simple directions for using it. Doing that will enable those who might want to test out code on your data to replicate your sample data quickly, easily, and faithfully. Please use it going forward when you need to show sample data.

Added note: crossed in cyberspace with Carole Wilson's response. And I was not aware that this had been previously posted elsewhere.
Comment
Adriana Peci

Join Date: Nov 2014

Posts: 32
#4

09 Mar 2016, 10:40

Thank you so much for your helpful response. Sorry I was unable to respond earlier.
Next time I will make sure to use the dataex command.

Adriana
Comment
Adriana Peci

Join Date: Nov 2014

Posts: 32
#5

15 Mar 2016, 11:58

Hi there, I tried your code above and it did work generally well when you have no missing information. However, there were two issues: In situation where there was missing HC number or any of the other identifiers, the code didn't create a unique id at all. In addition, patients with the same HC number but error in birthday were counted as two different people. Not sure how you can fix it ?? Thanks for any advice
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#6

15 Mar 2016, 12:42

Well, as I said in #3, my solution was only partial and only applicable to people with relatively clean data, and from that point forward it gets more complicated. There are no completely general solutions to the remainder of your problem. If you have only a small number of problem cases, the best solution may just be to fix them manually with a series of -replace- statements, or in the Data Editor (with, in either case, logging turned on so you have a record of what you did).

If the remaining problems are too numerous for manual repair, Michael Blasnik's -reclink- program (-ssc install reclink-) is one approach that I have often found helpful when "fuzzy" matching is needed. But it can be complicated to use and there is no guarantee of success. I believe there are other user-written programs for fuzzy matching, but I am not really familiar with them.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#7

15 Mar 2016, 13:12

Unfortunately wanting a program to give the right answer even with the wrong data requires inbuilt intelligence about what is wrong and how to put it right.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

15 Mar 2016, 13:26

I would use a gradual approach that builds on fair assumptions about the data. The health card number surely identify patients unless there were typos when entered in the database. Similarly, if the first, last, and dob match, it's likely the same person. You can use group_id (from SSC) to build a case identifier and check at each step for inconsistencies. To install group_id, type in Stata Command window

Code:

ssc install group_id

Here's a slightly modified data example to show some possible bad groupings and how to identify them

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int hc str5 last_name str13 first_name str11 birth_date
333 "Smith" "John"          "Jan 1 2007"
  . "Smith" "John"          "Jan 1 2007"
547 "Marek" "Rob"           "Jun 12 2007"
487 "Red"   "Flower"        "May 2 2002"
487 "Red"   "White(Flower)" "May 2 2002"
888 "Snow"  "John"          "Jan 1 2000"
333 "Smith" "Jon"           "Jan 1 2007"
383 "Smith" "Jon"           "Jan 1 2007"
547 "Marek" "Rob"           ""
  . "Snow"  "John"          "Jan 1 2000"
888 "Snow"  "John"          "Jan 12 2000"
end

* string data is usually easier to work with if it is left-aligned
* to install, type in Stata's Command window: ssc install leftalign
leftalign

* observations should always have a unique identifier
gen long obs_id = _n

* initially, each observation has a unique case identifier
gen long cid = obs_id

* group case identifier for obs with the same hc code
group_id cid, match(hc)

* group case identifier for obs with the same name and dob info
group_id cid, match(last_name first_name birth_date)

* check case identifier codes with multiple hc codes
bysort cid (hc obs_id): egen check1 = total(hc != hc[1] & !mi(hc))

* check case identifier codes with different dob (assumes that it is string)
bysort cid (birth_date obs_id): egen check2 = total(birth_date != birth_date[_N] & !mi(birth_date))

and the results

Code:

. list, sepby(cid)

     +-------------------------------------------------------------------------------+
     |  hc   last_n~e   first_name      birth_date    obs_id   cid   check1   check2 |
     |-------------------------------------------------------------------------------|
  1. | 333   Smith      John            Jan 1 2007         1     1        1        0 |
  2. |   .   Smith      John            Jan 1 2007         2     1        1        0 |
  3. | 333   Smith      Jon             Jan 1 2007         7     1        1        0 |
  4. | 383   Smith      Jon             Jan 1 2007         8     1        1        0 |
     |-------------------------------------------------------------------------------|
  5. | 547   Marek      Rob                                9     3        0        0 |
  6. | 547   Marek      Rob             Jun 12 2007        3     3        0        0 |
     |-------------------------------------------------------------------------------|
  7. | 487   Red        Flower          May 2 2002         4     4        0        0 |
  8. | 487   Red        White(Flower)   May 2 2002         5     4        0        0 |
     |-------------------------------------------------------------------------------|
  9. | 888   Snow       John            Jan 1 2000         6     6        0        2 |
 10. |   .   Snow       John            Jan 1 2000        10     6        0        2 |
 11. | 888   Snow       John            Jan 12 2000       11     6        0        2 |
     +-------------------------------------------------------------------------------+

Last edited by Robert Picard; 15 Mar 2016, 13:30.

Comment

Adriana Peci

Join Date: Nov 2014

Posts: 32
#9

17 Mar 2016, 07:33

Hi Robert,
I tried your step wise code and I am so happy . Not only helps me create a unique id but also helps me identify all the cases that have errors in unique identifiers which is pretty common thing in a public health database.
Thank you so much to you and all of others who tried to help.
You really make our work easier!

Best regards,

Adriana
Comment

Announcement