I have a dataset of individuals, shown below. The example includes Stata-formatted dates (*_num) as well as equivalent string-format dates (*_date) for readability here. Each observation is a record (recordID). A person is associated with one or more records (personID). Some records for a given person may be inexact duplicates, which I identify by their similar (but not identical) start and end dates. I want to identify and drop these duplicates.
I want to identify groups of records where, for a given person, both the start dates and the end dates are within, say, 7 days of each other, and then keep only the record with the latest creation date (create_date or create_num).
So from the example below I would want to identify records 1 and 2 as duplicates, and then drop record 1. Similarly, records 6 and 7 are duplicates, and I would want to drop record 6.
The bit I am struggling with is identifying the inexact duplicates. I think I could code some brute-force approach myself that would do the job, but it would be horribly inefficient, and I need to apply this operation to 8 datasets of 15 million observations each. So fast solutions would be particularly valuable.
Thank you very much for your time.
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float recordID str1 personID float(start_num end_num create_num) str9(start_date end_date create_date)
1 "A" 20512 20759 21428 "28Feb2016" "1Nov2016"  "1Sep2018"
2 "A" 20515 20763 21432 "2Mar2016" "5Nov2016"  "5Sep2018"
3 "B" 20514 20635 21442 "1Mar2016" "30Jun2016" "15Sep2018"
4 "B" 20636 20878 21442 "1Jul2016" "28Feb2017" "15Sep2018"
5 "C" 20636 20878 21442 "1Jul2016" "28Feb2017" "15Sep2018"
6 "D" 20514 20544 21428 "1Mar2016" "31Mar2016" "1Sep2018"
7 "D" 20514 20548 21447 "1Mar2016" "4Apr2016"  "20Sep2018"
9 "D" 20514 20575 21428 "1Mar2016" "1May2016"  "1Sep2018"
end
format %td start_num
format %td end_num
format %td create_num
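For reference, here is a rough sketch of the kind of brute-force approach I had in mind. It avoids pairwise comparison by sorting within person and chaining consecutive records into a group whenever both dates are within the tolerance; it therefore assumes near-duplicates end up adjacent after sorting on start date, which may not hold in general. The variable name grp is my own.

Code:
* Sketch only, untested at scale: chain consecutive near-duplicates
* after sorting by start date within person.
sort personID start_num end_num
gen long grp = 1
* Start a new group when either date differs by more than 7 days
* from the previous record of the same person.
by personID: replace grp = grp[_n-1] + ///
    (abs(start_num - start_num[_n-1]) > 7 | ///
     abs(end_num - end_num[_n-1]) > 7) if _n > 1
* Within each group, keep the record with the latest create date.
gsort personID grp -create_num
by personID grp: keep if _n == 1

I am not sure whether this chaining logic, or something faster, is the right way to go for 15 million observations.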