  • Identifying "fuzzy" duplicates using a numeric range

    I have a dataset of individuals as below. In the example below I include Stata-formatted dates (*_num) as well as equivalent string-format dates (*_date) for clarity on the web here. Each observation is a record (recordID). A person is associated with one or more records (personID). Some records for a given person may be inexact duplicates, which I identify because they have similar (not exact) start and end dates. I want to identify and drop these duplicates.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float recordID str1 personID float(start_num end_num create_num) str9(start_date end_date create_date)
    1 "A" 20512 20759 21428 "28Feb2016" "1Nov2016"  "1Sep2018" 
    2 "A" 20515 20763 21432 "2Mar2016"  "5Nov2016"  "5Sep2018" 
    3 "B" 20514 20635 21442 "1Mar2016"  "30Jun2016" "15Sep2018"
    4 "B" 20636 20878 21442 "1Jul2016"  "28Feb2017" "15Sep2018"
    5 "C" 20636 20878 21442 "1Jul2016"  "28Feb2017" "15Sep2018"
    6 "D" 20514 20544 21428 "1Mar2016"  "31Mar2016" "1Sep2018" 
    7 "D" 20514 20548 21447 "1Mar2016"  "4Apr2016"  "20Sep2018"
    9 "D" 20514 20575 21428 "1Mar2016"  "1May2016"  "1Sep2018" 
    end
    format %td start_num
    format %td end_num
    format %td create_num

    I want to identify groups of records where, for a given person, both the start dates and end dates are within e.g. 7 days of each other, and then keep the record which has the latest date created (create_date or create_num).

    So from the example above I would want to identify records 1 and 2 as duplicates, and then drop record 1. Similarly, records 6 and 7 are duplicates, and I would want to drop record 6.

    The bit I am struggling with is identifying the inexact duplicates. I think I could code some brute-force approach myself that would do the job, but it would be horribly inefficient, and I need to apply this operation to 8 datasets of 15 million observations each. So fast solutions would be particularly valuable.
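
    For concreteness, the brute-force approach I have in mind is roughly the following (untested): form all within-person record pairs with -joinby-, then flag pairs whose start and end dates are both close. The _2-suffixed variables are just my illustration, and at 15 million observations I expect this to be far too slow:

    Code:
    * brute-force sketch: form all within-person record pairs via -joinby-
    preserve
    keep personID recordID start_num end_num create_num
    rename (recordID start_num end_num create_num) =_2
    tempfile pairs
    save `pairs'
    restore
    joinby personID using `pairs'
    * a pair is a near-duplicate if both dates are within 7 days
    gen byte nearpair = recordID != recordID_2 & ///
        abs(start_num - start_num_2) <= 7 & abs(end_num - end_num_2) <= 7
    * drop a record if any near-duplicate pair-mate was created later
    bysort recordID: egen byte drop_me = max(nearpair & create_num < create_num_2)
    bysort recordID: keep if _n == 1
    drop if drop_me
    drop *_2 nearpair drop_me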

    Thank you very much for your time.



  • #2
    Just politely seeking to bump this for the weekend and different timezones.



    • #3
      Originally posted by Walter Bijou:
      Just politely seeking to bump this for the weekend and different timezones.
      -rangejoin- from SSC is likely a good solution. The examples in this thread seem to be close to your challenge and may provide some adaptable code: https://www.statalist.org/forums/for...-using-a-range.
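
      Untested, but adapting that approach to your example might look something like the sketch below. Note that -rangejoin- adds an _U suffix to using-dataset variables; whether pairing on start dates within 7 days and then filtering on end dates matches your definition of a near-duplicate is for you to confirm.

      Code:
      * sketch: pair records of the same person whose start dates are
      * within 7 days of each other, then require close end dates too
      preserve
      keep personID recordID start_num end_num create_num
      tempfile rec
      save `rec'
      restore
      gen low  = start_num - 7
      gen high = start_num + 7
      rangejoin start_num low high using `rec', by(personID)
      gen byte nearpair = recordID != recordID_U & abs(end_num - end_num_U) <= 7
      * flag a record for dropping if some near-duplicate was created later
      bysort recordID: egen byte drop_me = max(nearpair & create_num < create_num_U)
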
      Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX



      • #4
        Welcome to Statalist. And thank you for the very well-presented example that includes human-readable dates.

        The problem I have - and perhaps other members have had - in addressing your question is that the situation can be more complicated than your examples suggest.

        Consider the following person.
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float recordID str1 personID float(start_num end_num create_num) str9(start_date end_date create_date)
        10 "E" 20759 20761 20763 "1Nov2016"  "3Nov2016"  "5Nov2016" 
        11 "E" 20764 20766 20768 "6Nov2016"  "8Nov2016"  "10Nov2016"
        12 "E" 20769 20771 20773 "11Nov2016" "13Nov2016" "15Nov2016"
        end
        format %td start_num
        format %td end_num
        format %td create_num
        We see that by your standards, observations 10 and 11 are duplicates, and observation 10 would be dropped. But then observations 11 and 12 are duplicates, and observation 11 would be dropped. Now we have only observation 12, and observation 10 has been dropped but there is no "near-duplicate" of it in the data - there's a "near duplicate once removed" perhaps, adopting genealogical terminology.

        Another problem comes to mind: is it possible that there can be multiple observations for the same person created on the same date? That causes problems with your plan for selecting which of a pair of near-duplicates to keep.

        Now let me suggest a different perspective. We see a fair number of questions like this, where the start and end dates from different observations overlap, and the objective is to summarize "spells" within the data. So for example person A had two overlapping spells - 28Feb2016 through 1Nov2016 and 2Mar2016 through 5Nov2016. We would combine these into a single spell from 28Feb2016 through 5Nov2016. And a more general definition of spells can accept non-overlapping spells with brief gaps between them, so that person E may perhaps be best represented as a single spell from 1Nov2016 through 13Nov2016.
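
        For illustration only, here is a sketch of that spell-combining calculation with a 7-day gap tolerance. The running maximum run_end absorbs records nested inside longer ones; whether this representation suits your data is a substantive question.

        Code:
        * combine overlapping or near-adjacent (gap <= 7 days) records into spells
        bysort personID (start_num end_num): gen run_end = end_num
        by personID: replace run_end = cond(start_num > run_end[_n-1] + 7, ///
            end_num, max(end_num, run_end[_n-1])) if _n > 1
        by personID: gen spell = sum(_n == 1 | start_num > run_end[_n-1] + 7)
        collapse (min) start_num (max) end_num (max) create_num, by(personID spell)
        format start_num end_num create_num %td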

        Nick Cox has written on this in two Stata Journal articles, both freely available online.

        https://journals.sagepub.com/doi/pdf...867X0700700209

        https://journals.sagepub.com/doi/pdf...867X1501500121

        Perhaps it would be helpful to think about your problem from this perspective.



        • #5
          Eric, thank you for pointing me in that direction! I will try to work with -rangejoin- and will report back here.

          William, thank you for your carefully considered response. I had not thought about the possibility of a person "E" at all. I think (hope) that there would be relatively few records like this in my data; I have not come across any like this in my work thus far. However, I think that if I do come across records like this, I could perhaps flag them, review them manually, and make a decision about how to go forward.

          Re cases where the "date created" is the same: this is possible, but I think it is in the nature of the data that such cases should be very few. I would probably drop one of the duplicate records at random in these cases.
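
          For that tie-break I have in mind something like the following, where dupgrp is a hypothetical identifier for a flagged group of near-duplicates, however it ends up being constructed:

          Code:
          * keep the latest-created record per duplicate group, breaking
          * create-date ties at random; unflagged records are all kept
          set seed 2018
          gen double u = runiform()
          bysort dupgrp (create_num u): keep if _n == _N | missing(dupgrp)
          drop u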

          The reason for my (not ideal) ideas above about how to proceed is that I am not sure how using spells would help with my issue. The issue is that I really do need to drop these "near duplicates", I think. For some context:
          • Each record is a worker payroll tax record. I am interested in getting monthly labor income as well as employment duration per person per tax year.
          • Sometimes workers have multiple genuine payroll records per tax year - e.g. a form is submitted quarterly by that employer. In these cases I want to sum up their work duration and income across all 4 records. Another case is that a worker has two jobs at an employer which overlap slightly but are still 2 separate genuine jobs submitted on separate payroll forms. Again, I would want to sum the duration and income across the 2 records.
          • In other cases, there are multiple records for a worker because the employer has submitted a form *revision*, and both the original and revised forms are reflected in the data. I want to drop the original record. I do not want to sum duration and income across the two records - as using a spell would imply? - because this would spuriously double that worker's income and job duration.
          If I have misunderstood and using spells would still be a useful solution to my problem, please just let me know.






          • #6
            Just to report back: in the time I had with this dataset, I did not have an opportunity to try to solve this problem using -rangejoin-. I will try to implement that solution when I am back at the data, but that will be in a few months' time, and currently I must focus on other priorities. Thank you to everyone for their help.
