Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Grouping observations by relative date to remove duplicates

    I have a dataset with a several million observation, some of which are duplicates. I would like to process through the data trying to find these duplicates when some values of a variable are close in time to another. For example, any observation having the same first and last name, and with a start_date somewhere within 30 days of the "first" observation in that group would be considered a group, and then I could remove duplicates.

    Code:
    clear
    input obs_num str5(first_name last_name) str9 str_date
    1 John Smith 20may2024
    2 John Smith 25may2024
    3 John Smith 04jun2024
    4 John Smith 30jun2024
    5 John Smith 30jun2024
    6 John Smith 12jul2024
    7 John Smith 12apr2025
    8 Sally Smith 13may2024
    9 Sally Smith 21nov2025
    end
    
    gen double start_date = date(str_date, "DMY")
    format start_date %td
    drop str_date
    list



    Ideally I would see the first three observations be listed in group 1, the next three in group 2, and the last three observations each be their own groups, 3, 4, & 5, respectively. I tried writing some code to look at the problem by previous row, but I'm not sure how to handle the 5th and 6th observations, where they are more than 30 days from the first, but less than thirty days from the 4th, which should be their basis value. I thought that I could try iterating the code below a bit, but with so many observations in my dataset I'm not sure how many times I'd need to and what the overall implications might end up being.

    Code:
    sort first_name last_name start_date
    by first_name last_name: gen first = (_n==1)
    by first_name last_name: egen target = min(start_date)
        format target %td
    recode first (0 = 1) if start_date - target > 30
    list
    Is there some elegant way that I can ask Stata to both evaluate whichever previous observation has a "1" stored in the variable first, but also, if the dates are too far away, change first to equal "1" so that the following observations can be based off of it?

  • #2
    I think what you want is this:
    Code:
    gen int start_date = date(str_date, "DMY")
    format start_date %td
    drop str_date
    
    by last_name first_name (start_date), sort: gen int ref_date = start_date if _n == 1
    by last_name first_name (start_date): replace ref_date = ///
        cond(start_date - ref_date[_n-1] <= 30, ref_date[_n-1], start_date) if _n > 1
    format ref_date %td
    by last_name first_name (start_date): gen `c(obs_t)' group = sum(ref_date != ref_date[_n-1])
    As an aside, there is no need to store a date variable as a -double-. In fact, an -int- is large enough. And in a data set with several million observations, that is a lot of wasted storage. Not only is it wasted storage, but its presence may slow down some analyses you might perform.


    Comment


    • #3
      That is great, thank you Clyde.

      Comment

      Working...
      X