How to identify duplicates of observations in 30 days intervals?

Edith Reyes

Join Date: Feb 2023

Posts: 5
#1

How to identify duplicates of observations in 30 days intervals?

27 Feb 2023, 12:42

Hello everyone!
I'm learning using Stata software and this forum.

I am writing this post because I need your help!

I'd like to know if a command could help me identify duplicates of observations in each 30 days interval in a database of isolated bacteria.

In other words, I want to identify all bacteria isolated in the same patient (observations or rows in our database) that have the same characteristics in other variables, but they were isolated in a period of 30 days. I want to drop these bacteria because they are considered the same sample for our analysis.
For example, in the image, we have two patients with different isolates of bacteria. In the second one, we have Enterococcus faecium isolated five times on different days but all of them are inside a period of 30 days, this means that it is repeated and it isn't useful for the analysis. How can I identify and drop these duplicates?

Thanks a lot!

Edith.
Tags: None
Mike Lacy

Join Date: Apr 2014

Posts: 2449
#2

27 Feb 2023, 13:44

(Software glitch duplicated my post.)
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2449
#3

27 Feb 2023, 13:44

(Software glitch duplicated my post.)
Comment
Edith Reyes

Join Date: Feb 2023

Posts: 5
#4

28 Feb 2023, 09:54

Originally posted by Mike Lacy View Post

(Software glitch duplicated my post.)

Dear Mike, I can't see your answer.
Could you explain it again? Please.
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2449
#5

28 Feb 2023, 13:21

Sorry, when I deleted the duplicates that the StataList posting software weirdly generated, I now see that I accidentally dropped the original posting as well.

My response was that, first, if you posted your example data with -dataex-, as described and prescribed in section 12.2 of the StataList FAQ that new members are asked to read, you'd likely get a precise and helpful answer quickly. Note also that screen images, as you have supplied, aren't as helpful as one might think, and are accordingly not popular here.

Without a -dataex- example, and perhaps some further explanation of what you want, I can offer an abstract answer that may or may not help you: 1) If your date variable is not a Stata date variable, use the appropriate date functions to create that variable (-help date functions-); 2) Drop observations that occur within 30 days of some starting observation. (-help bysort- might be relevant).

To do what you want, my understanding would be that you need to make some substantive decision about which one of a set of duplicates you want to keep. The easiest thing to do would be to keep the earliest one, and then drop all of the ones occurring in the next 30 days. If a duplicate occurred at the 31st day, then you'd start a new set of duplicates, etc. That might not be what you want.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#6

28 Feb 2023, 13:56

To Mike Lacy's advice I will just add that you also need to clarify an ambiguity in your request. Suppose somebody has the same test on day 1, day 15, and day 35. What is to be kept? On the one hand, day 35 is within 30 days of day 15, and day 15 is within 30 days of day 1. So we might just keep day 1. On the other hand, we could also reason differently: day 15 is within 30 days of day 1. Therefore we drop day 15. Now day 35 is no longer within 30 days of anything, so it is retained.
Comment

Edith Reyes

Join Date: Feb 2023
Posts: 5

02 Mar 2023, 08:35

Dear,

Mike Lacy

and

Clyde Schechter

First of all, thank you for your observations, I have already taken them into account to improve my questions and syntaxis.
So, I rewrite the question trying to be more specific and added the dataex to facilitate the example of our database.

As Mr.Schechter said, firstly we want to identify the date of the first sample isolated (DAY 1) in each patient, and from that point, we want to start to count 30 days to identify the duplicate bacterias ("organismo") that have the same values in other variables.

We want:
STEP 1. To identify the same "organismo" in the same "patienid" and "muestra", that have a close date ("f_colect") between them.
STEP 2. To review if they have the same values in all other variables (excluding "f_colect" and "servicio") e.g. "ampi", "anamic", "ceftaz", etc.
STEP 3. To search which bacterias from STEP 2 are in a period of 30 days (taking the first date reported "f_colect" as DAY 1). For that, we need a way to count 30 days since DAY 1 automatically.
STEP 4. To identify "organismo" duplicated inside this 30 days period. Is there any command for that?
STEP 5. To drop the "organism" duplicated every 30 days.

The question here is, what are the commands that help us to identify duplicate data of "organism" with the condition that they have the same values in the other variables, except for "service" and "f_collect", and that they are within a 30-day period (date=f_colect)?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str16 patientid long(servicio muestra) float f_colect long organismo double ampi long(anamic ceftaz) double(cipro clinda ceft cefotx erit lev lnz mxf bpen tetra tgc) long piptz double(van vrcz) long(aztm clor)
"843022" 212 3044 21649 344   . .  .   4 .25   .   . .25   4 1   1   .  16 .25   .   1 . . .
"843022" 212 3044 21655 344   . .  .   4 .25   .   . .25   4 1   1   .  16 .12   .  .5 . . .
"843022" 212 3044 21655 344   . .  .   4 .25   .   . .25   4 1   1   .  16 .25   .  .5 . . .
"843022" 212 3045 21649 327   . .  .   4 .25   .   . .25   4 1   1   .  16 .25   .   1 . . .
"843022" 212 3045 21649 344   . .  .   4 .25   .   . .25   4 1   1   .  16 .25   .   1 . . .
"843022" 212 3045 21653 344   . .  .   4 .25   .   . .25   4 1   1   .  16 .25   .  .5 . . .
"843022" 212 3081 20775 320  32 .  1 .25   .  64   .   .   . .   2   .   .  .5   4   . . . .
"843022" 206 3081 20867 320  32 . 64   4   .  64   .   .   . .   8   .   .  .5  64   . . . .
"843022" 212 3081 20891 320  32 .  4   4   .  64   .   .   . .   8   .   .  .5   4   . . . .
"843022" 212 3081 20894 320  32 .  4   4   .  64   .   .   . .   8   .   .  .5   4   . . . .
"843022" 212 3081 21025 320  32 .  1 .25   .   1   .   .   . .   1   .   .  .5   4   . . . .
"843022" 212 3081 20774 325  32 .  1   1   .   1   .   .   . .   8   .   .   1   4   . . . .
"843022" 212 3081 20774 333  32 . 64   4   .  64   .   .   . .   8   .   .   8 128   . . . .
"843057" 210 3046 21025 343   . .  .   8   8   .   .   8   4 1   1  .5   1 .12   .  .5 . . .
"843057" 210 3046 21025 343   . .  .   8   8   .   .   8   4 2   1  .5   1 .12   .  .5 . . .
"843057" 206 3046 21011 344   . .  .   8   8   .   .   8   8 1   2  .5   2 .12   .   1 . . .
"843057" 206 3046 21011 344   . .  .   8   8   .   .   8   8 1   2  .5   2 .12   .   1 . . .
"843057" 210 3046 21025 344   . .  .   8   8   .   .   8   8 1   2  .5   1 .12   .   1 . . .
"843057" 210 3046 21025 344   . .  .   8   8   .   .   8   8 1   2  .5   1 .12   .   2 . . .
"843057" 210 3046 21027 344   . .  .   8   8   .   .   8   8 1   2  .5   1 .12   .   1 . . .
"843057" 212 3046 21082 344   . .  .   8   8   .   .   8   8 1   2  .5   2 .25   .   1 . . .
"843057" 212 3081 20781 320  32 .  1 .25   .   1   .   .   . .   2   .   .  .5   4   . . . .
"843057" 212 3081 20781 320  32 .  1 .25   .   1   .   .   . .  .5   .   .  .5   4   . . . .
"843057" 212 3081 21297 320  32 .  1 .25   .   1   .   .   . . .25   .   .  .5   4   . . . .
"843057" 212 3081 21314 320  32 . 16   4   .  64   .   .   . .   8   .   .  .5   8   . . . .
"843057" 210 3081 21032 322  32 .  4   1   .  16   .   .   . .   2   .   .  .5  16   . . . .
"843057" 212 3081 21314 322  32 . 16   2   .  64   .   .   . .   2   .   .   1  16   . . . .
"843057" 206 3081 21012 327   . .  .  .5 .25   .   .   8 .12 2 .25 .25   1 .12   .  .5 . . .
"843057" 206 3081 21012 344   . .  .  .5 .25   .   .   8 .12 2 .25 .03   2 .25   .   1 . . .
"843184" 209 3046 20816 344   . .  .   4   8   .   .   8   4 1   1  .5   1 .12   .   1 . . .
"843184" 209 3046 20816 344   . .  .   4   8   .   .   8   4 1   1  .5   1 .12   .   1 . . .
"843184" 209 3046 20850 344   . .  .   8   8   .   .   8   8 1   2  .5   2 .12   .   1 . . .
"843184" 211 3046 21404 344   . .  .   8   8   .   .   8   8 2   2  .5   2 .25   .   1 . . .
"843184" 211 3046 21405 344   . .  .   8   8   .   .   8   8 2   2  .5   2 .25   .   1 . . .
"843184" 211 3046 21406 344   . .  .   8   8   .   .   8   8 2   2  .5   2 .25   .   1 . . .
"843184" 211 3046 21409 344   . .  .   8   8   .   .   8   8 2   2  .5   2  .5   .   1 . . .
"843184" 211 3046 21409 344   . .  .   8   8   .   .   8   8 2   2  .5   2  .5   .   1 . . .
"843184" 211 3046 21419 344   . .  .   8   8   .   .   8   8 2   2  .5   2 .25   .   1 . . .
"843184" 206 3046 21430 344   . .  .   8   8   .   .   8   8 2   2  .5   2 .25   .   1 . . .
"843184" 206 3046 21431 344   . .  .   8   8   .   .   8   8 1   2  .5   2 .25   .   1 . . .
"843184" 206 3046 21434 344   . .  .   8   8   .   .   8   8 1   2  .5   1 .12   .   1 . . .
"843184" 211 3046 21419 347 .25 .  .   . .25 .12 .12   4   1 2   . .06 .25   .   . .12 . . .
"843195" 222 3046 20845 343   . .  .   8   8   .   .   8   8 2   4  .5   1 .12   .  .5 . . .
"843195" 222 3046 20845 343   . .  .   8   8   .   .   8   8 2   4  .5   1 .12   .   1 . . .
"843195" 222 3046 20845 343   . .  .   8   8   .   .   8   8 2   4  .5   1 .12   .  .5 . . .
"843195" 222 3046 20850 343   . .  .   8   8   .   .   8   8 2   4  .5   1 .12   .  .5 . . .
"843195" 222 3046 20851 343   . .  .   8   8   .   .   8   8 2   4  .5   1 .12   .  .5 . . .
"843195" 222 3046 20857 343   . .  .   8   8   .   .   8   8 2   4  .5   1 .12   .   2 . . .
"843195" 222 3046 20857 343   . .  .   8   8   .   .   8   8 2   4  .5   1 .12   .   1 . . .
"843195" 222 3046 20859 343   . .  .   8   8   .   .   8   8 2   4  .5   1 .12   .   1 . . .
"843195" 222 3046 20859 343   . .  .   8   8   .   .   8   8 2   4  .5   1 .12   .   1 . . .
"843195" 222 3046 20815 344   . .  .   8   8   .   .   8   8 1   8  .5   2 .12   .   1 . . .
"843195" 222 3046 20837 345   . .  .   .   .   .   .   .   . .   .   .   .   .   .   . . . .
"843195" 222 3046 20838 345   . .  .   .   .   .   .   .   . .   .   .   .   .   .   . . . .
"843195" 222 3046 20864 345   . .  .   .   .   .   .   .   . .   .   .   .   .   .   . . . .
"843195" 222 3046 20865 345   . .  .   .   .   .   .   .   . .   .   .   .   .   .   . . . .
"843233" 216 3046 21081 327   . .  .  .5 .25   .   . .25 .25 2 .25 .03   1 .12   .  .5 . . .
"843233" 216 3046 20961 343   . .  .   8   8   .   .   8   8 1   2  .5   1 .12   .   2 . . .
"843233" 216 3046 21081 343   . .  .   8   8   .   .   8   8 2   2  .5   1 .12   .  .5 . . .
"843233" 216 3046 21081 343   . .  .   8   8   .   .   8   8 2   2  .5   1 .12   .  .5 . . .
"843233" 216 3046 21082 343   . .  .   8   8   .   .   8   8 2   2  .5   1 .12   .  .5 . . .
"843233" 216 3046 21082 343   . .  .   8   8   .   .   8   8 2   2  .5   1 .12   .  .5 . . .
"843233" 216 3046 21082 343   . .  .   8   8   .   .   8   8 2   2  .5   1 .12   .   1 . . .
"843233" 215 3046 20857 344   . .  .   4   8   .   .   8   4 1   1  .5   2 .12   .   1 . . .
"843233" 216 3046 20912 344   . .  .   8   8   .   .   8   8 1   2  .5   2 .12   .   1 . . .
"843233" 216 3051 21077 343   . .  .   8   8   .   .   8   8 2   2  .5   1 .12   .   1 . . .
"843233" 216 3051 21077 343   . .  .   8   8   .   .   8   8 2   2  .5   1 .12   .  .5 . . .
"843233" 216 3081 21080 320  32 . 64   4   .  64   .   .   . .   8   .   .  .5   8   . . . .
"843233" 216 3081 20954 322  32 . 64   2   .  64   .   .   . .   2   .   .  .5 128   . . . .
"843233" 216 3081 20954 333  32 .  4 .25   .  32   .   .   . .   2   .   .   8   8   . . . .
"843233" 216 3081 21102 333  32 .  2 .25   .  16   .   .   . .   1   .   .   8   4   . . . .
"843233" 216 3081 21104 333  32 .  2 .25   .  16   .   .   . .   1   .   .   8   4   . . . .
"843233" 216 3081 21115 333  32 .  2 .25   .   8   .   .   . .   1   .   .   8   4   . . . .
"843233" 216 3081 21115 333  32 .  2 .25   .  16   .   .   . .   4   .   .   8   4   . . . .
"843233" 216 3081 20941 345   . .  .   .   .   .   .   .  .5 .   .   .   .   .   .   . . . .
"843233" 216 3081 20941 345   . .  .   .   .   .   .   .   . .   .   .   .   .   .   . . . .
"843233" 216 3081 21102 345   . .  .   .   .   .   .   .   4 .   .   .   .   .   .   . . . .
"843233" 216 3081 21102 345   . .  .   .   .   .   .   .   . .   .   .   .   .   .   . . . .
"843233" 216 3081 21104 345   . .  .   .   .   .   .   .   . .   .   .   .   .   .   . . . .
end
format %tdDD/NN/CCYY f_colect
label values servicio service
label def service 206 "Urgencias", modify
label def service 209 "Hemato-oncología", modify
label def service 210 "Infectología", modify
label def service 211 "Medicina interna", modify
label def service 212 "Nefrología", modify
label def service 215 "Neurología", modify
label def service 216 "Neurocirugía", modify
label def service 222 "Terapia médica pediátrica", modify
label values muestra cultivo
label def cultivo 3044 "Hemocultivo central", modify
label def cultivo 3045 "Hemocultivo periférico", modify
label def cultivo 3046 "Hemocultivo sin especificar", modify
label def cultivo 3051 "LCR", modify
label def cultivo 3081 "Urocultivo", modify
label values organismo morganismo
label def morganismo 320 "Escherichia coli", modify
label def morganismo 322 "Klebsiella pneumoniae", modify
label def morganismo 325 "Morganella morganii", modify
label def morganismo 327 "Otros Staph.Coagulasa Negativo", modify
label def morganismo 333 "Pseudomona aeruginosa", modify
label def morganismo 343 "Staphylococcus aureus", modify
label def morganismo 344 "Staphylococcus epidermidis", modify
label def morganismo 345 "Stenotrophomona maltophilia", modify
label def morganismo 347 "Streptococcus viridians group", modify
label values ampi ampi
label values anamic anamic
label values ceftaz ceftaz
label values cipro cipro
label values clinda clinda
label values ceft ceft
label values cefotx cefotx
label values erit erit
label values lev lev
label values lnz lnz
label values mxf mxf
label values bpen bpen
label values tetra tetra
label values tgc tgc
label values piptz piptz
label values van van
label values vrcz vrcz
label values aztm aztm
label values clor clor
label var patientid "Patient ID" 
label var servicio "Patient Location" 
label var muestra "Specimen Source" 
label var organismo "Organism Name" 
label var ampi "AM-Ampicilina" 
label var anamic "AN-Amicacina" 
label var ceftaz "CAZ-Ceftazidima" 
label var cipro "CIP-Ciprofloxacino" 
label var clinda "CM-Clindamicina" 
label var ceft "CRO-Ceftriaxona" 
label var cefotx "CTX-Cefotaxima" 
label var erit "E-Eritromicina" 
label var lev "LEV-Levofloxacino" 
label var lnz "LNZ-Linezolid" 
label var mxf "MXF-Moxifloxacino" 
label var bpen "P-Bencilpenicilina" 
label var tetra "TE-Tetraciclina" 
label var tgc "TGC-Tigeciclina" 
label var piptz "TZP-Piperacilina/Tazobactam" 
label var van "VA-Vancomicina" 
label var vrcz "VRC-Voriconazol" 
label var aztm "ATM-Aztreonam" 
label var clor "C-Cloranfenicol"

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#8

02 Mar 2023, 10:11

Thank you for the extensive -dataex- output. Unfortunately, you did not answer the question I posed in #6, so I'm going to guess. My guess is that if the same organism is identified on days 1, 15, and 31, you wish to drop the day 15 observation but retain the day 36 observation. Moreover, I'm also assuming that you want that day 36 observation to now be treated as a "day 1" for another 30 day period, and so on. If I have guessed wrong, this code will not do what you want. But perhaps it will point you in the right direction and you can adapt it to do what you actually want.

Code:

unab abx: ampi-clor by organismo patientid muestra `abx' (f_colect), sort: /// gen ref_date = f_colect if _n == 1, after(f_colect) by organismo patientid muestra `abx' (f_colect): /// replace ref_date = cond(f_colect-ref_date[_n-1] >= 30, f_colect, ref_date[_n-1]) /// if _n > 1 format ref_date %tdDD/NN/CCYY drop if f_colect != ref_date

Further confusion:
I do not understand what "STEP 4. To identify "organismo" duplicated inside this 30 days period. Is there any command for that?" means. What does it mean to "identify" something? What do you want to do? How would the results look in Stata? So the code takes no particular actions for this step.

You say "all other variables (excluding "f_colect" and "servicio") e.g. "ampi", "anamic", "ceftaz", etc.," but this seems to omit the variable servicio. I notice that in your example data, servicio is always constant within the observations having the same organismo patientid and muestra. If that is true in the full data set (the code does not check this) then it doesn't matter anyway. If not, you may have to change the code to accommodate that. I also note that the other variables involved are all names of antibiotics, and servicio is not. And while I can see that for most purposes servicio would not really help "identify" an organism whereas antibiotic resistance patterns would, I can also envision research questions where servicio would be part of the identity of an organism. Anyway, I leave it to you to work this out.
1 like
Comment
Edith Reyes

Join Date: Feb 2023

Posts: 5
#9

15 Mar 2023, 09:53

Dear

Mr. Schechter

,

Thank you so much for your help.

Sorry for the confusion, we will consider it for the following posts. We think your proposal of command will be extremely helpful for us with some modifications.

Best regards!
Comment

Announcement