Identifying closest date to another date variable.

Kerry Clifton

Join Date: May 2017
Posts: 18

Identifying closest date to another date variable.

29 May 2017, 16:22

Good Afternoon,

I have a dataset which is formatted as follows :

id	interviewdate	bloodtestdate	bloodtestvalue
1	01feb2009	01mar2009	106
1	01feb2009	06apr2009	107
2	24aug2012	17jan2009	500
2	24aug2012	08feb2010	502
2	24aug2012	06apr2011	509
2	24aug2012	02mar2012	400
3	06jan2007	06may2006	350

...and so on (note that these aren't the same dates as in my set, this is an example).

The interviewdate and bloodtestdate are both in 'long' storage type, with '%d' display format.

The 'interviewdate' for each participant is the same, as these are their baseline interview dates.

I am interested in selecting the bloodtestdate which is the closest date to the interview date plus or minus one year.

I thought about dropping all of the bloodtestdate variables that don't fall within a year of the interview date, tagging the first observation with the egen command and going from here but then if a participant has had multiple dates in one year, it won't select the closest date to the interview date.

I have many, many blood test variables that I need to do this for as my analysis goes on. Any help would be greatly appreciated.

Thank you,

Kerry

Tags: date data

William Lisowski

Join Date: Dec 2014
Posts: 10150

29 May 2017, 17:13

Welcome to Statalist, Kerry.

There are a number of ways of accomplishing this. My example below, based on your data, demonstrates some basic technique and is, I hope, simple to follow. It can be tightened up in several ways, but that's an exercise for the reader. You will note that I added id 4 to demonstrate that if there is a tie for closest blood test date, the one before the interview is chosen in preference to the one following.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id float(intdt bltdt) int bltval
1 17929 17957 106
1 17929 17993 107
2 19229 17914 500
2 19229 18301 502
2 19229 18723 509
2 19229 19054 400
3 17172 16927 350
4 17929 17919 106
4 17929 17939 107
end
format %td intdt
format %td bltdt
* make sure interview dates don't differ within an id
bysort id: assert intdt[1]==intdt[_N]
bysort id: generate gap = bltdt-intdt
bysort id: generate absgap = abs(gap) if abs(gap)<=365
bysort id (absgap bltdt): gen nearest = intdt[1]
format %td nearest
sort id bltdt

Code:

. list, sepby(id)

     +------------------------------------------------------------------+
     | id       intdt       bltdt   bltval     gap   absgap     nearest |
     |------------------------------------------------------------------|
  1. |  1   01feb2009   01mar2009      106      28       28   01feb2009 |
  2. |  1   01feb2009   06apr2009      107      64       64   01feb2009 |
     |------------------------------------------------------------------|
  3. |  2   24aug2012   17jan2009      500   -1315        .   24aug2012 |
  4. |  2   24aug2012   08feb2010      502    -928        .   24aug2012 |
  5. |  2   24aug2012   06apr2011      509    -506        .   24aug2012 |
  6. |  2   24aug2012   02mar2012      400    -175      175   24aug2012 |
     |------------------------------------------------------------------|
  7. |  3   06jan2007   06may2006      350    -245      245   06jan2007 |
     |------------------------------------------------------------------|
  8. |  4   01feb2009   22jan2009      106     -10       10   01feb2009 |
  9. |  4   01feb2009   11feb2009      107      10       10   01feb2009 |
     +------------------------------------------------------------------+

And as I mentioned on your other post, please let the Statalist FAQ inform you on how to improve your questions. It's a nuisance to take data with Human Readable Format dates and get it into Stata converting the strings to Stata Internal Format dates,

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35756

30 May 2017, 02:49

Another way to do it. Not better, just different.

Code:

* William    
bysort id: generate gap = bltdt-intdt
bysort id: generate absgap = abs(gap) if abs(gap)<=365
bysort id (absgap bltdt): gen nearest = intdt[1]

* alternative
bysort id: egen  minabsgap = min(abs(bltdt-intdt))
replace minabsgap = . if minabsgrap > 365
bysort id (minabsgap bltdt): gen nearest = intdt[1]

Last edited by Nick Cox; 30 May 2017, 03:04.

Comment

Kerry Clifton

Join Date: May 2017

Posts: 18
#4

04 Jun 2017, 15:54

Good Evening,

Thank you very much for your responses. My apologies for not importing my data in the correct format.

I have a few questions:

1. What is the purpose of the 'nearest' variable (as essentially it is just a repeat of the intdt variable)?

2. If I wanted to keep the closest blood test value to the interview date of interest, would you recommend sorting by id and absgap, using the egen command to tag the first observation as 1, and then dropping the rest of the values (i.e. dropping any tag = 0 observations)?

Thank you,

Kerry
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

04 Jun 2017, 16:43

1) The command in my example that reads

Code:

bysort id (absgap bltdt): generate nearest = intdt[1]

was meant to read

Code:

bysort id (absgap bltdt): generate nearest = bltdt[1]

and thus gives the date of the blood test nearest to the interview date, with ties broken in favor of the earlier test (before rather than after the interview date).

Code:

. list, sepby(id)

     +------------------------------------------------------------------+
     | id       intdt       bltdt   bltval     gap   absgap     nearest |
     |------------------------------------------------------------------|
  1. |  1   01feb2009   01mar2009      106      28       28   01mar2009 |
  2. |  1   01feb2009   06apr2009      107      64       64   01mar2009 |
     |------------------------------------------------------------------|
  3. |  2   24aug2012   17jan2009      500   -1315        .   02mar2012 |
  4. |  2   24aug2012   08feb2010      502    -928        .   02mar2012 |
  5. |  2   24aug2012   06apr2011      509    -506        .   02mar2012 |
  6. |  2   24aug2012   02mar2012      400    -175      175   02mar2012 |
     |------------------------------------------------------------------|
  7. |  3   06jan2007   06may2006      350    -245      245   06may2006 |
     |------------------------------------------------------------------|
  8. |  4   01feb2009   22jan2009      106     -10       10   22jan2009 |
  9. |  4   01feb2009   11feb2009      107      10       10   22jan2009 |
     +------------------------------------------------------------------+

2) What you write essentially describes what my corrected command does, except I stopped at calculating the nearest date and appending it to all observations. To simply tag the first observation within each id as 1 and the remainder as 0, egen is not needed. Replacing my (now corrected) command with

Code:

bysort id (absgap bltdt): generate tag = _n==1

will set tag to 1 when _n==1 and 0 otherwise, because that is how Stata represents true and false.

But if your objective is indeed to discard all observations but the nearest you do not need the tag.

Code:

bysort id (absgap bltdt): drop if _n!=1

Comment

Kerry Clifton

Join Date: May 2017

Posts: 18
#6

07 Jun 2017, 15:35

Good Afternoon,

Thank you for the updated code. Everything is working perfectly.

I do have a follow up question. If the absgap is too large (i.e. say the bltdt is greater than five years to intdt), is there a way to tell Stata to use the closest date even if it is after the interview date? I.e. to preferentially select the date that falls before unless it is greater than 'x' amount of time?

Thank you again for your help.

Kind regards,

Kerry
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

08 Jun 2017, 08:06

Introducing extra rules suggests assigning each result to a "group" depending on the rules it matches, and then picking the smallest absgap from within the group with the highest priority. In the example below, the lowest priority is "more than plus or minus a year from the interview date" (from post #1), the second-lowest priority is "more than 50 days before the interview" (modified version of the rule in post #6) and the highest priority is all others. ID 5 in the example demonstrates that 70 days after is preferred to 60 days before. ID 4 again demonstrates that 10 days before is preferred to 10 days after.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id float(intdt bltdt) int bltval
1 17929 17957 106
1 17929 17993 107
2 19229 17914 500
2 19229 18301 502
2 19229 18723 509
2 19229 19054 400
3 17172 16927 350
4 17929 17919 106
4 17929 17939 107
5 17929 17869 106
5 17929 17999 107
end
format %td intdt
format %td bltdt
* make sure interview dates don't differ within an id
bysort id: assert intdt[1]==intdt[_N]
bysort id: generate gap = bltdt-intdt
bysort id: generate absgap = abs(gap)
generate priority = 1
replace priority = 2 if gap<-50
replace priority = 3 if abs(gap)>365
bysort id (priority absgap bltdt): gen nearest = bltdt[1]
format %td nearest
sort id bltdt

Code:

. list, sepby(id)

     +-----------------------------------------------------------------------------+
     | id       intdt       bltdt   bltval     gap   absgap   priority     nearest |
     |-----------------------------------------------------------------------------|
  1. |  1   01feb2009   01mar2009      106      28       28          1   01mar2009 |
  2. |  1   01feb2009   06apr2009      107      64       64          1   01mar2009 |
     |-----------------------------------------------------------------------------|
  3. |  2   24aug2012   17jan2009      500   -1315     1315          3   02mar2012 |
  4. |  2   24aug2012   08feb2010      502    -928      928          3   02mar2012 |
  5. |  2   24aug2012   06apr2011      509    -506      506          3   02mar2012 |
  6. |  2   24aug2012   02mar2012      400    -175      175          2   02mar2012 |
     |-----------------------------------------------------------------------------|
  7. |  3   06jan2007   06may2006      350    -245      245          2   06may2006 |
     |-----------------------------------------------------------------------------|
  8. |  4   01feb2009   22jan2009      106     -10       10          1   22jan2009 |
  9. |  4   01feb2009   11feb2009      107      10       10          1   22jan2009 |
     |-----------------------------------------------------------------------------|
 10. |  5   01feb2009   03dec2008      106     -60       60          2   12apr2009 |
 11. |  5   01feb2009   12apr2009      107      70       70          1   12apr2009 |
     +-----------------------------------------------------------------------------+

Comment

Kerry Clifton

Join Date: May 2017

Posts: 18
#8

19 Jun 2017, 11:11

Hi William,

Great - thank you for your response. I haven't tried it yet as I don't think I will be using the priority rules this round but it will come up for future analyses.

I do have another question. If I want to limit the observations to only those observations that have occurred prior to the interview date, how would I edit the code in Post #2 appropriately?

Thank you again for your help,

Kerry
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#9

20 Jun 2017, 09:38

And what would you want as a result if there is no observation prior to the interview date?

In general, the lesson of post #2 (as corrected in post #5) is that observations that you do not want considered have their "absgap" replaced by a missing value, which Stata will effectively consider larger than any nonmissing value of absgap. For what you now describe, change the replace command so that rather than excluding observations with bltdt more than a year before or after intdt, instead exclude observations with bltdt greater than intdt.
Comment
Kerry Clifton

Join Date: May 2017

Posts: 18
#10

20 Jun 2017, 17:04

Hi William,

If there is no observation to prior to the interview date I will treat it as missing.

Which 'replace' command do you mean?

Thank you,

Kerry
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#11

20 Jun 2017, 18:39

That's what I get for not having coffee at breakfast today. This is what I meant to write.

In general, the lesson of post #2 (as corrected in post #5) is that observations that you do not want considered have their "absgap" set to a missing value. Stata will effectively consider the missing value for absgap larger than any nonmissing value of absgap, so that those values are sorted to the end of the list by the command that chooses the nearest.

In your case,

Code:

bysort id: generate absgap = abs(gap) if abs(gap)<=365

will set absgap to missing when bltdt is more than a year before or after intdt, so that gap<-365 or gap>365, and thus abs(gap)>365.

For what you now describe, change the generate command shown above so that it excludes observations with bltdt greater than intdt.
Comment
Kerry Clifton

Join Date: May 2017

Posts: 18
#12

20 Jun 2017, 22:30

Hi William,

So what I have is this (with dxdate instead of bltdt):

bysort protocol_id: assert interviewdt[1]==interviewdt[_N]
bysort protocol_id: generate gap = dxdate-interviewdt
**New code:
bysort protocol_id: generate absgap = abs(gap) if dxdate>interviewdt

However this doesn't seem to be working for me. There are still observations left that have dxdates > interviewdates.

Also, how do I add a qualifier that limits it to going only back "x" number of days?

The code worked perfectly for my previous objectives but I'm having trouble modifying it to work for me now.

Thanks again for your help,

Kerry
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

#13

21 Jun 2017, 05:40

Let me go back and rewrite post #2 as corrected by #5 to meet your current requirement that the chosen blood test date come before the interview date.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id float(intdt bltdt) int bltval
1 17929 17957 106
1 17929 17993 107
2 19229 17914 500
2 19229 18301 502
2 19229 18723 509
2 19229 19054 400
3 17172 16927 350
4 17929 17919 106
4 17929 17939 107
end
format %td intdt
format %td bltdt
* make sure interview dates don't differ within an id
bysort id: assert intdt[1]==intdt[_N]
// how long from blood test to interview date
// gap will be <0 if before and >0 if after
generate gap = bltdt-intdt
// absolute value of gap to find closest blood test
generate absgap = abs(gap)
// exclude tests that follow the interview
replace absgap = . if bltdt > intdt
// sort by closeness, with ties broken by tests before the interview
//   and drop all but the nearest
bysort id (absgap bltdt): drop if _n>1

Code:

. list, clean

       id       intdt       bltdt   bltval    gap   absgap  
  1.    1   01feb2009   01mar2009      106     28        .  
  2.    2   24aug2012   02mar2012      400   -175      175  
  3.    3   06jan2007   06may2006      350   -245      245  
  4.    4   01feb2009   22jan2009      106    -10       10

With that said, you can add a second replace command to exclude those that are too long before the interview. Note that you will want to add code to handle cases like id 1 where there is no blood test before the interview date. Perhaps

Code:

replace bltval = . if absgap==.
replace bltdt = . if absgap==.

will do what you need.

Last edited by William Lisowski; 21 Jun 2017, 05:43.

Comment

Kerry Clifton

Join Date: May 2017

Posts: 18
#14

22 Jun 2017, 10:32

Hi William,

Thank you so much, that code worked perfectly for me. Thank you for including the descriptions as to what each line does as well.

Kind regards,

Kerry
Comment
Gabriela Toledo

Join Date: Feb 2020

Posts: 1
#15

23 Jul 2020, 10:29

Hi all,

I’ve used the command above to do find the minimum between two dates in my dataset. I’ve noticed every time I run my code, things look slightly different (minimal) and I think it’s because of my dataset and how I am using my unique identifiers. I have a dataset with three unique identifiers: StudyID_MIP (identifies mother-infant pairs), Mom_ID (mother’s ID), and Inf_ID (infant’s ID). As twins and triplets were enrolled in this study, the mother-infant pair ID is used to distinguish between singleton, twin, and triplet births. So for a mother with triplets, the IDs look like:

StudyID_MIP Mom_ID Inf_ID
1230000 1230000 1230001
1230002 1230000 1230002
1230003 1230000 1230003

When using the command above, I sorted by Mom_ID and also sorted by Mom_ID before I dropped any duplicates (to keep the first row with the nearest date). I’ve noticed when browsing the data that a different row (i.e. infant) is retained in the dataset after dropping the duplicates – the only explanation I can think of is that Stata is randomly sorting between Mom_IDs for mothers with more than one infant enrolled in the study, and therefore, a different infant for the same mother is being dropped whenever I force duplicates to drop as it can't distinguish since the mom's ID is the same. I wanted to ask if you could provide some guidance on how to overcome this problem? Ultimately, I would like the same infant (“row”) to be retained whenever I drop duplicates, and ideally, this would be the first enrolled infant (so when StudyID_MIP is the same as Mom_ID), as mother’s responses were most accurate for the first infant. I thought maybe sorting by StudyID_MIP rather than Mom_ID would help solve this issue but I’m not sure. If you could provide some insight on how to use the following command given the unique identifiers in my dataset that would be much appreciated.

Thanks in advance!
Gabriela

Last edited by Gabriela Toledo; 23 Jul 2020, 10:32.
Comment

Announcement

Identifying closest date to another date variable.

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment