complex sort & identify closest date to each target date

Bob Green

Join Date: Oct 2015
Posts: 85

complex sort & identify closest date to each target date

22 Sep 2018, 19:32

In the dataset below there are 5 variables: studyid (individual person identifier), offencedate (date an offence was committed), startdate (date service provided) and mode (type of service). modeorder is created by the code below

For each individual I want to find the startdate closest to each offence. Ideally I would do this for each mode and all modes combined.

The code below sorts the data but the mode order only lists the closest service date to the last offence, not to each offence for each studyid.

The output I am hoping for would be by studyid. For studyid =1, there are 7 offence dates, so for all modes combined there would be 7 rows, service closest to 16/06/2011 would be 3/10/11 In person, while for the last offence on 8/11/2011 In person

Any assistance is appreciated.

Code:

by studyid (offencedate startdate mode), sort: generate modeorder= _n

Data

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte studyid float(offencedate startdate) str9 mode float modeorder
1 18794 18652 "In Person"  1
1 18794 18653 "In Person"  2
1 18794 18696 "In Person"  3
1 19075 18859 "Telephone"  4
1 19086 18911 "In Person"  5
1 19086 18939 "In Person"  6
1 19086 18939 "In Person"  7
1 19086 18941 "Other"      8
1 19093 18933 "Other"      9
1 19093 18934 "In Person" 10
1 19130 18884 "In Person" 11
1 19130 18911 "In Person" 12
1 19130 19012 "Telephone" 13
1 19130 19044 "In Person" 14
1 19172 18859 "Telephone" 15
1 19172 18878 "In Person" 16
1 19172 18884 "In Person" 17
1 19172 18911 "In Person" 18
1 19172 18911 "In Person" 19
1 19172 18911 "In Person" 20
1 19172 18925 "In Person" 21
1 19172 18933 "Other"     22
1 19172 18934 "In Person" 23
1 19177 18939 "In Person" 24
1 19177 18939 "In Person" 25
2 19086 18925 "In Person"  1
2 19086 18933 "Other"      2
2 19086 18934 "In Person"  3
2 19086 18948 "Other"      4
2 19086 18953 "Other"      5
2 19093 18911 "In Person"  6
2 19093 19044 "In Person"  7
2 19130 18859 "Telephone"  8
2 19130 18878 "In Person"  9
2 19177 18925 "In Person" 10
2 19177 18933 "Other"     11
2 19177 18934 "In Person" 12
3 19093 18925 "In Person"  1
3 19093 19012 "Telephone"  2
3 19130 18911 "In Person"  3
3 19130 18911 "In Person"  4
3 19130 18955 "In Person"  5
3 19130 18976 "In Person"  6
3 19172 18939 "In Person"  7
3 19177 18911 "In Person"  8
3 19177 18911 "In Person"  9
end
format %td offencedate
format %td startdate

Last edited by Bob Green; 22 Sep 2018, 19:34.

Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30147

22 Sep 2018, 23:10

This will work with your example data (with the variable modeorder removed):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte studyid float(offencedate startdate) str9 mode
1 18794 18652 "In Person"
1 18794 18653 "In Person"
1 18794 18696 "In Person"
1 19075 18859 "Telephone"
1 19086 18911 "In Person"
1 19086 18939 "In Person"
1 19086 18939 "In Person"
1 19086 18941 "Other"    
1 19093 18933 "Other"    
1 19093 18934 "In Person"
1 19130 18884 "In Person"
1 19130 18911 "In Person"
1 19130 19012 "Telephone"
1 19130 19044 "In Person"
1 19172 18859 "Telephone"
1 19172 18878 "In Person"
1 19172 18884 "In Person"
1 19172 18911 "In Person"
1 19172 18911 "In Person"
1 19172 18911 "In Person"
1 19172 18925 "In Person"
1 19172 18933 "Other"    
1 19172 18934 "In Person"
1 19177 18939 "In Person"
1 19177 18939 "In Person"
2 19086 18925 "In Person"
2 19086 18933 "Other"    
2 19086 18934 "In Person"
2 19086 18948 "Other"    
2 19086 18953 "Other"    
2 19093 18911 "In Person"
2 19093 19044 "In Person"
2 19130 18859 "Telephone"
2 19130 18878 "In Person"
2 19177 18925 "In Person"
2 19177 18933 "Other"    
2 19177 18934 "In Person"
3 19093 18925 "In Person"
3 19093 19012 "Telephone"
3 19130 18911 "In Person"
3 19130 18911 "In Person"
3 19130 18955 "In Person"
3 19130 18976 "In Person"
3 19172 18939 "In Person"
3 19177 18911 "In Person"
3 19177 18911 "In Person"
end
format %td offencedate
format %td startdate

//    SEPARATE THE DATA INTO AN OFFENCE DATA SET
//    AND A SERVICE DATA SET

preserve
keep studyid offencedate
tempfile offences
save `offences'

restore 
keep studyid startdate mode
tempfile services
save `services'

//    PAIR EACH OFFENSE WITH EACH SERVICE WITHIN STUDYID
use `offences', clear
joinby studyid using `services'

//    CALCULATE TIME BETWEEN EACH OFFENSE AND EACH SERVICE
//    AND THEN KEEP THE CLOSEST ONE
by studyid offencedate, sort: gen delta = abs(startdate - offencedate)
by studyid offencedate (delta startdate), sort: keep if _n == 1
rename startdate closest_service_date_any_mode
rename mode mode_closest_service
drop delta

//    DO IT AGAIN, THIS TIME SEPARATING EACH SERVICE TYPE
joinby studyid using `services'
by studyid offencedate, sort: gen delta = abs(startdate - offencedate)
by studyid offencedate mode (delta startdate), sort: keep if _n == 1
rename startdate nearest_date_

//    NOW GO WIDE ON MODE OF SERVICE
replace mode = strtoname(lower(mode))
drop delta
reshape wide nearest_date, i(studyid offencedate) j(mode) string

Note: The criterion "nearest date" is actually ill-defined. It is possible to have two service dates that are equally near the offense date. One could be 3 days before and the other 3 days after. Or they could both be the same date. You do not specify how you want to handle ties like this. In the code above, if there is one before and one after, the one before is chosen. If there are two services on the same date before an offence, and nothing is closer then them before or after, then the tie is broken randomly and irreproducibly. If this handling of ties is not satisfactory, the code needs to be modified to fit your additional specifications.

Comment

Bob Green

Join Date: Oct 2015
Posts: 85

23 Sep 2018, 01:44

Clyde,

Many thanks. In relation to your questions: Nearest date should, as you assumed be the date before not after. In the case of ties, it doesn't matter which is selected. I'll run separate analyses by mode. Ties will matter when I want to count the number of services.

In relation to running the code, here is my output - the datafile is blank except for the variable names. I tried running it from the data file then form the dataex output in your post. I must be making some basic error

Thanks again
.

Code:

 //    SEPARATE THE DATA INTO AN OFFENCE DATA SET
. //    AND A SERVICE DATA SET
. 
. preserve

. keep studyid offencedate

. tempfile offences

. save `offences'
file C:\Users\Bob\AppData\Local\Temp\ST_07000002.tmp saved

. 
. restore 

. keep studyid startdate mode

. tempfile services

. save `services'
file C:\Users\Bob\AppData\Local\Temp\ST_07000003.tmp saved

. 
. //    PAIR EACH OFFENSE WITH EACH SERVICE WITHIN STUDYID
. use `offences', clear

. joinby studyid using `services'

. 
. //    CALCULATE TIME BETWEEN EACH OFFENSE AND EACH SERVICE
. //    AND THEN KEEP THE CLOSEST ONE
. by studyid offencedate, sort: gen delta = abs(startdate - offencedate)

. by studyid offencedate (delta startdate), sort: keep if _n == 1
(835 observations deleted)

. rename startdate closest_service_date_any_mode

. rename mode mode_closest_service

. drop delta

. 
. //    DO IT AGAIN, THIS TIME SEPARATING EACH SERVICE TYPE
. joinby studyid using `services'

. by studyid offencedate, sort: gen delta = abs(startdate - offencedate)

. by studyid offencedate mode (delta startdate), sort: keep if _n == 1
(218 observations deleted)

. rename startdate nearest_date_

. 
. //    NOW GO WIDE ON MODE OF SERVICE
. replace mode = strtoname(lower(mode))
(41 real changes made)

. drop delta

. reshape wide nearest_date, i(studyid offencedate) j(mode) string
(note: j = in_person other telephone)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                       41   ->      15
Number of variables                   6   ->       7
j variable (3 values)              mode   ->   (dropped)
xij variables:
                          nearest_date_   ->   nearest_date_in_person nearest_date_other nearest_date_telephone
-----------------------------------------------------------------------------

. 
end of do-file

Comment

Bob Green

Join Date: Oct 2015

Posts: 85
#4

23 Sep 2018, 05:28

Clyde,

For some reason when I tried to run the code again, it ran without problem. On the second attempt it also ran on real data. A mystery.

The main issue I noted on the cases I checked is that if there is a date after the offencedate that is closer or there is no prior startdate, the post offence date is entered.

I can identify instances of this by subtracting all dates from the offence date and seeing how often this happens. If it is infrequent, I'll probably just manually edit them.

Thanks again.

Bob

Last edited by Bob Green; 23 Sep 2018, 06:20.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30147

23 Sep 2018, 13:02

The main issue I noted on the cases I checked is that if there is a date after the offencedate that is closer or there is no prior startdate, the post offence date is entered.

Why is that an issue? You said you wanted the closest date. If the closest date is after rather than a before date, then, yes, the closest date is entered: that's what you said you wanted.

Don't manually edit data in real research; you need an audit trail of what you did, both for your own understanding if you return to this after several months, and for authors who want to know why they should believe your results. While you can, in a sense, make an audit trail of manual editing by doing it with a log file open, the resulting log, full of statements like -replace x = 47 in 3- is completely incomprehensible. You will not be able to remind yourself or explain to others what those changes were and why they were made.

If you only want dates prior to the offence, then the code can be changed to do that:

Code:

// SEPARATE THE DATA INTO AN OFFENCE DATA SET
// AND A SERVICE DATA SET

preserve
keep studyid offencedate
tempfile offences
save `offences'

restore
keep studyid startdate mode
tempfile services
save `services'

// PAIR EACH OFFENSE WITH EACH SERVICE WITHIN STUDYID
use `offences', clear
joinby studyid using `services'

// CALCULATE TIME BETWEEN EACH OFFENSE AND EACH SERVICE
// AND THEN KEEP THE CLOSEST ONE
by studyid offencedate, sort: gen delta = offencedate - startdate
drop if delta < 0
by studyid offencedate (delta startdate), sort: keep if _n == 1
rename startdate closest_service_date_any_mode
rename mode mode_closest_service
drop delta

// DO IT AGAIN, THIS TIME SEPARATING EACH SERVICE TYPE
joinby studyid using `services'
by studyid offencedate, sort: gen delta = offencedate - startdate
drop if delta < 0
by studyid offencedate mode (delta startdate), sort: keep if _n == 1
rename startdate nearest_date_

// NOW GO WIDE ON MODE OF SERVICE
replace mode = strtoname(lower(mode))
drop delta
reshape wide nearest_date, i(studyid offencedate) j(mode) string

list, noobs clean abbrev(16)

Changes in bold face.

Comment

Bob Green

Join Date: Oct 2015

Posts: 85
#6

23 Sep 2018, 14:25

Thanks Clyde. I'll study the code and the output. Going through it is likely to take a little time.

It is always difficult to know how much information to provide. I'm interested in the 12 months prior to an offence and the potential for preventative action. There are some events of interest that happen afterwards but services provided isn't one of them. Closing the case would be.

Knowing the offence date and date closest to an offence, can this code be modified to get a count of services provided or is this a completely different approach required.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#7

23 Sep 2018, 19:16

Bob, I'm getting frustrated. You keep changing the problem specifications.

So before I invest any more time in further modifying the code to now accommodate having only dates in the 12 months prior to the offence, except for closing the case, and to get a count of all services, I'm going to ask you to give some serious consideration to exactly what you want and post back with that. And make sure you are clear about what constitutes a service; I can't guess whether case closure is a service or not. Provide a complete and unambiguous specification of exactly what you want. Also, repost a new data example that will exercise the code well. For example, if you want to look at case closures, especially under a different time frame, be sure that there are some case closures in the example you submit.
Comment
Bob Green

Join Date: Oct 2015

Posts: 85
#8

23 Sep 2018, 20:32

Clyde,

Apologies. My intention wasn't too frustrate but provide an example of a scenario where something may have been of interest after an offence. Forget about case closure it is another variable and has nothing to do with the data I posted up or my question. The only services are those covered by mode. I'll continue checking the output which so far looks perfect. The count would be of each mode for 12 months before and up to the date of each offence. same data, but if it is too much trouble of frustrating, no need to persist.

Last edited by Bob Green; 23 Sep 2018, 20:35.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30147

23 Sep 2018, 22:39

The frustration was not with the problem being posed, it was with the changing of the terms of the problem.

Anyway, the following code will give you the count of each type of service occurring between 1 year before the offence date and the offence date itself. If you don't want to include the actual offence date (i.e. you want to stop the day before) change the -gen latest- command to read

Code:

gen latest = offencedate - 1

I decided to interpret your "12 months before" to mean starting with the same calendar date in the preceding calendar year. Thus, 12 months before, say, March 10, 20XX will be March 10 of 20XX-1, regardless of whether 20XX is a leap year. (The alternative is to use -gen earliest = offencedate - 365- so as to assure that the period is always of the same length.)

To use this code you will need the -rangejoin- command, written by Robert Picard and available from SSC. To run -rangejoin- you will also need the -rangestat- command, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC. Both of these commands are extremely handy to have and you will find many other uses for them.

By the way, you can, if you wish, apply this code to the output of the previous code that selected closest service dates. It would look like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte studyid float(offencedate startdate) str9 mode
1 18794 18652 "In Person"
1 18794 18653 "In Person"
1 18794 18696 "In Person"
1 19075 18859 "Telephone"
1 19086 18911 "In Person"
1 19086 18939 "In Person"
1 19086 18939 "In Person"
1 19086 18941 "Other"    
1 19093 18933 "Other"    
1 19093 18934 "In Person"
1 19130 18884 "In Person"
1 19130 18911 "In Person"
1 19130 19012 "Telephone"
1 19130 19044 "In Person"
1 19172 18859 "Telephone"
1 19172 18878 "In Person"
1 19172 18884 "In Person"
1 19172 18911 "In Person"
1 19172 18911 "In Person"
1 19172 18911 "In Person"
1 19172 18925 "In Person"
1 19172 18933 "Other"    
1 19172 18934 "In Person"
1 19177 18939 "In Person"
1 19177 18939 "In Person"
2 19086 18925 "In Person"
2 19086 18933 "Other"    
2 19086 18934 "In Person"
2 19086 18948 "Other"    
2 19086 18953 "Other"    
2 19093 18911 "In Person"
2 19093 19044 "In Person"
2 19130 18859 "Telephone"
2 19130 18878 "In Person"
2 19177 18925 "In Person"
2 19177 18933 "Other"    
2 19177 18934 "In Person"
3 19093 18925 "In Person"
3 19093 19012 "Telephone"
3 19130 18911 "In Person"
3 19130 18911 "In Person"
3 19130 18955 "In Person"
3 19130 18976 "In Person"
3 19172 18939 "In Person"
3 19177 18911 "In Person"
3 19177 18911 "In Person"
end
format %td offencedate
format %td startdate

// SEPARATE THE DATA INTO AN OFFENCE DATA SET
// AND A SERVICE DATA SET

preserve
keep studyid offencedate
tempfile offences
save `offences'

restore
keep studyid startdate mode
tempfile services
save `services'

// PAIR EACH OFFENSE WITH EACH SERVICE WITHIN STUDYID
use `offences', clear
joinby studyid using `services'

// CALCULATE TIME BETWEEN EACH OFFENSE AND EACH SERVICE
// AND THEN KEEP THE CLOSEST ONE
by studyid offencedate, sort: gen delta = offencedate - startdate
drop if delta < 0
by studyid offencedate (delta startdate), sort: keep if _n == 1
rename startdate closest_service_date_any_mode
rename mode mode_closest_service
drop delta

// DO IT AGAIN, THIS TIME SEPARATING EACH SERVICE TYPE
joinby studyid using `services'
by studyid offencedate, sort: gen delta = offencedate - startdate
drop if delta < 0
by studyid offencedate mode (delta startdate), sort: keep if _n == 1
rename startdate nearest_date_

// NOW GO WIDE ON MODE OF SERVICE
replace mode = strtoname(lower(mode))
drop delta
reshape wide nearest_date, i(studyid offencedate) j(mode) string


//    CALCULATE COUNT OF SERVICES CARRIED OUT WITHIN 12 MONTHS BEORE AND
//    UP TO DATE OF EACH OFFENSE.
gen earliest  = mdy(month(offencedate), day(offencedate), year(offencedate)-1)
gen latest = offencedate
rangejoin startdate earliest latest using `services', by(studyid)
by studyid offencedate mode, sort: gen _count = _N
by studyid offencedate mode: keep if _n == 1
replace mode = strtoname(lower(mode))
keep studyid nearest_* closest_* offencedate mode _count
reshape wide @_count, i(studyid offencedate) j(mode) string
mvencode *_count, mv(0)

Comment

Bob Green

Join Date: Oct 2015

Posts: 85
#10

24 Sep 2018, 02:48

Clyde,

Once I dropped cases with missing values and some entire blank rows at the end of the data, the code ran fine . I'll look into the uses of rangestat and rangejoin. Thanks again.

Bob

Last edited by Bob Green; 24 Sep 2018, 03:47.
Comment

Announcement

complex sort & identify closest date to each target date

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment