Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • complex sort & identify closest date to each target date

    In the dataset below there are 5 variables: studyid (individual person identifier), offencedate (date an offence was committed), startdate (date service provided) and mode (type of service). modeorder is created by the code below

    For each individual I want to find the startdate closest to each offence. Ideally I would do this for each mode and all modes combined.

    The code below sorts the data but the mode order only lists the closest service date to the last offence, not to each offence for each studyid.

    The output I am hoping for would be by studyid. For studyid =1, there are 7 offence dates, so for all modes combined there would be 7 rows, service closest to 16/06/2011 would be 3/10/11 In person, while for the last offence on 8/11/2011 In person

    Any assistance is appreciated.


    Code:
    by studyid (offencedate startdate mode), sort: generate modeorder= _n
    Data


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte studyid float(offencedate startdate) str9 mode float modeorder
    1 18794 18652 "In Person"  1
    1 18794 18653 "In Person"  2
    1 18794 18696 "In Person"  3
    1 19075 18859 "Telephone"  4
    1 19086 18911 "In Person"  5
    1 19086 18939 "In Person"  6
    1 19086 18939 "In Person"  7
    1 19086 18941 "Other"      8
    1 19093 18933 "Other"      9
    1 19093 18934 "In Person" 10
    1 19130 18884 "In Person" 11
    1 19130 18911 "In Person" 12
    1 19130 19012 "Telephone" 13
    1 19130 19044 "In Person" 14
    1 19172 18859 "Telephone" 15
    1 19172 18878 "In Person" 16
    1 19172 18884 "In Person" 17
    1 19172 18911 "In Person" 18
    1 19172 18911 "In Person" 19
    1 19172 18911 "In Person" 20
    1 19172 18925 "In Person" 21
    1 19172 18933 "Other"     22
    1 19172 18934 "In Person" 23
    1 19177 18939 "In Person" 24
    1 19177 18939 "In Person" 25
    2 19086 18925 "In Person"  1
    2 19086 18933 "Other"      2
    2 19086 18934 "In Person"  3
    2 19086 18948 "Other"      4
    2 19086 18953 "Other"      5
    2 19093 18911 "In Person"  6
    2 19093 19044 "In Person"  7
    2 19130 18859 "Telephone"  8
    2 19130 18878 "In Person"  9
    2 19177 18925 "In Person" 10
    2 19177 18933 "Other"     11
    2 19177 18934 "In Person" 12
    3 19093 18925 "In Person"  1
    3 19093 19012 "Telephone"  2
    3 19130 18911 "In Person"  3
    3 19130 18911 "In Person"  4
    3 19130 18955 "In Person"  5
    3 19130 18976 "In Person"  6
    3 19172 18939 "In Person"  7
    3 19177 18911 "In Person"  8
    3 19177 18911 "In Person"  9
    end
    format %td offencedate
    format %td startdate
    Last edited by Bob Green; 22 Sep 2018, 19:34.

  • #2
    This will work with your example data (with the variable modeorder removed):
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte studyid float(offencedate startdate) str9 mode
    1 18794 18652 "In Person"
    1 18794 18653 "In Person"
    1 18794 18696 "In Person"
    1 19075 18859 "Telephone"
    1 19086 18911 "In Person"
    1 19086 18939 "In Person"
    1 19086 18939 "In Person"
    1 19086 18941 "Other"    
    1 19093 18933 "Other"    
    1 19093 18934 "In Person"
    1 19130 18884 "In Person"
    1 19130 18911 "In Person"
    1 19130 19012 "Telephone"
    1 19130 19044 "In Person"
    1 19172 18859 "Telephone"
    1 19172 18878 "In Person"
    1 19172 18884 "In Person"
    1 19172 18911 "In Person"
    1 19172 18911 "In Person"
    1 19172 18911 "In Person"
    1 19172 18925 "In Person"
    1 19172 18933 "Other"    
    1 19172 18934 "In Person"
    1 19177 18939 "In Person"
    1 19177 18939 "In Person"
    2 19086 18925 "In Person"
    2 19086 18933 "Other"    
    2 19086 18934 "In Person"
    2 19086 18948 "Other"    
    2 19086 18953 "Other"    
    2 19093 18911 "In Person"
    2 19093 19044 "In Person"
    2 19130 18859 "Telephone"
    2 19130 18878 "In Person"
    2 19177 18925 "In Person"
    2 19177 18933 "Other"    
    2 19177 18934 "In Person"
    3 19093 18925 "In Person"
    3 19093 19012 "Telephone"
    3 19130 18911 "In Person"
    3 19130 18911 "In Person"
    3 19130 18955 "In Person"
    3 19130 18976 "In Person"
    3 19172 18939 "In Person"
    3 19177 18911 "In Person"
    3 19177 18911 "In Person"
    end
    format %td offencedate
    format %td startdate
    
    //    SEPARATE THE DATA INTO AN OFFENCE DATA SET
    //    AND A SERVICE DATA SET
    
    preserve
    keep studyid offencedate
    tempfile offences
    save `offences'
    
    restore 
    keep studyid startdate mode
    tempfile services
    save `services'
    
    //    PAIR EACH OFFENSE WITH EACH SERVICE WITHIN STUDYID
    use `offences', clear
    joinby studyid using `services'
    
    //    CALCULATE TIME BETWEEN EACH OFFENSE AND EACH SERVICE
    //    AND THEN KEEP THE CLOSEST ONE
    by studyid offencedate, sort: gen delta = abs(startdate - offencedate)
    by studyid offencedate (delta startdate), sort: keep if _n == 1
    rename startdate closest_service_date_any_mode
    rename mode mode_closest_service
    drop delta
    
    //    DO IT AGAIN, THIS TIME SEPARATING EACH SERVICE TYPE
    joinby studyid using `services'
    by studyid offencedate, sort: gen delta = abs(startdate - offencedate)
    by studyid offencedate mode (delta startdate), sort: keep if _n == 1
    rename startdate nearest_date_
    
    //    NOW GO WIDE ON MODE OF SERVICE
    replace mode = strtoname(lower(mode))
    drop delta
    reshape wide nearest_date, i(studyid offencedate) j(mode) string
    Note: The criterion "nearest date" is actually ill-defined. It is possible to have two service dates that are equally near the offense date. One could be 3 days before and the other 3 days after. Or they could both be the same date. You do not specify how you want to handle ties like this. In the code above, if there is one before and one after, the one before is chosen. If there are two services on the same date before an offence, and nothing is closer then them before or after, then the tie is broken randomly and irreproducibly. If this handling of ties is not satisfactory, the code needs to be modified to fit your additional specifications.

    Comment


    • #3
      Clyde,

      Many thanks. In relation to your questions: Nearest date should, as you assumed be the date before not after. In the case of ties, it doesn't matter which is selected. I'll run separate analyses by mode. Ties will matter when I want to count the number of services.

      In relation to running the code, here is my output - the datafile is blank except for the variable names. I tried running it from the data file then form the dataex output in your post. I must be making some basic error

      Thanks again
      .

      Code:
       //    SEPARATE THE DATA INTO AN OFFENCE DATA SET
      . //    AND A SERVICE DATA SET
      . 
      . preserve
      
      . keep studyid offencedate
      
      . tempfile offences
      
      . save `offences'
      file C:\Users\Bob\AppData\Local\Temp\ST_07000002.tmp saved
      
      . 
      . restore 
      
      . keep studyid startdate mode
      
      . tempfile services
      
      . save `services'
      file C:\Users\Bob\AppData\Local\Temp\ST_07000003.tmp saved
      
      . 
      . //    PAIR EACH OFFENSE WITH EACH SERVICE WITHIN STUDYID
      . use `offences', clear
      
      . joinby studyid using `services'
      
      . 
      . //    CALCULATE TIME BETWEEN EACH OFFENSE AND EACH SERVICE
      . //    AND THEN KEEP THE CLOSEST ONE
      . by studyid offencedate, sort: gen delta = abs(startdate - offencedate)
      
      . by studyid offencedate (delta startdate), sort: keep if _n == 1
      (835 observations deleted)
      
      . rename startdate closest_service_date_any_mode
      
      . rename mode mode_closest_service
      
      . drop delta
      
      . 
      . //    DO IT AGAIN, THIS TIME SEPARATING EACH SERVICE TYPE
      . joinby studyid using `services'
      
      . by studyid offencedate, sort: gen delta = abs(startdate - offencedate)
      
      . by studyid offencedate mode (delta startdate), sort: keep if _n == 1
      (218 observations deleted)
      
      . rename startdate nearest_date_
      
      . 
      . //    NOW GO WIDE ON MODE OF SERVICE
      . replace mode = strtoname(lower(mode))
      (41 real changes made)
      
      . drop delta
      
      . reshape wide nearest_date, i(studyid offencedate) j(mode) string
      (note: j = in_person other telephone)
      
      Data                               long   ->   wide
      -----------------------------------------------------------------------------
      Number of obs.                       41   ->      15
      Number of variables                   6   ->       7
      j variable (3 values)              mode   ->   (dropped)
      xij variables:
                                nearest_date_   ->   nearest_date_in_person nearest_date_other nearest_date_telephone
      -----------------------------------------------------------------------------
      
      . 
      end of do-file


      Comment


      • #4
        Clyde,

        For some reason when I tried to run the code again, it ran without problem. On the second attempt it also ran on real data. A mystery.

        The main issue I noted on the cases I checked is that if there is a date after the offencedate that is closer or there is no prior startdate, the post offence date is entered.

        I can identify instances of this by subtracting all dates from the offence date and seeing how often this happens. If it is infrequent, I'll probably just manually edit them.


        Thanks again.

        Bob
        Last edited by Bob Green; 23 Sep 2018, 06:20.

        Comment


        • #5
          The main issue I noted on the cases I checked is that if there is a date after the offencedate that is closer or there is no prior startdate, the post offence date is entered.
          Why is that an issue? You said you wanted the closest date. If the closest date is after rather than a before date, then, yes, the closest date is entered: that's what you said you wanted.

          Don't manually edit data in real research; you need an audit trail of what you did, both for your own understanding if you return to this after several months, and for authors who want to know why they should believe your results. While you can, in a sense, make an audit trail of manual editing by doing it with a log file open, the resulting log, full of statements like -replace x = 47 in 3- is completely incomprehensible. You will not be able to remind yourself or explain to others what those changes were and why they were made.

          If you only want dates prior to the offence, then the code can be changed to do that:
          Code:
          // SEPARATE THE DATA INTO AN OFFENCE DATA SET
          // AND A SERVICE DATA SET
          
          preserve
          keep studyid offencedate
          tempfile offences
          save `offences'
          
          restore
          keep studyid startdate mode
          tempfile services
          save `services'
          
          // PAIR EACH OFFENSE WITH EACH SERVICE WITHIN STUDYID
          use `offences', clear
          joinby studyid using `services'
          
          // CALCULATE TIME BETWEEN EACH OFFENSE AND EACH SERVICE
          // AND THEN KEEP THE CLOSEST ONE
          by studyid offencedate, sort: gen delta = offencedate - startdate
          drop if delta < 0
          by studyid offencedate (delta startdate), sort: keep if _n == 1
          rename startdate closest_service_date_any_mode
          rename mode mode_closest_service
          drop delta
          
          // DO IT AGAIN, THIS TIME SEPARATING EACH SERVICE TYPE
          joinby studyid using `services'
          by studyid offencedate, sort: gen delta = offencedate - startdate
          drop if delta < 0
          by studyid offencedate mode (delta startdate), sort: keep if _n == 1
          rename startdate nearest_date_
          
          // NOW GO WIDE ON MODE OF SERVICE
          replace mode = strtoname(lower(mode))
          drop delta
          reshape wide nearest_date, i(studyid offencedate) j(mode) string
          
          list, noobs clean abbrev(16)
          Changes in bold face.

          Comment


          • #6
            Thanks Clyde. I'll study the code and the output. Going through it is likely to take a little time.

            It is always difficult to know how much information to provide. I'm interested in the 12 months prior to an offence and the potential for preventative action. There are some events of interest that happen afterwards but services provided isn't one of them. Closing the case would be.

            Knowing the offence date and date closest to an offence, can this code be modified to get a count of services provided or is this a completely different approach required.

            Comment


            • #7
              Bob, I'm getting frustrated. You keep changing the problem specifications.

              So before I invest any more time in further modifying the code to now accommodate having only dates in the 12 months prior to the offence, except for closing the case, and to get a count of all services, I'm going to ask you to give some serious consideration to exactly what you want and post back with that. And make sure you are clear about what constitutes a service; I can't guess whether case closure is a service or not. Provide a complete and unambiguous specification of exactly what you want. Also, repost a new data example that will exercise the code well. For example, if you want to look at case closures, especially under a different time frame, be sure that there are some case closures in the example you submit.

              Comment


              • #8
                Clyde,

                Apologies. My intention wasn't too frustrate but provide an example of a scenario where something may have been of interest after an offence. Forget about case closure it is another variable and has nothing to do with the data I posted up or my question. The only services are those covered by mode. I'll continue checking the output which so far looks perfect. The count would be of each mode for 12 months before and up to the date of each offence. same data, but if it is too much trouble of frustrating, no need to persist.
                Last edited by Bob Green; 23 Sep 2018, 20:35.

                Comment


                • #9
                  The frustration was not with the problem being posed, it was with the changing of the terms of the problem.

                  Anyway, the following code will give you the count of each type of service occurring between 1 year before the offence date and the offence date itself. If you don't want to include the actual offence date (i.e. you want to stop the day before) change the -gen latest- command to read
                  Code:
                  gen latest = offencedate - 1
                  I decided to interpret your "12 months before" to mean starting with the same calendar date in the preceding calendar year. Thus, 12 months before, say, March 10, 20XX will be March 10 of 20XX-1, regardless of whether 20XX is a leap year. (The alternative is to use -gen earliest = offencedate - 365- so as to assure that the period is always of the same length.)

                  To use this code you will need the -rangejoin- command, written by Robert Picard and available from SSC. To run -rangejoin- you will also need the -rangestat- command, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC. Both of these commands are extremely handy to have and you will find many other uses for them.

                  By the way, you can, if you wish, apply this code to the output of the previous code that selected closest service dates. It would look like this:

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input byte studyid float(offencedate startdate) str9 mode
                  1 18794 18652 "In Person"
                  1 18794 18653 "In Person"
                  1 18794 18696 "In Person"
                  1 19075 18859 "Telephone"
                  1 19086 18911 "In Person"
                  1 19086 18939 "In Person"
                  1 19086 18939 "In Person"
                  1 19086 18941 "Other"    
                  1 19093 18933 "Other"    
                  1 19093 18934 "In Person"
                  1 19130 18884 "In Person"
                  1 19130 18911 "In Person"
                  1 19130 19012 "Telephone"
                  1 19130 19044 "In Person"
                  1 19172 18859 "Telephone"
                  1 19172 18878 "In Person"
                  1 19172 18884 "In Person"
                  1 19172 18911 "In Person"
                  1 19172 18911 "In Person"
                  1 19172 18911 "In Person"
                  1 19172 18925 "In Person"
                  1 19172 18933 "Other"    
                  1 19172 18934 "In Person"
                  1 19177 18939 "In Person"
                  1 19177 18939 "In Person"
                  2 19086 18925 "In Person"
                  2 19086 18933 "Other"    
                  2 19086 18934 "In Person"
                  2 19086 18948 "Other"    
                  2 19086 18953 "Other"    
                  2 19093 18911 "In Person"
                  2 19093 19044 "In Person"
                  2 19130 18859 "Telephone"
                  2 19130 18878 "In Person"
                  2 19177 18925 "In Person"
                  2 19177 18933 "Other"    
                  2 19177 18934 "In Person"
                  3 19093 18925 "In Person"
                  3 19093 19012 "Telephone"
                  3 19130 18911 "In Person"
                  3 19130 18911 "In Person"
                  3 19130 18955 "In Person"
                  3 19130 18976 "In Person"
                  3 19172 18939 "In Person"
                  3 19177 18911 "In Person"
                  3 19177 18911 "In Person"
                  end
                  format %td offencedate
                  format %td startdate
                  
                  // SEPARATE THE DATA INTO AN OFFENCE DATA SET
                  // AND A SERVICE DATA SET
                  
                  preserve
                  keep studyid offencedate
                  tempfile offences
                  save `offences'
                  
                  restore
                  keep studyid startdate mode
                  tempfile services
                  save `services'
                  
                  // PAIR EACH OFFENSE WITH EACH SERVICE WITHIN STUDYID
                  use `offences', clear
                  joinby studyid using `services'
                  
                  // CALCULATE TIME BETWEEN EACH OFFENSE AND EACH SERVICE
                  // AND THEN KEEP THE CLOSEST ONE
                  by studyid offencedate, sort: gen delta = offencedate - startdate
                  drop if delta < 0
                  by studyid offencedate (delta startdate), sort: keep if _n == 1
                  rename startdate closest_service_date_any_mode
                  rename mode mode_closest_service
                  drop delta
                  
                  // DO IT AGAIN, THIS TIME SEPARATING EACH SERVICE TYPE
                  joinby studyid using `services'
                  by studyid offencedate, sort: gen delta = offencedate - startdate
                  drop if delta < 0
                  by studyid offencedate mode (delta startdate), sort: keep if _n == 1
                  rename startdate nearest_date_
                  
                  // NOW GO WIDE ON MODE OF SERVICE
                  replace mode = strtoname(lower(mode))
                  drop delta
                  reshape wide nearest_date, i(studyid offencedate) j(mode) string
                  
                  
                  //    CALCULATE COUNT OF SERVICES CARRIED OUT WITHIN 12 MONTHS BEORE AND
                  //    UP TO DATE OF EACH OFFENSE.
                  gen earliest  = mdy(month(offencedate), day(offencedate), year(offencedate)-1)
                  gen latest = offencedate
                  rangejoin startdate earliest latest using `services', by(studyid)
                  by studyid offencedate mode, sort: gen _count = _N
                  by studyid offencedate mode: keep if _n == 1
                  replace mode = strtoname(lower(mode))
                  keep studyid nearest_* closest_* offencedate mode _count
                  reshape wide @_count, i(studyid offencedate) j(mode) string
                  mvencode *_count, mv(0)

                  Comment


                  • #10
                    Clyde,

                    Once I dropped cases with missing values and some entire blank rows at the end of the data, the code ran fine . I'll look into the uses of rangestat and rangejoin. Thanks again.


                    Bob
                    Last edited by Bob Green; 24 Sep 2018, 03:47.

                    Comment

                    Working...
                    X