Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting number of events within a specific time range.

    Hi StataList,

    I have another question I was hoping you friendly folks can help me with.

    I have a dataset that includes variable id, day, and time. Each id represents an individual event that occurred on a specific day at a specific time. Many events occur every hour of each day. Day is coded as a number 1 through 12. Time is coded as a number 0.00 through 23.00.

    I want to create a variable that contains the number of events that occurred over the next 1 hour period after the time of the current event.

    So for example, id = 1 starts on day = 1 and the first event is at time = 17.25. I would like to create a new variable 1hr which would equal the number of events that occur between 17.25 and 18.25. Then if id = 2 starts on day = 1 and time = 17.30, 1hr would equal the number of events from 17.30 and 18.30.

    I hope this is clear. I really appreciate your help.

    Best,
    Rich

  • #2
    Without regular spacing of times, it's messy and slow.

    I'd advise that you set up a proper date-time variable first of all. Dealing with separate day and time variables would be a bad idea.

    Then it's a loop over observations: http://www.stata-journal.com/sjpdf.h...iclenum=pr0033

    Comment


    • #3
      Here's one approach to these types of problems. There's no need to loop over observations, you instead order observations by date and time (within id) and loop over lags/leads until there are no more cases that satisfy the condition. Something like

      Code:
      * create fake data
      clear
      input id str20 (day time)
      1 18aug2015 "10:00am"
      1 18aug2015 "10:01am"
      1 18aug2015 "10:11am"
      1 18aug2015 "11:02am"
      1 18aug2015 "11:22am"
      1 18aug2015 "11:59am"
      1 18aug2015 "1:30pm"
      2 18aug2015 "10:00am"
      2 18aug2015 "10:01am"
      end
      
      * convert day and time to numeric form
      gen double eventtime = clock(day + " " + time, "DMYhm")
      format %tc eventtime
      
      * for each id, order events by increasing date and time
      sort id eventtime
      
      * loop over increasing forward leads and add to the count if the
      * time difference is within 1 hour. Stop when there are no more cases
      * within 1 hour for any observation
      gen mycount = 0
      local more = 1
      local i = 0
      while `more' {
          local i = `i' + 1
          by id: gen doit = (eventtime[_n + `i'] - eventtime) <= msofhours(1)
          replace mycount = mycount + doit
          count if doit
          local more = r(N)
          drop doit
      }
      
      list, sepby(id) noobs
      Last edited by Robert Picard; 18 Aug 2015, 09:37.

      Comment


      • #4
        Thanks once again for your help Nick.
        I have little experience using dates in stata but took your advice to use a properly formatted date instead of the clunky two variable system I had set up.
        The article you provided was also very helpful.
        My solution ended up being:

        gen n60 = .
        local N = _N
        forval i = 1/`N'{
        count if inrange(date_obs[`i']-date_obs,mdyhms(1,1,1960,0,0,0),mdyhms(1,1,1960,1, 0,0))
        replace n60 = r(N) in `i'
        }

        Comment


        • #5
          Robert Picard has what looks likely to be a faster approach in general.

          In respect of your code: you are evaluating two constants again and again for every observation, so this would be faster in future:

          Code:
           
          gen n60 = .
          local N = _N
          quietly forval i = 1/`N' { 
               count if inrange(date_obs[`i']-date_obs, 0, 36e5)
               replace n60 = r(N) in `i'
          }
          Your code pools events regardless of identifier. I take it that is what you want.

          Comment

          Working...
          X