Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dealing with duplicate time stamps in auction data

    Hello!

    I am working with a longitudinal panel dataset collected from an online auction. For each observation, there is a time stamp that includes a date together with hour and minute, but no second. This results in a problem as there could be multiple observations in a given minute for the same id. The solution I seek your help implementing implies adding a single consecutive second to each of the duplicate time stamps. For example:

    16mar2021 14:40:00 --> 16mar2021 14:40:01
    16mar2021 14:40:00 --> 16mar2021 14:40:02
    16mar2021 14:40:00 --> 16mar2021 14:40:03

    The example of the data for experimentation is provided below. Your feedback is greatly appreciated.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float id double datetime
    11291 1.9309035e+12
    11291 1930903680000
    11291 1930903920000
    11291 1930904280000
    11291 1.9309059e+12
    11291 1.9309068e+12
    11291 1930907580000
    11291 1930907880000
    11291 1930907940000
    11291  1.930908e+12
    11291 1930908060000
    11291 1930908180000
    11291 1930915320000
    11291 1930921260000
    11291 1930989960000
    11291 1931013240000
    11291 1.9311711e+12
    11291 1931271240000
    11291 1.9313307e+12
    11291 1.9313805e+12
    11291 1.9313856e+12
    11291 1931424480000
    11291 1.9314276e+12
    11291 1931432940000
    11291 1931512740000
    11291 1931512860000
    11291 1.9315182e+12
    11291 1931521260000
    11291 1931521680000
    11291 1931521980000
    11291 1.9315221e+12
    11291 1931522220000
    11291 1931522580000
    11291 1931522760000
    11291 1931522880000
    11291 1.9315233e+12
    11291 1931524440000
    11291 1931524440000
    11291 1931524620000
    11291 1931524620000
    11291 1931524680000
    11291 1931524680000
    11291 1931524680000
    11291 1931524680000
    11291 1931524740000
    11291 1931524740000
    11291 1.9315248e+12
    11291 1.9315248e+12
    11291 1.9315248e+12
    11291 1.9315248e+12
    11291 1.9315248e+12
    11291 1931524860000
    11291 1931524860000
    11291 1931524860000
    11291 1931524860000
    11291 1931524920000
    11291 1931524980000
    11291 1931525040000
    11291 1931527740000
    11291 1931537220000
    11291 1931975880000
    end
    format %tc datetime

  • #2
    The example data didn't seem to have duplicated time stamps, so used it to create more suitable example data.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float id double datetime
    11291 1.9309035e+12
    11291 1930903680000
    11291 1930903680000
    11291 1930903680000
    11291 1930903680000
    11291 1930903680000
    11291 1930903680000
    11291 1.9309059e+12
    end
    format %tc datetime
    bysort id datetime: generate double dts = datetime+cond(_N==1,0,1000*_n)
    format %tc dts
    list, clean
    Code:
    . list, clean
    
              id             datetime                  dts  
      1.   11291   09mar2021 10:05:00   09mar2021 10:05:00  
      2.   11291   09mar2021 10:08:00   09mar2021 10:08:01  
      3.   11291   09mar2021 10:08:00   09mar2021 10:08:02  
      4.   11291   09mar2021 10:08:00   09mar2021 10:08:03  
      5.   11291   09mar2021 10:08:00   09mar2021 10:08:04  
      6.   11291   09mar2021 10:08:00   09mar2021 10:08:05  
      7.   11291   09mar2021 10:08:00   09mar2021 10:08:06  
      8.   11291   09mar2021 10:45:00   09mar2021 10:45:00
    The key to this is that Stata datetime (clock) values are stored in milliseconds, so adding 1000 to a value adds 1 second.

    Comment


    • #3
      William Lisowski's code is, as always, spot on.

      But I am uneasy about doing this in the first place. You are imposing an arbitrary order on the observations that have the same timestamp. If you have some analyses that really depend on the order among these observations, then you will in all probability get wrong results because it is unlikely that the randomized order you are creating will correspond to reality. If you don't have any analyses that actually depend on the order among these observations, what is the reason for imposing an order on them at all? Why not just leave the data alone?

      Comment


      • #4
        William Lisowski Thank you very much for the solution, worked perfectly.

        Clyde Schechter I appreciate your comment, Clyde, as this is a reasonable concern. To this end, the data were collected in an ordered way, so by adding a second, I can preserve the correct sequence of observations.

        Comment


        • #5
          To this end, the data were collected in an ordered way, so by adding a second, I can preserve the correct sequence of observations.
          In that case, you need to modify the code slightly. As written, it can randomly shuffle the order of the observations within a group having the same id and time stamp.

          Code:
          sort id timestamp, stable
          by id datetime: generate double dts = datetime+cond(_N==1,0,1000*_n) // NOTE by, NOT bysort
          This code guarantees that the sort preserves the ordering within id timestamp groups.

          Comment


          • #6
            Clyde corrects my oversight, but meant to type
            Code:
            sort id datetime, stable
            by id datetime: generate double dts = datetime+cond(_N==1,0,1000*_n) // NOTE by, NOT bysort
            although I agree with the implication that timestamp would have been a more appropriate variable name than datetime, which is already overloaded with meaning in Stata.

            Comment


            • #7
              Clyde Schechter William Lisowski Thank you very much for clarifications!

              Comment

              Working...
              X