Dealing with duplicate time stamps in auction data

Anton Ivanov

Join Date: Sep 2014
Posts: 267

Dealing with duplicate time stamps in auction data

11 Oct 2022, 22:35

Hello!

I am working with a longitudinal panel dataset collected from an online auction. For each observation, there is a time stamp that includes a date together with hour and minute, but no second. This results in a problem as there could be multiple observations in a given minute for the same id. The solution I seek your help implementing implies adding a single consecutive second to each of the duplicate time stamps. For example:

16mar2021 14:40:00 --> 16mar2021 14:40:01
16mar2021 14:40:00 --> 16mar2021 14:40:02
16mar2021 14:40:00 --> 16mar2021 14:40:03

The example of the data for experimentation is provided below. Your feedback is greatly appreciated.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float id double datetime
11291 1.9309035e+12
11291 1930903680000
11291 1930903920000
11291 1930904280000
11291 1.9309059e+12
11291 1.9309068e+12
11291 1930907580000
11291 1930907880000
11291 1930907940000
11291  1.930908e+12
11291 1930908060000
11291 1930908180000
11291 1930915320000
11291 1930921260000
11291 1930989960000
11291 1931013240000
11291 1.9311711e+12
11291 1931271240000
11291 1.9313307e+12
11291 1.9313805e+12
11291 1.9313856e+12
11291 1931424480000
11291 1.9314276e+12
11291 1931432940000
11291 1931512740000
11291 1931512860000
11291 1.9315182e+12
11291 1931521260000
11291 1931521680000
11291 1931521980000
11291 1.9315221e+12
11291 1931522220000
11291 1931522580000
11291 1931522760000
11291 1931522880000
11291 1.9315233e+12
11291 1931524440000
11291 1931524440000
11291 1931524620000
11291 1931524620000
11291 1931524680000
11291 1931524680000
11291 1931524680000
11291 1931524680000
11291 1931524740000
11291 1931524740000
11291 1.9315248e+12
11291 1.9315248e+12
11291 1.9315248e+12
11291 1.9315248e+12
11291 1.9315248e+12
11291 1931524860000
11291 1931524860000
11291 1931524860000
11291 1931524860000
11291 1931524920000
11291 1931524980000
11291 1931525040000
11291 1931527740000
11291 1931537220000
11291 1931975880000
end
format %tc datetime

Tags: None

William Lisowski

Join Date: Dec 2014
Posts: 10150

12 Oct 2022, 12:04

The example data didn't seem to have duplicated time stamps, so used it to create more suitable example data.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float id double datetime
11291 1.9309035e+12
11291 1930903680000
11291 1930903680000
11291 1930903680000
11291 1930903680000
11291 1930903680000
11291 1930903680000
11291 1.9309059e+12
end
format %tc datetime
bysort id datetime: generate double dts = datetime+cond(_N==1,0,1000*_n)
format %tc dts
list, clean

Code:

. list, clean

          id             datetime                  dts  
  1.   11291   09mar2021 10:05:00   09mar2021 10:05:00  
  2.   11291   09mar2021 10:08:00   09mar2021 10:08:01  
  3.   11291   09mar2021 10:08:00   09mar2021 10:08:02  
  4.   11291   09mar2021 10:08:00   09mar2021 10:08:03  
  5.   11291   09mar2021 10:08:00   09mar2021 10:08:04  
  6.   11291   09mar2021 10:08:00   09mar2021 10:08:05  
  7.   11291   09mar2021 10:08:00   09mar2021 10:08:06  
  8.   11291   09mar2021 10:45:00   09mar2021 10:45:00

The key to this is that Stata datetime (clock) values are stored in milliseconds, so adding 1000 to a value adds 1 second.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30354
#3

12 Oct 2022, 12:34

William Lisowski's code is, as always, spot on.

But I am uneasy about doing this in the first place. You are imposing an arbitrary order on the observations that have the same timestamp. If you have some analyses that really depend on the order among these observations, then you will in all probability get wrong results because it is unlikely that the randomized order you are creating will correspond to reality. If you don't have any analyses that actually depend on the order among these observations, what is the reason for imposing an order on them at all? Why not just leave the data alone?
2 likes
Comment
Anton Ivanov

Join Date: Sep 2014

Posts: 267
#4

12 Oct 2022, 13:15

William Lisowski Thank you very much for the solution, worked perfectly.

Clyde Schechter I appreciate your comment, Clyde, as this is a reasonable concern. To this end, the data were collected in an ordered way, so by adding a second, I can preserve the correct sequence of observations.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30354
#5

12 Oct 2022, 14:03

To this end, the data were collected in an ordered way, so by adding a second, I can preserve the correct sequence of observations.

In that case, you need to modify the code slightly. As written, it can randomly shuffle the order of the observations within a group having the same id and time stamp.

Code:

sort id timestamp, stable by id datetime: generate double dts = datetime+cond(_N==1,0,1000*_n) // NOTE by, NOT bysort

This code guarantees that the sort preserves the ordering within id timestamp groups.
2 likes
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

12 Oct 2022, 14:13

Clyde corrects my oversight, but meant to type

Code:

sort id datetime, stable by id datetime: generate double dts = datetime+cond(_N==1,0,1000*_n) // NOTE by, NOT bysort

although I agree with the implication that timestamp would have been a more appropriate variable name than datetime, which is already overloaded with meaning in Stata.
2 likes
Comment
Anton Ivanov

Join Date: Sep 2014

Posts: 267
#7

12 Oct 2022, 14:27

Clyde Schechter William Lisowski Thank you very much for clarifications!
Comment

Announcement

Dealing with duplicate time stamps in auction data

Comment

Comment

Comment

Comment

Comment

Comment