Trying to delete date duplicates for each panelID

Sushmita Joshi

Join Date: Apr 2021

Posts: 5
#1

18 Mar 2023, 09:47

Hello everyone,
I have a panel dataset with vehicle IDs and refueling dates. I am cleaning this data and want to randomly delete duplicate refueling dates within each vehicle ID. I don't think I can simply use the duplicates command, because I believe it would also drop observations where other vehicle IDs share the same refueling date, which I don't want.

Here is what I think might work, but any other suggestions would be appreciated:
set seed 1234
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()
bysort vehicleid (fuelingdate shuffle1 shuffle2): keep if _n==1
drop shuffle1 shuffle2

Kindly provide me with some suggestions for this:
vehicle id   refueling dates
13           13feb2021
13           13feb2021
13           26feb2021
13           13feb2021
13           21mar2021

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17854
#2

18 Mar 2023, 10:12

Sushmita:
you may want to try:

Code:

. bysort refueling_dates (vehicle_id): g wanted= runiform()
. bysort refueling_dates (vehicle_id): egen tool= max(wanted)
. keep if wanted==tool

. list

     +--------------------------------------------+
     | vehicl~d   refueli~s     wanted       tool |
     |--------------------------------------------|
  1. |       13   13feb2021   .3488717   .3488717 |
  2. |       13   21mar2021   .0285569   .0285569 |
  3. |       13   26feb2021   .8689333   .8689333 |
     +--------------------------------------------+

.

Kind regards,
Carlo
(Stata 19.0)

Sushmita Joshi

Join Date: Apr 2021

Posts: 5
#3

19 Mar 2023, 05:51

Hello Carlo,
This did not work the way I wanted because it removed far too many observations. I only need to remove, at random, records that show odd behavior, such as a vehicle refueling three times on the same date. In such situations I need to keep just one of those refueling records. Could you suggest another way to do this?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17854
#4

19 Mar 2023, 11:26

Sushmita:
sorry, but I do not follow you.
The code I suggested, conditional on your example, keeps one observation per date, deleting duplicates.
What are you after?

Kind regards,
Carlo
(Stata 19.0)
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#5

19 Mar 2023, 12:25

I think the code in #2 will select a single observation per refueling date, selecting at random a single vehicle to "represent" the date. That is actually what O.P. said was wanted in #1. But I don't think that was what he meant. I think he wanted to retain a single observation per combination of refueling date and vehicle--that is how I understand his response in #3. If I have that right, it is a minor modification of the code:

Code:

bysort refueling_dates vehicle_id: g double shuffle = runiform()
bysort refueling_dates vehicle_id (shuffle): keep if _n == 1
drop shuffle

Note: the number of observations in the data set has not been disclosed in this thread. If it numbers in the tens of millions or more, then the code requires a little more work:

Code:

bysort refueling_dates vehicle_id: g double shuffle1 = runiform()
bysort refueling_dates vehicle_id: g double shuffle2 = runiform()
bysort refueling_dates vehicle_id (shuffle1 shuffle2): keep if _n == 1

Added: In order to assure reproducibility of this sampling when the code is run again, the random number generator seed should be set first.
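
A minimal sketch of the seeding point, run on toy data. The rows for vehicle 13 are the example from #1; vehicle 14 is a hypothetical addition to show that the same date is kept separately for each vehicle. The seed value 1234 is borrowed from #1 (any fixed integer would do), and obs_no is an illustrative helper variable that records which original row survives. Note that the draws also depend on the order of the data at the moment runiform() is called, so this assumes the data arrive in the same order on every run.

```stata
* Toy data: vehicle 13 rows are from #1; vehicle 14 is hypothetical
clear
input vehicle_id str9 datestr
13 "13feb2021"
13 "13feb2021"
13 "26feb2021"
13 "13feb2021"
13 "21mar2021"
14 "13feb2021"
14 "13feb2021"
end

* Convert the text dates to Stata daily dates
gen refueling_dates = date(datestr, "DMY")
format refueling_dates %td
drop datestr

* Illustrative helper: remember each row's original position
gen long obs_no = _n

* Fix the seed BEFORE drawing random numbers so reruns give the same draws
set seed 1234
gen double shuffle = runiform()

* Within each date-vehicle pair, keep the row with the smallest draw
bysort refueling_dates vehicle_id (shuffle): keep if _n == 1
drop shuffle

sort vehicle_id refueling_dates
list, sepby(vehicle_id)
```

With this toy data, four observations should remain: three dates for vehicle 13 and one date for vehicle 14.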

Last edited by Clyde Schechter; 19 Mar 2023, 12:29.
Sushmita Joshi

Join Date: Apr 2021

Posts: 5
#6

21 Mar 2023, 01:40

Yes, Clyde, this is exactly what I wanted, it worked perfectly. Thank you so much for your response