  • Trying to delete date duplicates for each panelID


    Hello everyone,
    I have a panel dataset with vehicle IDs and refueling dates. I am cleaning the data and trying to randomly delete duplicate refueling dates within each vehicle ID. I don't think I can simply use the duplicates command, because I believe it would also drop dates that are duplicated only across different vehicle IDs, which I don't want.

    Here is what I think might work, but any other suggestions will be appreciated:
    set seed 1234
    gen double shuffle1 = runiform()
    gen double shuffle2 = runiform()
    bysort vehicleid (fuelingdate shuffle1 shuffle2): keep if _n==1
    drop shuffle1 shuffle2

    Kindly provide me with some suggestions for this:
    vehicle id   refueling dates
    13           13feb2021
    13           13feb2021
    13           26feb2021
    13           13feb2021
    13           21mar2021

  • #2
    Sushmita:
    you may want to try:
    Code:
    . bysort refueling_dates (vehicle_id): g wanted= runiform()
    . bysort refueling_dates (vehicle_id): egen tool= max(wanted)
    . keep if wanted==tool
    
    . list
    
         +--------------------------------------------+
         | vehicl~d   refueli~s     wanted       tool |
         |--------------------------------------------|
      1. |       13   13feb2021   .3488717   .3488717 |
      2. |       13   21mar2021   .0285569   .0285569 |
      3. |       13   26feb2021   .8689333   .8689333 |
         +--------------------------------------------+
    
    .
    Kind regards,
    Carlo
    (Stata 19.0)

    • #3
      Hello Carlo,
      This did not work the way I wanted because it removed far too many observations. I only need to randomly remove dates that show odd behavior, such as a vehicle refueling three times on the same date; in those cases I just need any one of the refueling records. Could you suggest another way to do this?

      • #4
        Sushmita:
        sorry, but I do not follow you.
        The code I suggested, conditional on your example, keeps one observation per date and deletes the duplicates.
        What are you after?
        Kind regards,
        Carlo
        (Stata 19.0)

        • #5
          I think the code in #2 will select a single observation per refueling date, selecting at random a single vehicle to "represent" the date. That is actually what O.P. said was wanted in #1. But I don't think that was what he meant. I think he wanted to retain a single observation per combination of refueling date and vehicle--that is how I understand his response in #3. If I have that right, it is a minor modification of the code:
          Code:
          bysort refueling_dates vehicle_id: g double shuffle = runiform()
          bysort refueling_dates vehicle_id (shuffle): keep if _n == 1
          drop shuffle
          Note: the number of observations in the data set has not been disclosed in this thread. If it numbers in the tens of millions or more, then the code requires a little more work:
          Code:
          bysort refueling_dates vehicle_id: g double shuffle1 = runiform()
          bysort refueling_dates vehicle_id: g double shuffle2 = runiform()
          bysort refueling_dates vehicle_id (shuffle1 shuffle2): keep if _n == 1
          Added: In order to assure reproducibility of this sampling when the code is run again, the random number generator seed should be set first.
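          As a minimal sketch of how the pieces might fit together, assuming the variable names used above and an arbitrary seed value (1234, borrowed from #1), one could check the duplicates before and after:

          ```stata
          * Sketch only: variable names follow the code above; the seed value is arbitrary
          set seed 1234

          * Report how many surplus copies exist per vehicle/date pair before deleting
          duplicates report vehicle_id refueling_dates

          * Keep one randomly chosen observation per vehicle/date pair
          bysort refueling_dates vehicle_id: gen double shuffle = runiform()
          bysort refueling_dates vehicle_id (shuffle): keep if _n == 1
          drop shuffle

          * Stops with an error if any vehicle/date pair still occurs more than once
          isid vehicle_id refueling_dates
          ```

          Setting the seed before the first call to runiform() is what makes the random draws repeatable; the final isid line is only a sanity check and can be dropped.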
          Last edited by Clyde Schechter; 19 Mar 2023, 12:29.

          • #6
            Yes, Clyde, this is exactly what I wanted, and it worked perfectly. Thank you so much for your response.
