Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Collapse/aggregate on hourly level

    Hello everyone,

    I have a dataset reporting diesel price changes for different petrol stations (station_uuid) during one day with the following structure:

    Code:
    clear
    input str36 station_uuid str22 raw_date double formatted_date float diesel
    "2471ee14-8beb-455f-942d-73733d462c01" "2021-01-01 07:49:08+01" 1925106548000 1.329
    "21a92daf-dec2-4448-b128-7f764b234dbc" "2021-01-01 10:27:17+01" 1925116037000 1.189
    "b21f117f-305e-44ee-87cc-7f21fe4f3f58" "2021-01-01 10:42:17+01" 1925116937000 1.319
    "096eb876-1888-4ec0-b64d-8d2369a319eb" "2021-01-01 11:04:14+01" 1925118254000 1.199
    "25d5e86b-fc8b-479a-434e-3e852f2aefe2" "2021-01-01 11:22:16+01" 1925119336000 1.189
    "5e40a39a-f679-480a-aaac-81754a28e003" "2021-01-01 13:17:14+01" 1925126234000 1.169
    "65d95b09-e1cb-454f-b3da-49876ad84561" "2021-01-01 14:05:06+01" 1925129106000 1.249
    "51d4b70e-a095-1aa0-e100-80009459e03a" "2021-01-01 15:53:07+01" 1925135587000 1.179
    "c08111e0-37bb-4c2c-94bb-1e6e1c4bb1f5" "2021-01-01 16:23:07+01" 1925137387000 1.219
    "0c469754-2608-4e58-8ab7-6cd924edd5a5" "2021-01-01 16:27:13+01" 1925137633000 1.199
    "c1b456c8-b782-41d8-a960-466c4088a463" "2021-01-01 16:38:15+01" 1925138295000 1.199
    "ade023de-dad0-40e7-bcc3-0a25e9a85c77" "2021-01-01 16:55:06+01" 1925139306000 1.219
    "b457a782-e3d4-4513-a258-c4d3e199d79a" "2021-01-01 17:24:16+01" 1925141056000 1.239
    "826b3acc-d800-41ba-9ead-76bc0d1dba20" "2021-01-01 18:22:16+01" 1925144536000 1.259
    "812ecef1-650c-4930-a88b-71b9793607e6" "2021-01-01 18:31:07+01" 1925145067000 1.219
    "16941d49-7ec6-45fc-aea9-93901f8f2dff" "2021-01-01 20:17:13+01" 1925151433000 1.179
    "e1a15081-254f-9107-e040-0b0a3dfe563c" "2021-01-01 20:18:14+01" 1925151494000 1.179
    "0cc777e4-13bc-48d4-b161-0f23d56afea6" "2021-01-01 21:12:20+01" 1925154740000 1.109
    "e95bfbba-f829-45f7-ac4b-9bcab2fb48ee" "2021-01-01 21:17:13+01" 1925155033000 1.209
    "51d4b5ee-a095-1aa0-e100-80009459e03a" "2021-01-01 21:42:15+01" 1925156535000 1.169
    end
    format %tcDD_Mon_CCYY_HH:MM:SS formatted_date
    The variable formatted_date has been generated after studying chapter 25 of the user's guide and using the following command:

    Code:
    gen double formatted_date = clock(raw_date, "YMDhms#")
    Since I want an hourly panel, I thought of collapsing the data on an hourly level forming means if there is more than one price change per hour per station. In a second step, I would then fill down/expand the observations with the price of the previous hour for stations that have not reported a price change in given hour. I imagine the second step to look something like this:

    Code:
    tsset station_uuid final_date
    tsfill
    bysort station_uuid: carryforward diesel, gen(diesel_complete)
    My questions are:
    1) How do I collapse/aggregate on an hourly level?
    2) Is the procedure I have in mind correct or is there a better way to do it?

    Thank you so much in advance.

    Best regards,
    Benedikt Franz

  • #2
    A couple of clarification questions:

    1. Is the variable "diesel" the level of price after price changing, or the change in price?

    2. What's the boundary of the hourly panel expansion? For example, in your data, a petrol station changed prices on three dates, say 2021-01-01, 2021-05-21, and 2021-07-15, should the expanded panel only include the three dates (hours 0-23 for each of the three dates), or include every hour between 2021-01-01 and 2021-07-15, or even include every hour in a longer period (say between the earliest and latest possible dates in the data)?

    Comment


    • #3
      Hello Fei,

      thank you very much for your response. You are right, I should have clarified the problem more. To answer your questions:

      1. The variable "diesel" is the level of the "new price". So, whenever one petrol stations changes the price, this appears a new observation in the dataset with the corresponding date/time of change and the new price after the change.

      2. I would like the finished dataset to include every hour between a fixed start and end date (in my case between 2020-12-01 and 2021-01-31). Additionally I was wondering whether it would be possible to impose some working hours e.g. every day from 8am to 10pm.

      I hope that clarifies my question. Thanks again.

      Comment


      • #4
        Thanks for the clarification. Another issue for the second point. Once again assume a petrol station changed prices on the three dates listed in #2, then we would not be able to know the old prices between 2020-12-01 and 2021-01-01. In other words, all observations before the first price change of a petrol station would essentially be redundant. Is that ok?

        Comment


        • #5
          Below is a code to expand your sample, from the first price change to 2021-01-31 11pm for every petrol station. I haven't collapsed yet, but you may easily do that based on the expanded sample (based on "station_uuid" and the updated hourly "formatted_date").

          Code:
          gen date = dofc(formatted_date)
          gen hour = hh(formatted_date)
          gen datehour = date*24 + hour
          
          bys station_uuid (formatted_date): gen exp = cond(_n==_N, td(01-02-2021)*24-datehour, datehour[_n+1]-datehour)
          expand exp
          bys station_uuid (formatted_date): replace hour = cond(hour[_n-1]<23, hour[_n-1]+1, 0) if formatted_date == formatted_date[_n-1]
          bys station_uuid (formatted_date): replace datehour = datehour[_n-1] + 1 if formatted_date == formatted_date[_n-1]
          replace date = (datehour - hour) / 24
          
          replace formatted_date = dhms(date, hour, 0, 0)
          format formatted_date %tc
          
          drop exp date hour datehour
          Last edited by Fei Wang; 30 Nov 2021, 20:14.

          Comment


          • #6
            Thank you so much for your answer. I'm very sorry to say, that my clarification was not completely correct.

            The csv files I import contain all the changes of one day. So, the example data I provided is from the csv file for the 1st of January 2021. I would later then merge the daily dta files to the final dataset.

            Furthermore, the data also contains the prices for two other fuel types (e5 and e10). I did not include them in the first example, since I wanted to keep it simple. When there is only a change in diesel, the observation looks as follows:

            Code:
            input double formatted_date float(diesel e5 e10)
            1925078592000 1.249 0 0
            The 0 would lead to a problem when I mean collapse the data. So, I should replace the 0 with missing ("."), right?

            I imagine the final dataset per day to consist out of one price per hour, station and fuel type. If there was no price reported in any given hour, I would like to use the last reported price or the aggregated price from the previous hour (like shown in the example for e10 and 4am and 5am). So, something like:

            Code:
            input str36 station uuid double formatted_date float (diesel e5 e10)
            "2471ee14-8beb-455f-942d-73733d462c01" "01 Jan 2021 00:00:00" 1.329 1.453 1.502
            "2471ee14-8beb-455f-942d-73733d462c01" "01 Jan 2021 01:00:00" 1.320 1.484 1.567
            "2471ee14-8beb-455f-942d-73733d462c01" "01 Jan 2021 02:00:00" 1.324 1.446 1.595
            "2471ee14-8beb-455f-942d-73733d462c01" "01 Jan 2021 03:00:00" 1.329 1.483 1.595
            "2471ee14-8beb-455f-942d-73733d462c01" "01 Jan 2021 04:00:00" 1.324 1.494 1.595
            "2471ee14-8beb-455f-942d-73733d462c01" "01 Jan 2021 05:00:00" 1.322 1.442 1.534
            "2471ee14-8beb-455f-942d-73733d462c01" "01 Jan 2021 06:00:00" 1.326 1.435 1.578
            "2471ee14-8beb-455f-942d-73733d462c01" "01 Jan 2021 07:00:00" 1.327 1.456 1.565
            "2471ee14-8beb-455f-942d-73733d462c01" "01 Jan 2021 08:00:00" 1.329 1.483 1.556
            "2471ee14-8beb-455f-942d-73733d462c01" "01 Jan 2021 09:00:00" 1.324 1.445 1.586
            I tried to achieve that using your approach but could not really figure it out. Alternatively, I managed to do the following:

            Code:
            gen date = dofc(formatted_date)
            gen hour = hh(formatted_date)
            
            gen double formatted_date = dhms(date, hour, 0, 0)
            format formatted_date %tc
            
            collapse (mean) diesel e5 e10 date, by(station_uuid hour)
            I would then setup the panel data and use e.g. tsfill to fill in the gaps. What do you think?

            I'm very sorry again for not being able to clarify things from the start.
            Last edited by Benedikt Franz; 30 Nov 2021, 22:54.

            Comment


            • #7
              I understood your request, but lacked a proper example data to show how the code in #5 works. Below is a constructed example for three stations (aaa, bbb, ccc) on 01/01/2021. Each line records new prices for diesel, e5 or e10. Three prices don't change at the same time, so unchanged prices are filled in 0. You may find diesel price changes twice within one hour for ccc -- this is going to be collapsed later.

              Code:
              clear
              input str6 station_uuid str22 rawdate float (diesel e5 e10)
              "aaa" "01 Jan 2021 07:21:20" 1.329 0 0
              "aaa" "01 Jan 2021 09:22:20" 0 1.484 0
              "aaa" "01 Jan 2021 14:12:11" 0 0 1.595
              "bbb" "01 Jan 2021 03:01:02" 0 1.483 0
              "bbb" "01 Jan 2021 03:58:03" 1.324 0 0
              "bbb" "01 Jan 2021 17:32:12" 0 0 1.534
              "bbb" "01 Jan 2021 19:12:00" 1.326 0 1.578
              "ccc" "01 Jan 2021 02:12:05" 1.327 0 1.598
              "ccc" "01 Jan 2021 08:10:09" 1.329 0 0
              "ccc" "01 Jan 2021 08:58:10" 1.324 1.532 0
              end
              Code:
                   +----------------------------------------------------------+
                   | statio~d                rawdate   diesel      e5     e10 |
                   |----------------------------------------------------------|
                1. |      aaa   01 Jan 2021 07:21:20    1.329       0       0 |
                2. |      aaa   01 Jan 2021 09:22:20        0   1.484       0 |
                3. |      aaa   01 Jan 2021 14:12:11        0       0   1.595 |
                4. |      bbb   01 Jan 2021 03:01:02        0   1.483       0 |
                5. |      bbb   01 Jan 2021 03:58:03    1.324       0       0 |
                   |----------------------------------------------------------|
                6. |      bbb   01 Jan 2021 17:32:12        0       0   1.534 |
                7. |      bbb   01 Jan 2021 19:12:00    1.326       0   1.578 |
                8. |      ccc   01 Jan 2021 02:12:05    1.327       0   1.598 |
                9. |      ccc   01 Jan 2021 08:10:09    1.329       0       0 |
               10. |      ccc   01 Jan 2021 08:58:10    1.324   1.532       0 |
                   +----------------------------------------------------------+
              My procedure is to fill in the gaps first, and then collapse. The code below is to fill in gaps, and replace missing prices with the last observed price in a previous hour.

              Code:
              gen double time = clock(rawdate, "DMYhms")
              gen date = dofc(time)
              gen hour = hh(time)
              gen datehour = date*24 + hour
              
              foreach var of varlist diesel e5 e10 {
                  bys station_uuid (time): replace `var' = `var'[_n-1] if `var' == 0
              }
              
              bys station_uuid (time): gen exp = cond(_n==_N, td(02-01-2021)*24-datehour, datehour[_n+1]-datehour)
              expand exp
              bys station_uuid (time): replace hour = cond(hour[_n-1]<23, hour[_n-1]+1, 0) if time == time[_n-1]
              bys station_uuid (time): replace datehour = datehour[_n-1] + 1 if time == time[_n-1]
              replace date = (datehour - hour) / 24
              
              replace time = dhms(date, hour, 0, 0)
              format time %tc
              
              drop exp date hour datehour
              After running the code, you would obtain the expanded hourly panel data below. The column "time" is the hourly identifier, from the earliest hour that a price is observed to the last hour of the day. If a station has price records through multiple days, my code above works as well -- just to replace the blue part with a desired ending date. All missing prices are replaced by the last observed price except for the missing prices at the very beginning.

              Code:
                   +--------------------------------------------------------+
                   | statio~d                 time   diesel      e5     e10 |
                   |--------------------------------------------------------|
                1. |      aaa   01jan2021 07:00:00    1.329       .       . |
                2. |      aaa   01jan2021 08:00:00    1.329       .       . |
                3. |      aaa   01jan2021 09:00:00    1.329   1.484       . |
                4. |      aaa   01jan2021 10:00:00    1.329   1.484       . |
                5. |      aaa   01jan2021 11:00:00    1.329   1.484       . |
                   |--------------------------------------------------------|
                6. |      aaa   01jan2021 12:00:00    1.329   1.484       . |
                7. |      aaa   01jan2021 13:00:00    1.329   1.484       . |
                8. |      aaa   01jan2021 14:00:00    1.329   1.484   1.595 |
                9. |      aaa   01jan2021 15:00:00    1.329   1.484   1.595 |
               10. |      aaa   01jan2021 16:00:00    1.329   1.484   1.595 |
                   |--------------------------------------------------------|
               11. |      aaa   01jan2021 17:00:00    1.329   1.484   1.595 |
               12. |      aaa   01jan2021 18:00:00    1.329   1.484   1.595 |
               13. |      aaa   01jan2021 19:00:00    1.329   1.484   1.595 |
               14. |      aaa   01jan2021 20:00:00    1.329   1.484   1.595 |
               15. |      aaa   01jan2021 21:00:00    1.329   1.484   1.595 |
                   |--------------------------------------------------------|
               16. |      aaa   01jan2021 22:00:00    1.329   1.484   1.595 |
               17. |      aaa   01jan2021 23:00:00    1.329   1.484   1.595 |
               18. |      bbb   01jan2021 03:00:00        .   1.483       . |
               19. |      bbb   01jan2021 03:00:00    1.324   1.483       . |
               20. |      bbb   01jan2021 04:00:00    1.324   1.483       . |
                   |--------------------------------------------------------|
               21. |      bbb   01jan2021 05:00:00    1.324   1.483       . |
               22. |      bbb   01jan2021 06:00:00    1.324   1.483       . |
               23. |      bbb   01jan2021 07:00:00    1.324   1.483       . |
               24. |      bbb   01jan2021 08:00:00    1.324   1.483       . |
               25. |      bbb   01jan2021 09:00:00    1.324   1.483       . |
                   |--------------------------------------------------------|
               26. |      bbb   01jan2021 10:00:00    1.324   1.483       . |
               27. |      bbb   01jan2021 11:00:00    1.324   1.483       . |
               28. |      bbb   01jan2021 12:00:00    1.324   1.483       . |
               29. |      bbb   01jan2021 13:00:00    1.324   1.483       . |
               30. |      bbb   01jan2021 14:00:00    1.324   1.483       . |
                   |--------------------------------------------------------|
               31. |      bbb   01jan2021 15:00:00    1.324   1.483       . |
               32. |      bbb   01jan2021 16:00:00    1.324   1.483       . |
               33. |      bbb   01jan2021 17:00:00    1.324   1.483   1.534 |
               34. |      bbb   01jan2021 18:00:00    1.324   1.483   1.534 |
               35. |      bbb   01jan2021 19:00:00    1.326   1.483   1.578 |
                   |--------------------------------------------------------|
               36. |      bbb   01jan2021 20:00:00    1.326   1.483   1.578 |
               37. |      bbb   01jan2021 21:00:00    1.326   1.483   1.578 |
               38. |      bbb   01jan2021 22:00:00    1.326   1.483   1.578 |
               39. |      bbb   01jan2021 23:00:00    1.326   1.483   1.578 |
               40. |      ccc   01jan2021 02:00:00    1.327       .   1.598 |
                   |--------------------------------------------------------|
               41. |      ccc   01jan2021 03:00:00    1.327       .   1.598 |
               42. |      ccc   01jan2021 04:00:00    1.327       .   1.598 |
               43. |      ccc   01jan2021 05:00:00    1.327       .   1.598 |
               44. |      ccc   01jan2021 06:00:00    1.327       .   1.598 |
               45. |      ccc   01jan2021 07:00:00    1.327       .   1.598 |
                   |--------------------------------------------------------|
               46. |      ccc   01jan2021 08:00:00    1.329       .   1.598 |
               47. |      ccc   01jan2021 08:00:00    1.324   1.532   1.598 |
               48. |      ccc   01jan2021 09:00:00    1.324   1.532   1.598 |
               49. |      ccc   01jan2021 10:00:00    1.324   1.532   1.598 |
               50. |      ccc   01jan2021 11:00:00    1.324   1.532   1.598 |
                   |--------------------------------------------------------|
               51. |      ccc   01jan2021 12:00:00    1.324   1.532   1.598 |
               52. |      ccc   01jan2021 13:00:00    1.324   1.532   1.598 |
               53. |      ccc   01jan2021 14:00:00    1.324   1.532   1.598 |
               54. |      ccc   01jan2021 15:00:00    1.324   1.532   1.598 |
               55. |      ccc   01jan2021 16:00:00    1.324   1.532   1.598 |
                   |--------------------------------------------------------|
               56. |      ccc   01jan2021 17:00:00    1.324   1.532   1.598 |
               57. |      ccc   01jan2021 18:00:00    1.324   1.532   1.598 |
               58. |      ccc   01jan2021 19:00:00    1.324   1.532   1.598 |
               59. |      ccc   01jan2021 20:00:00    1.324   1.532   1.598 |
               60. |      ccc   01jan2021 21:00:00    1.324   1.532   1.598 |
                   |--------------------------------------------------------|
               61. |      ccc   01jan2021 22:00:00    1.324   1.532   1.598 |
               62. |      ccc   01jan2021 23:00:00    1.324   1.532   1.598 |
                   +--------------------------------------------------------+
              Then, you may run the following one-line code to collapse the sample, which gives you one hour per line, and original prices within one hour are averaged, like the red part.

              Code:
              collapse (mean) diesel e5 e10, by(station_uuid time)
              
                   +--------------------------------------------------------+
                   | statio~d                 time   diesel      e5     e10 |
                   |--------------------------------------------------------|
                1. |      aaa   01jan2021 07:00:00    1.329       .       . |
                2. |      aaa   01jan2021 08:00:00    1.329       .       . |
                3. |      aaa   01jan2021 09:00:00    1.329   1.484       . |
                4. |      aaa   01jan2021 10:00:00    1.329   1.484       . |
                5. |      aaa   01jan2021 11:00:00    1.329   1.484       . |
                   |--------------------------------------------------------|
                6. |      aaa   01jan2021 12:00:00    1.329   1.484       . |
                7. |      aaa   01jan2021 13:00:00    1.329   1.484       . |
                8. |      aaa   01jan2021 14:00:00    1.329   1.484   1.595 |
                9. |      aaa   01jan2021 15:00:00    1.329   1.484   1.595 |
               10. |      aaa   01jan2021 16:00:00    1.329   1.484   1.595 |
                   |--------------------------------------------------------|
               11. |      aaa   01jan2021 17:00:00    1.329   1.484   1.595 |
               12. |      aaa   01jan2021 18:00:00    1.329   1.484   1.595 |
               13. |      aaa   01jan2021 19:00:00    1.329   1.484   1.595 |
               14. |      aaa   01jan2021 20:00:00    1.329   1.484   1.595 |
               15. |      aaa   01jan2021 21:00:00    1.329   1.484   1.595 |
                   |--------------------------------------------------------|
               16. |      aaa   01jan2021 22:00:00    1.329   1.484   1.595 |
               17. |      aaa   01jan2021 23:00:00    1.329   1.484   1.595 |
               18. |      bbb   01jan2021 03:00:00    1.324   1.483       . |
               19. |      bbb   01jan2021 04:00:00    1.324   1.483       . |
               20. |      bbb   01jan2021 05:00:00    1.324   1.483       . |
                   |--------------------------------------------------------|
               21. |      bbb   01jan2021 06:00:00    1.324   1.483       . |
               22. |      bbb   01jan2021 07:00:00    1.324   1.483       . |
               23. |      bbb   01jan2021 08:00:00    1.324   1.483       . |
               24. |      bbb   01jan2021 09:00:00    1.324   1.483       . |
               25. |      bbb   01jan2021 10:00:00    1.324   1.483       . |
                   |--------------------------------------------------------|
               26. |      bbb   01jan2021 11:00:00    1.324   1.483       . |
               27. |      bbb   01jan2021 12:00:00    1.324   1.483       . |
               28. |      bbb   01jan2021 13:00:00    1.324   1.483       . |
               29. |      bbb   01jan2021 14:00:00    1.324   1.483       . |
               30. |      bbb   01jan2021 15:00:00    1.324   1.483       . |
                   |--------------------------------------------------------|
               31. |      bbb   01jan2021 16:00:00    1.324   1.483       . |
               32. |      bbb   01jan2021 17:00:00    1.324   1.483   1.534 |
               33. |      bbb   01jan2021 18:00:00    1.324   1.483   1.534 |
               34. |      bbb   01jan2021 19:00:00    1.326   1.483   1.578 |
               35. |      bbb   01jan2021 20:00:00    1.326   1.483   1.578 |
                   |--------------------------------------------------------|
               36. |      bbb   01jan2021 21:00:00    1.326   1.483   1.578 |
               37. |      bbb   01jan2021 22:00:00    1.326   1.483   1.578 |
               38. |      bbb   01jan2021 23:00:00    1.326   1.483   1.578 |
               39. |      ccc   01jan2021 02:00:00    1.327       .   1.598 |
               40. |      ccc   01jan2021 03:00:00    1.327       .   1.598 |
                   |--------------------------------------------------------|
               41. |      ccc   01jan2021 04:00:00    1.327       .   1.598 |
               42. |      ccc   01jan2021 05:00:00    1.327       .   1.598 |
               43. |      ccc   01jan2021 06:00:00    1.327       .   1.598 |
               44. |      ccc   01jan2021 07:00:00    1.327       .   1.598 |
               45. |      ccc   01jan2021 08:00:00   1.3265   1.532   1.598 |
                   |--------------------------------------------------------|
               46. |      ccc   01jan2021 09:00:00    1.324   1.532   1.598 |
               47. |      ccc   01jan2021 10:00:00    1.324   1.532   1.598 |
               48. |      ccc   01jan2021 11:00:00    1.324   1.532   1.598 |
               49. |      ccc   01jan2021 12:00:00    1.324   1.532   1.598 |
               50. |      ccc   01jan2021 13:00:00    1.324   1.532   1.598 |
                   |--------------------------------------------------------|
               51. |      ccc   01jan2021 14:00:00    1.324   1.532   1.598 |
               52. |      ccc   01jan2021 15:00:00    1.324   1.532   1.598 |
               53. |      ccc   01jan2021 16:00:00    1.324   1.532   1.598 |
               54. |      ccc   01jan2021 17:00:00    1.324   1.532   1.598 |
               55. |      ccc   01jan2021 18:00:00    1.324   1.532   1.598 |
                   |--------------------------------------------------------|
               56. |      ccc   01jan2021 19:00:00    1.324   1.532   1.598 |
               57. |      ccc   01jan2021 20:00:00    1.324   1.532   1.598 |
               58. |      ccc   01jan2021 21:00:00    1.324   1.532   1.598 |
               59. |      ccc   01jan2021 22:00:00    1.324   1.532   1.598 |
               60. |      ccc   01jan2021 23:00:00    1.324   1.532   1.598 |
                   +--------------------------------------------------------+
              A couple of notes:

              1. I fill in gaps before collapsing, because collapsing replaces original prices with means, and you won't be able to fill in missing prices with previous observed prices. For example, after collapsing, line 45 records 1.3265, the average of 1.329 and 1.324. We would not be able to replace consequent missing prices with 1.324 as it is wiped out by collapsing.

              2. I didn't use -tsfill- to fill the gaps for a couple of reasons. First, because I fill in gaps before collapsing, the sample is not a real hourly panel (hours may duplicate) and -tsfill- cannot be used. Second, -tsfill-, even with option -full-, fills gaps based on the earliest and latest hours in the sample (for my example data, it would be from 2AM to 7PM). My code can set an arbitrary ending hour if necessary.

              Comment


              • #8
                Wow, thank you so much for your incredible help and detailed explanation. This was exactly what I needed. I'm super grateful!

                Comment

                Working...
                X