Counting observations within a particular time-interval

Jens Grebnezar

Join Date: May 2019
Posts: 3

Counting observations within a particular time-interval

28 May 2019, 05:49

Dear Stata-enthousiasts,

I'm currently doing some research on paid parking in 3 parking garages
The stata-version I am using is 14.1.
The data runs from 1 januari 2014 up till 31 december 2016. I have rougly 4.5 million observations.
I would like to create a dataset that shows how many cars are parked on each hour of the day (for each day of the dataset), by garage.
Using stata is relatively new to me, but after some long hours of trying myself, I still have some problems creating this dataset.

-The problem-
I want to count the number of cars parked within a one hour interval. However, the cars are often parked for a longer time period than one hour, so they should be counted for multiple hours of the day. When running my code (see below), everything seems to be like I want it to be, however the number of cars at a given hour regularly exceeds the maximum capacity of the garage (for 'centraal' it is 1050).
Is there something wrong with my code that causes the program to count 1 car multiple times in 1 hour, or does my code not make sense at all?

-The data-

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 garage_nm double(start_parking end_parking) float parkingtime_hours
"Centraal" 1704153965000 1704158160000 1.1652777
"Centraal" 1704155760000 1704162842000 1.9672222
"Centraal" 1704155829000 1704171787000  4.432778
"Centraal" 1704156183000 1704169860000  3.799167
"Centraal" 1704156427000 1704175203000  5.215556
"Centraal" 1704156605000 1704171420000  4.115278
"Centraal" 1704156780000 1704168784000 3.3344445
"Centraal" 1704156967000 1704185347000  7.883333
"Centraal" 1704157020000 1704166507000  2.635278
"Centraal" 1704157082000 1704175622000      5.15
"Centraal" 1704157140000 1704161589000 1.2358333
"Centraal" 1704157205000 1704171661000 4.0155554
"Centraal" 1704157261000 1704176469000  5.335556
"Centraal" 1704157326000 1704160861000  .9819444
"Centraal" 1704157441000 1704171240000 3.8330555
"Centraal" 1704157509000 1704162969000 1.5166667
"Centraal" 1704157565000 1704169623000 3.3494444
"Centraal" 1704157684000 1704167528000 2.7344444
"Centraal" 1704157684000 1704167522000  2.732778
"Centraal" 1704157807000 1704170887000 3.6333334
end
format %tc start_parking
format %tc end_parking

I've done the following myself
-the code-

Code:

gen id=_n
expand parkingtime_hours, gen(copies)
bysort id: gen hr = _n - 1
gen double datetime = start_parking + hr*60*60*1000
format datetime %tc
gen date = dofc(datetime)
gen hrs = hh(datetime)
gen cars = 1
collapse (sum) cars, by(garage_nm date hrs)
format date %td

After these steps, the data looks like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 garage_nm float(date hrs) double cars
"Centraal" 19724  0   22
"Centraal" 19724  1   70
"Centraal" 19724  2   90
"Centraal" 19724  3   70
"Centraal" 19724  4   46
"Centraal" 19724  5   30
"Centraal" 19724  6   48
"Centraal" 19724  7   70
"Centraal" 19724  8   78
"Centraal" 19724  9   80
"Centraal" 19724 10   84
"Centraal" 19724 11   88
"Centraal" 19724 12  102
"Centraal" 19724 13   94
"Centraal" 19724 14  108
"Centraal" 19724 15  118
"Centraal" 19724 16  146
"Centraal" 19724 17  186
"Centraal" 19724 18  188
"Centraal" 19724 19  182
"Centraal" 19724 20  162
"Centraal" 19724 21  142
"Centraal" 19724 22  124
"Centraal" 19724 23   96
"Centraal" 19725  0   64
"Centraal" 19725  1   48
"Centraal" 19725  2   46
"Centraal" 19725  3   46
"Centraal" 19725  4   58
"Centraal" 19725  5   90
"Centraal" 19725  6  146
"Centraal" 19725  7  266
"Centraal" 19725  8  438
"Centraal" 19725  9  578
"Centraal" 19725 10  720
"Centraal" 19725 11  898
"Centraal" 19725 12 1004
"Centraal" 19725 13 1084
"Centraal" 19725 14 1224
"Centraal" 19725 15 1112
"Centraal" 19725 16  810
"Centraal" 19725 17  598
"Centraal" 19725 18  602
"Centraal" 19725 19  536
"Centraal" 19725 20  418
"Centraal" 19725 21  300
"Centraal" 19725 22  212
"Centraal" 19725 23  154
"Centraal" 19726  0  112
"Centraal" 19726  1   96
end
format %td date

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#2

28 May 2019, 13:25

Well, your question seems ill-posed. The number of cars in the garage changes minute by minute, or even second by second. So it is not meaningful to speak of the number of cars in the garage in a given hour, as this will typically vary within the hour.

The following code will create a data set that, for each garage and each hour, gives you the minimum number of cars in the garage that hour, the maximum number, the mean number, the number at the beginning of the hour, and the number at the end of the hour.

Code:

// GO TO LONG LAYOUT drop parkingtime_hours gen long obs_no = _n reshape long @_parking, i(obs_no) j(event) string rename _parking time // CALCULATE NUMBER OF CARS IN GARAGE AT EACH POINT IN TIME by garage_nm (time), sort: gen cars = sum((event == "start" ) - (event == "end")) // AGGREGATE TIME TO HOURLY "CHUNKS" gen double hour = msofhours(1)*floor(time/msofhours(1)) format hour %tc collapse (min) min_cars = cars (max) max_cars = cars (mean) mean_cars = cars /// (first) cars_at_beginning_of_hour = cars (last) cars_at_end_of_hour = cars, /// by(garage_nm hour) // DEAL WITH HOURS WHERE THERE ARE NO ENTRIES OR EXITS AS THESE DO NOT APPEAR by garage_nm (hour): gen expander = cond(hour[_n+1] != hour + msofhours(1) & _n < _N, 2, 1) expand expander sort garage_nm hour foreach v of varlist *cars* { by garage_nm hour: replace `v' = cars_at_end_of_hour[1] if _n == 2 } by garage_nm hour, sort: replace hour = hour + msofhours(1) if _n == 2

Be aware that in a data set the size you speak of, this is going to take a long time to run. The -reshape- command will be slow, as will the -collapse-. So make sure you have something to keep yourself occupied while this is crunching. (And, while this works fine in your example data, I suggest testing it out on a larger sample, but still very small compared to your entire data set, first in case there are some data problems I did not anticipate.

Thank you for using -dataex- on your very first post.

Added: This code assumes that each garage is empty prior to its chronologically first observation. If that is not true, all of the results will be too small, missing the number of cars that were in the garage before its data began.

Last edited by Clyde Schechter; 28 May 2019, 13:42.
Comment
Jens Grebnezar

Join Date: May 2019

Posts: 3
#3

29 May 2019, 07:20

Dear Clyde Schechter,

I can't thank you enough. My question was indeed ill-posed, but somehow you were able to read my mind. After running your code on just the observations of the first month, I received exactly the type of dataset I wanted. Even though I am very exited by the results, I do have some questions about the code. I could follow what you did in the first three parts (go to long layout, calculate # of cars and aggregate time), however I don't completely understand wat the last part of the code did to my data:

Code:

by garage_nm (hour): gen expander = cond(hour[_n+1] != hour + msofhours(1) & _n < _N, 2, 1) expand expander sort garage_nm hour foreach v of varlist *cars* { by garage_nm hour: replace `v' = cars_at_end_of_hour[1] if _n == 2 } by garage_nm hour, sort: replace hour = hour + msofhours(1) if _n == 2

If you have time, would you like to explain what you did with that part?
Again, thanks for your time and help,

Jens
Comment
Jens Grebnezar

Join Date: May 2019

Posts: 3
#4

29 May 2019, 07:34

After looking more closely to my newly created dataset, I did find a little problem. For some hours, especially in the night around 2/3/4 p.m., I sometimes do not have data (as there were no entries or exits I assume). Is there a way to fix that, by for instance changing a bit in the last part of your code?

Jens
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30147

29 May 2019, 12:17

Your questions in #3 and #4 are actually related. That last part was intended to deal with the question you raise in #4. It created a new observation to cover the next hour if there was a gap with no entries or exits. However, I now realize that it was not quite correct, because it presumed that the gap was only one hour (which was the case in the example data, but, understandably, might not be in the real data.) I believe the following amended code will do it correctly:

Code:

//    GO TO LONG LAYOUT
drop parkingtime_hours
gen long obs_no = _n
reshape long @_parking, i(obs_no) j(event) string
rename _parking time

//    CALCULATE NUMBER OF CARS IN GARAGE AT EACH POINT IN TIME
by garage_nm (time), sort: gen cars = sum((event == "start" ) - (event == "end"))

//    AGGREGATE TIME TO HOURLY "CHUNKS"
gen double hour = msofhours(1)*floor(time/msofhours(1))
format hour %tc
collapse (min) min_cars = cars (max) max_cars = cars (mean) mean_cars = cars ///
    (first) cars_at_beginning_of_hour = cars (last) cars_at_end_of_hour = cars, ///
    by(garage_nm hour)
    
//    DEAL WITH HOURS WHERE THERE ARE NO ENTRIES OR EXITS AS THESE DO NOT APPEAR
by garage_nm (hour): gen expander = (hour[_n+1] - hour)/msofhours(1)
by garage_nm (hour): replace expander = 1 if _n == _N
expand expander
sort garage_nm hour
foreach v of varlist *cars* {
    by garage_nm hour: replace `v' = cars_at_end_of_hour[1] if _n > 1
}
by garage_nm hour, sort: replace hour = hour[_n-1] + msofhours(1) if _n > 1

Announcement

Counting observations within a particular time-interval

Comment

Comment

Comment

Comment