Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating daily average and hourly levels from hourly data

    Hi everyone!

    I'm new to Stata and I would like to have some of your advice!

    I have gathered hourly data of pollution emissions from 7 different monitoring stations in the city of Paris, France. I have 24 observations per day for each polluant, from each monitoring station, one per hour of the day, from 01/10/2009 to 01/10/2019. Time is my independent variable. My data set is like shown below:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double DateTime int(no2_pa13 pm25_aut)
    1.5471648e+12  81 105
    1.5471684e+12  86  99
     1.547172e+12  87  91
    1.5471756e+12  89  86
    1.5471792e+12  87  86
    1.5471828e+12  83  87
    1.5471864e+12  87  89
      1.54719e+12  94  95
    1.5471936e+12 102  98
    1.5471972e+12 114  97
    1.5472008e+12 127 102
    1.5472044e+12 145 114
     1.547208e+12 166 126
    1.5472116e+12 161 136
    1.5472152e+12 167 141
    1.5472188e+12 147 137
    1.5472224e+12 119 131
     1.547226e+12  99 129
    1.5472296e+12 116 145
    1.5472332e+12 128 162
    1.5472368e+12 124 159
    1.5472404e+12 126 143
     1.547244e+12 133 143
    1.5472476e+12 152 161
    1.5472512e+12 131 169
    1.5472548e+12 114 153
    1.5472584e+12 110 140
     1.547262e+12 105 138
    1.5472656e+12  93 146
    1.5472692e+12  88 150
    1.5472728e+12  83 149
    1.5472764e+12  75 137
      1.54728e+12  67 127
    1.5472836e+12  65 126
    1.5472872e+12  78 128
    1.5472908e+12  97 134
    1.5472944e+12 104 129
     1.547298e+12  99 127
    1.5473016e+12  97 126
    1.5473052e+12 104 123
    1.5473088e+12 110 120
    1.5473124e+12 116 113
     1.547316e+12 116 112
    1.5473196e+12 115 110
    1.5473232e+12 117  97
    1.5473268e+12 115  89
    1.5473304e+12 104  78
     1.547334e+12  93  79
    1.5473376e+12  83  92
    1.5473412e+12  75  91
    1.5473448e+12  73  88
    1.5473484e+12  69  73
     1.547352e+12  68  67
    1.5473556e+12  75  66
    1.5473592e+12  83  67
    1.5473628e+12  82  61
    1.5473664e+12  94  59
      1.54737e+12   .  58
    1.5473736e+12   .  59
    1.5473772e+12  72  57
    1.5473808e+12  66  49
    1.5473844e+12  65  45
     1.547388e+12  73  43
    1.5473916e+12  81  39
    1.5473952e+12  76  36
    1.5473988e+12  78  34
    1.5474024e+12  77  31
     1.547406e+12  77  28
    1.5474096e+12  72  27
    1.5474132e+12  56  24
    1.5474168e+12  39  22
    1.5474204e+12  32  19
     1.547424e+12  32  16
    1.5474276e+12  35  12
    1.5474312e+12  40  13
    1.5474348e+12  40  15
    1.5474384e+12  45  17
     1.547442e+12  69  17
    1.5474456e+12  83  25
    1.5474492e+12  95  28
    1.5474528e+12  81  26
    1.5474564e+12  77  26
      1.54746e+12  67  29
    1.5474636e+12  65  31
    1.5474672e+12  63  36
    1.5474708e+12  67  38
    1.5474744e+12  71  40
     1.547478e+12  75  42
    1.5474816e+12  81  33
    1.5474852e+12  83  30
    1.5474888e+12  77  32
    1.5474924e+12  58  31
     1.547496e+12  56  34
    1.5474996e+12  53  36
    1.5475032e+12  49  36
    1.5475068e+12  54  34
    1.5475104e+12  40  28
     1.547514e+12  23  22
    1.5475176e+12  27  19
    1.5475212e+12  30  16
    end
    format %tcMonth_dd,_CCYY_HH:MM:SS DateTime
    The format of the date is %tcMonth_dd,_CCYY_HH:MM:SS.
    For instance, no2 is the name of the pollutant, pa13 is the monitoring station's name.

    From that, I would like to create two new variables. I would like to have average daily pollution levels, for each polluant, across all 7 stations, between 01/10/2009 and 01/10/2019. I would also like to have pollution levels across the hours of the day, for each pollutant, across all 7 stations. How can I create these variables from my data set?

    Thanks for your help!

    Guillaume

  • #2
    As with most data analysis and management in Stata, this is most easily done in long layout.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double DateTime int(no2_pa13 pm25_aut)
    1.5471648e+12  81 105
    1.5471684e+12  86  99
     1.547172e+12  87  91
    1.5471756e+12  89  86
    1.5471792e+12  87  86
    1.5471828e+12  83  87
    1.5471864e+12  87  89
      1.54719e+12  94  95
    1.5471936e+12 102  98
    1.5471972e+12 114  97
    1.5472008e+12 127 102
    1.5472044e+12 145 114
     1.547208e+12 166 126
    1.5472116e+12 161 136
    1.5472152e+12 167 141
    1.5472188e+12 147 137
    1.5472224e+12 119 131
     1.547226e+12  99 129
    1.5472296e+12 116 145
    1.5472332e+12 128 162
    1.5472368e+12 124 159
    1.5472404e+12 126 143
     1.547244e+12 133 143
    1.5472476e+12 152 161
    1.5472512e+12 131 169
    1.5472548e+12 114 153
    1.5472584e+12 110 140
     1.547262e+12 105 138
    1.5472656e+12  93 146
    1.5472692e+12  88 150
    1.5472728e+12  83 149
    1.5472764e+12  75 137
      1.54728e+12  67 127
    1.5472836e+12  65 126
    1.5472872e+12  78 128
    1.5472908e+12  97 134
    1.5472944e+12 104 129
     1.547298e+12  99 127
    1.5473016e+12  97 126
    1.5473052e+12 104 123
    1.5473088e+12 110 120
    1.5473124e+12 116 113
     1.547316e+12 116 112
    1.5473196e+12 115 110
    1.5473232e+12 117  97
    1.5473268e+12 115  89
    1.5473304e+12 104  78
     1.547334e+12  93  79
    1.5473376e+12  83  92
    1.5473412e+12  75  91
    1.5473448e+12  73  88
    1.5473484e+12  69  73
     1.547352e+12  68  67
    1.5473556e+12  75  66
    1.5473592e+12  83  67
    1.5473628e+12  82  61
    1.5473664e+12  94  59
      1.54737e+12   .  58
    1.5473736e+12   .  59
    1.5473772e+12  72  57
    1.5473808e+12  66  49
    1.5473844e+12  65  45
     1.547388e+12  73  43
    1.5473916e+12  81  39
    1.5473952e+12  76  36
    1.5473988e+12  78  34
    1.5474024e+12  77  31
     1.547406e+12  77  28
    1.5474096e+12  72  27
    1.5474132e+12  56  24
    1.5474168e+12  39  22
    1.5474204e+12  32  19
     1.547424e+12  32  16
    1.5474276e+12  35  12
    1.5474312e+12  40  13
    1.5474348e+12  40  15
    1.5474384e+12  45  17
     1.547442e+12  69  17
    1.5474456e+12  83  25
    1.5474492e+12  95  28
    1.5474528e+12  81  26
    1.5474564e+12  77  26
      1.54746e+12  67  29
    1.5474636e+12  65  31
    1.5474672e+12  63  36
    1.5474708e+12  67  38
    1.5474744e+12  71  40
     1.547478e+12  75  42
    1.5474816e+12  81  33
    1.5474852e+12  83  30
    1.5474888e+12  77  32
    1.5474924e+12  58  31
     1.547496e+12  56  34
    1.5474996e+12  53  36
    1.5475032e+12  49  36
    1.5475068e+12  54  34
    1.5475104e+12  40  28
     1.547514e+12  23  22
    1.5475176e+12  27  19
    1.5475212e+12  30  16
    end
    format %tcMonth_dd,_CCYY_HH:MM:SS DateTime
    
    //    GENERATE DATE AND HOUR VARIABLES
    gen Date = dofc(DateTime)
    format Date %td
    
    reshape long no2 pm25, i(DateTime) j(station) string
    
    //    AVERAGES FOR EACH POLLUTANT ACROSS ALL STATIONS
    local pollutants no2 pm25
    foreach p of varlist `pollutants' {
        by Date, sort: egen daily_`p' = mean(`p')
        by DateTime, sort: egen hourly_`p' = mean(`p')
    }
    
    reshape wide
    In fact, it is also likely that whatever you plan to do next will also be easiest in long layout. So unless you have a clear reason to go back to wide, I suggest skipping the final -reshape wide- and working with your data in long throughout.

    It is a bit hard to demonstrate the hourly averages across all sites in your example data since you give only a single monitoring station for each pollutant. I assume your real data has more than that. But the code shown will work with any number of stations (and they don't have to be the same for all the pollutants).

    Comment


    • #3
      I thank you very much, Clyde! Your code works very well. I prefer staying in wide display though. I have a last question. Indeed, I would like to summarize the data I have for each pollutant, across all stations. Is it possible to summarize each pollutant without going to long layout? Again, many thanks!

      Comment


      • #4
        Well I'm not sure what you mean when you refer to summarizing "each pollutant" as you now have many variables regarding each pollutant. The hourly observations at a given station are easy: just use the -summarize- command. With the summaries across all stations, it is a little more complicated because you need to avoid double-counting all the repeated observations within a day. For that you can do:

        Code:
        egen flag = tag(Date)
        summarize daily_* if flag

        Comment

        Working...
        X