  • #16
    Robert may have the makings of a Stata Journal article from a compilation of these recent posts in this and the parallel thread, if he is so inclined.

    In any event, he provides a neat counterexample to the aphorism (common in the USA, at least) that "everybody talks about the weather but nobody does anything about it". Or at least in this case, "does anything with it".


    • #17
      Dear Robert,
      1. Thank you so much for this comprehensive illustration of how to obtain annual rainfall data (average precipitation per year) in the USA.
      2. A related question: can we obtain the average precipitation per year for each state in the U.S.?
      3. It seems to me that, at the moment, there is no way to calculate the average precipitation per year for each country around the world. Suppose I can find the area of each country; can I divide the total annual rainfall data in #13 (collapse (sum) p, by(_ID year)) by that area to get the average precipitation per year for each country?
      Ho-Chuan (River) Huang
      Stata 17.0, MP(4)


      • #18
        Please read #15 again carefully. You can calculate the average rainfall for any geographical unit you want, provided that you weight the value of each grid point by the area around it. The grid points nearest the poles cover only 13.49 square kilometers, while the ones next to the equator cover 3091.05 square kilometers (see the results of sum grid_area in #15).

        Since most of the work is already done, here's a recap applied to finding the average precipitation per year for each country.

        Code:
        * datasets are from #10, #13, and #17
        use "world_grid.dta", clear
        merge 1:1 lat lon using "grid_area.dta", keep(match) nogen
        
        merge 1:m lat lon using "precip_annual_grid_1900_to_2014.dta", ///
            keep(match) nogen
        
        * use the grid area to weight the precip data
        gen p_area = p * grid_area
        
        collapse (count) pN=p (sum) p_area grid_area, by(year _ID)
        
        * mean precip in millimeters
        gen p_mean = p_area / grid_area
        
        merge m:1 _ID using "world_data.dta", ///
            keepusing(NAME ISO3) assert(match using) keep(match) nogen
        rename (NAME ISO3) (country country3)
        isid country year, sort


        • #19
          Dear Robert,
          1. Thank you so much. It works just fine.
          2. By the way, is it possible (and how) to construct annual rainfall data for each state in the U.S.?
          Ho-Chuan (River) Huang
          Stata 17.0, MP(4)


          • #20
            Yup, it's exactly the same procedure, since each state has a different _ID.
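
            A hypothetical sketch of that procedure (the file names us_state_grid.dta and us_state_data.dta are illustrative; you would build them from a Census states shapefile with shp2dta and geoinpoly, just as world_grid.dta and world_data.dta were built for countries):

            Code:
            * hypothetical sketch: same recipe as #18, with grid points matched
            * to a US states shapefile instead of country borders
            use "us_state_grid.dta", clear
            merge 1:1 lat lon using "grid_area.dta", keep(match) nogen
            merge 1:m lat lon using "precip_annual_grid_1900_to_2014.dta", ///
                keep(match) nogen
            
            * weight the precip data by the grid area
            gen p_area = p * grid_area
            
            * by() includes _ID, so each state gets its own weighted average
            collapse (count) pN=p (sum) p_area grid_area, by(year _ID)
            
            * mean precip in millimeters
            gen p_mean = p_area / grid_area
            
            merge m:1 _ID using "us_state_data.dta", ///
                keepusing(NAME) keep(match) nogen
            rename NAME state
            isid state year, sort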


            • #21
              Dear Robert, Thanks again. I will go through the whole procedure to make sure that I understand what's going on.

              Ho-Chuan (River) Huang
              Stata 17.0, MP(4)


              • #22
                Hi Robert,

                Thank you for sharing your code for extracting precipitation data. I am using precip_2017.tar available here http://climate.geog.udel.edu/~climat...oad.html#T2017.

                Code:
                clear all
                shp2dta using "TM_WORLD_BORDERS-0.3.shp", data("world_data.dta") coor("world_coor.dta") replace
                
                * import weather data one time period 
                infile lon lat p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 using "precip.2017", clear
                    
                * reduce to grid coordinates
                keep lat lon
                order lat lon
                
                keep if lat >= -90 & lat <= 90
                keep if lon >= -180 & lon <= 180
                
                * match grid coordinates to countries
                geoinpoly lat lon using "world_coor.dta"
                
                * drop a few grid points that did not match any country
                drop if mi(_ID)
                
                * show the matching grid points
                scatter lat lon, msize(vtiny)
                graph export "world_grid.png", width(800) replace
                save "world_grid.dta", replace

                I first got an error about the latitude and longitude being out of bounds, which is why I have restricted them to ±90 and ±180. Is this due to some modification of the original dataset?

                Next, the following works fine for me up until the runby command from your code in #13 above.

                Code:
                filelist , dir("<path>")
                
                * extract the year from the file name
                gen year = real(subinstr(filename,"precip.","",1))
                assert !mi(year)
                sum year
                
                * select a range of years to process
                keep if inrange(year, 2010, 2017)
                
                program input_each_year
                    // copy file path and year to local macros
                    local filename = dirname + "/" + filename
                    local thisyear = year
                    
                    dis "---------------------------- processing `filename'"
                    infile lon lat p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 using "`filename'", clear
                    gen int year = `thisyear'
                    
                    merge 1:1 lat lon using "world_grid.dta", ///
                        assert(master match) keep(match) nogen
                    
                    quietly reshape long p, i(lat lon) j(month)
                    order lat lon year month p
                    isid lat lon year month, sort
                  
                end
                runby input_each_year, by(year) verbose status

                This is the error I'm getting: "variables lat lon do not uniquely identify observations in the master data"

                Any help would be highly appreciated.


                • #23
                  The precipitation data uses runs of spaces to separate data points. These are plain text files that can be viewed using any text editor. You can also view the first few lines using Stata's type command:

                  Code:
                  . type "precip_2017/precip.1900", lines(10)
                  -179.750  71.250     0.0     2.3     1.1     3.9     4.9    15.6    11.5    37.0    17.1    29.4    23.1     7.1   153.0
                  -179.750  68.750     9.9     6.0     5.7     9.5     9.7    23.2    21.5    53.8    33.6    50.9    44.9    19.0   287.7
                  -179.750  68.250    10.7     7.5     6.0     9.8     9.0    24.6    22.1    56.0    33.1    52.2    44.1    19.4   294.5
                  -179.750  67.750    15.1    11.7     8.0    11.8     9.0    28.5    26.8    61.2    34.9    54.0    44.4    22.4   327.8
                  -179.750  67.250    21.2    16.8    10.7    14.4     9.1    32.7    32.0    66.9    37.2    56.8    46.3    27.0   371.1
                  -179.750  66.750    19.4    16.9    10.5    13.6    10.6    36.6    38.4    77.1    39.6    59.8    45.1    24.3   391.9
                  -179.750  66.250    14.7    15.8    10.0    12.9    14.1    42.6    50.0    93.2    43.8    64.5    43.6    18.7   423.9
                  -179.750  65.750    12.4    14.5     7.5     9.7     9.3    38.0    38.9    86.9    37.5    61.4    37.4    15.1   368.6
                  -179.750  65.250    13.5    13.1     8.2     9.2     7.6    35.7    32.0    81.9    31.6    63.6    36.7    15.6   348.7
                  -179.750 -16.750   381.8   342.0   400.4   210.4    40.4   187.6    64.2   243.3     0.0   106.1   200.7   124.9  2301.8
                  .
                  Stata's infile command reads "free-format" data. The 2017 version of the precipitation data contains an extra variable compared to the 2014 version. This extra variable stores the sum of the 12 monthly data points and as such is redundant and can be discarded, but you need to include it for the data to be correctly imported into Stata.

                  The following works with freshly downloaded data:

                  Code:
                  * the shapefile for the world was downloaded from
                  * http://thematicmapping.org/downloads/world_borders.php
                  * and clicking on "TM_WORLD_BORDERS-0.3.zip"
                  shp2dta using "TM_WORLD_BORDERS-0.3/TM_WORLD_BORDERS-0.3.shp", ///
                      data("world_data.dta") coor("world_coor.dta") replace
                  
                  * import weather data one time period 
                  infile lon lat p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 ptotal ///
                      using "precip_2017/precip.1900", clear
                      
                  * reduce to grid coordinates
                  keep lat lon
                  order lat lon
                  
                  * match grid coordinates to countries
                  geoinpoly lat lon using "world_coor.dta"
                  
                  * drop a few grid points that did not match any country
                  drop if mi(_ID)
                  
                  * show the matching grid points
                  scatter lat lon, msize(vtiny)
                  graph export "world_grid.png", width(800) replace
                  save "world_grid.dta", replace
                  The following extracts the desired sample:

                  Code:
                  clear all
                  
                  filelist , dir("precip_2017")
                  
                  * extract the year from the file name
                  gen year = real(subinstr(filename,"precip.","",1))
                  assert !mi(year)
                  sum year
                  
                  * select a range of years to process
                  keep if inrange(year, 2010, 2017)
                  
                  program input_each_year
                      // copy file path and year to local macros
                      local filename = dirname + "/" + filename
                      local thisyear = year
                      
                      dis "---------------------------- processing `filename'"
                      infile lon lat p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 ptotal using "`filename'", clear
                      gen int year = `thisyear'
                      drop ptotal
                  
                      merge 1:1 lat lon using "world_grid.dta", ///
                          assert(master match) keep(match) nogen
                      
                      quietly reshape long p, i(lat lon) j(month)
                      order lat lon year month p
                      isid lat lon year month, sort
                    
                  end
                  runby input_each_year, by(year) verbose status
                  
                  isid lat lon year month, sort
                  save "precip_monthly_grid_2010_to_2017.dta", replace


                  • #24
                    Dear Robert,

                    This is working perfectly now. Thank you very much for your help and guidance - really appreciate it!


                    • #25
                      Hi Robert,

                      Your code is amazing and it really helps me a lot. I'm using the same data, precip_2017.tar, available from http://climate.geog.udel.edu/~climat...oad.html#T2017. I'm working on weighting the data by population density, using the gridded population data gpw-v4-population-density-rev11_2010_30_sec_tif.zip available from https://sedac.ciesin.columbia.edu/da...lection/gpw-v4. Do you know how to get population-weighted precipitation using Stata?
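
                      What I have in mind, by analogy with the grid_area weighting in #18, is something like the following (pop_grid.dta with a population count per 0.5-degree grid point is a hypothetical name; aggregating the GPW raster to that grid is the part I am missing):

                      Code:
                      * hypothetical sketch: population-weighted precipitation by
                      * country-year, assuming pop_grid.dta (lat lon pop) has been
                      * built on the same 0.5-degree grid
                      use "world_grid.dta", clear
                      merge 1:1 lat lon using "pop_grid.dta", keep(match) nogen
                      merge 1:m lat lon using "precip_annual_grid_1900_to_2014.dta", ///
                          keep(match) nogen
                      
                      * weight each grid point's precip by its population
                      gen p_pop = p * pop
                      collapse (sum) p_pop pop, by(year _ID)
                      
                      * population-weighted mean precip in millimeters
                      gen p_popmean = p_pop / pop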

                      Any help would be highly appreciated.


                      • #26
                        Originally posted by Robert Picard View Post
                        I'll assume that you are asking this because you want to benchmark these data against data from other sources. I don't have experience with weather data, but I found a national time series for the contiguous US from NOAA on this page. Choose the following options:
                        Code:
                        Parameter: Precipitation
                        Time Scale: 12-Month
                        Month: December
                        Start Year: 1900
                        End Year: 2014
                        Options > Display Base Period 1900 - 2014
                        You can't simply add up the precipitation values for each grid point because the data was matched to grid locations defined using raw geographic coordinates. The problem is that the area around each grid point is not constant, due to the fact that lines of longitude converge toward the poles whereas lines of latitude are parallel to each other. The distance between 0.5 degrees of latitude is the same all over the world (the arguments for geodist are lat1 lon1 lat2 lon2):
                        Code:
                        . geodist 0 0 0.5 0, sphere
                        Great-circle distance (haversine formula, radius of 6371km) = 55.597463 km
                        
                        . geodist 0 99 0.5 99, sphere
                        Great-circle distance (haversine formula, radius of 6371km) = 55.597463 km
                        
                        . geodist 50 33 50.5 33, sphere
                        Great-circle distance (haversine formula, radius of 6371km) = 55.597463 km
                        The distance between 0.5 degrees of longitude depends on the latitude and becomes 0 at the poles:
                        Code:
                        . geodist 0 0 0 0.5, sphere
                        Great-circle distance (haversine formula, radius of 6371km) = 55.597463 km
                        
                        . geodist 45 0 45 0.5, sphere
                        Great-circle distance (haversine formula, radius of 6371km) = 39.313281 km
                        
                        . geodist 90 0 90 0.5, sphere
                        Great-circle distance (haversine formula, radius of 6371km) = 3.404e-15 km
                        So before you can calculate an overall statistic for the US from all these grid points, you must estimate the area around each grid point (the area of the grid point's Voronoi cell). Around the equator, this will be more or less a square box of 55.6 x 55.6 square km. On a sphere, the distance as you walk along a line of latitude is proportional to the cosine of the latitude, so the width of the box for a grid point at 45 degrees latitude would be (trig functions require angles in radians):
                        Code:
                        . scalar d2r = c(pi) / 180
                        
                        . dis 0.5 * d2r * cos(45 * d2r) * 6371
                        39.313343
                        which is very close to the "as the crow flies" distance calculated by geodist above.

                        Since this is an interesting exercise, here's how I would replicate NOAA's time series using the gridded dataset. First, you will need a shapefile that you can use to select grid points that fall within the 48 conterminous (contiguous) US states. You can get this from the Census using this link. Once unzipped, you can create the Stata datasets using:
                        Code:
                        * source: http://www2.census.gov/geo/tiger/GENZ2017/shp/cb_2017_us_state_500k.zip
                        shp2dta using "cb_2017_us_state_500k/cb_2017_us_state_500k.shp", ///
                        data("us_data.dta") coor("us_coor.dta") replace
                        
                        * create a subset that includes only the 48 contiguous us states
                        use "us_data.dta", clear
                        list NAME if inlist(_ID, 17,19,39,42,46,52,56,48)
                        drop if inlist(_ID, 17,19,39,42,46,52,56,48)
                        save "us_cont_data.dta", replace
                        
                        use "us_coor.dta", clear
                        gen long obs = _n
                        drop if inlist(_ID, 17,19,39,42,46,52,56,48)
                        save "us_cont_coor.dta", replace
                        The next step is to calculate the area around each grid point (its Voronoi cell). I don't know how to do this in spherical geometry, but you can do a planar approximation using rectangles. The height of each rectangle is the same and corresponds to the distance between two points at the same longitude but 0.5 degrees of latitude apart. The width of each rectangle is proportional to the cosine of the latitude. As before, the grid points are extracted from one of the time periods.
                        Code:
                        * import weather data one time period
                        infile lon lat p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 ///
                        using "precip_2014/precip.1900", clear
                        
                        * reduce to grid coordinates
                        keep lat lon
                        order lat lon
                        isid lat lon, sort
                        
                        * decimal degree to radians constant
                        scalar d2r = c(pi) / 180
                        
                        * distance between .5 degrees of latitude, constant longitude
                        geodist 0 0 .5 0, sphere
                        scalar ydist = r(distance)
                        
                        * distance between .5 degrees of longitude, constant latitude
                        gen xdist = 0.5 * d2r * cos(lat * d2r) * 6371
                        
                        * double-check using shortest path distance
                        gen lon1 = lon - .25
                        gen lon2 = lon + .25
                        geodist lat lon1 lat lon2, sphere gen(dcheck)
                        gen diff = abs(xdist-dcheck)
                        sum diff
                        
                        * the area of the grid point's Voronoi diagram
                        gen grid_area = xdist * ydist
                        sum grid_area
                        
                        keep lat lon grid_area
                        save "grid_area.dta", replace
                        Here are the descriptive statistics (grid_area is in square km):
                        Code:
                        . gen diff = abs(xdist-dcheck)
                        
                        . sum diff
                        
                            Variable |        Obs        Mean    Std. Dev.       Min        Max
                        -------------+---------------------------------------------------------
                                diff |     85,794    .0000378     .0000223   6.03e-08    .000069
                        
                        .
                        . * the area of the grid point's Voronoi diagram
                        . gen grid_area = xdist * ydist
                        
                        . sum grid_area
                        
                            Variable |        Obs        Mean    Std. Dev.       Min        Max
                        -------------+---------------------------------------------------------
                           grid_area |     85,794      1714.6    997.3284   13.48733   3091.049
                        Now you can match these grid points to the shapefile of the 48 conterminous states. Because the shapefile contains data on land area, we can compare the total area as calculated using grid_area with the area of each state from the shapefile.
                        Code:
                        use "grid_area.dta", clear
                        
                        * match grid coordinates to contiguous US states
                        geoinpoly lat lon using "us_cont_coor.dta"
                        
                        * drop grid points that did not match
                        drop if mi(_ID)
                        
                        * show the matching grid points
                        scatter lat lon, msize(vtiny)
                        graph export "us_cont_grid.png", width(800) replace
                        
                        save "us_cont_grid.dta", replace
                        
                        * get the state's area from the shapefile
                        merge m:1 _ID using "us_cont_data.dta", assert(match) keepusing(NAME ALAND) nogen
                        
                        * the land area is in square meters
                        gen alandkm2 = ALAND / 1000^2
                        
                        * calculate grid area per state
                        isid _ID lat lon, sort
                        by _ID: gen one = _n == 1
                        by _ID: egen grid_area_state = total(grid_area)
                        
                        * the difference in % by state
                        gen diff = abs(grid_area_state - alandkm2) / alandkm2 * 100
                        sum diff if one
                        
                        * overall
                        sum alandkm2 if one
                        dis %12.0gc r(sum)
                        
                        sum grid_area
                        dis %12.0gc r(sum)
                        The map of the grid points that matched one of the 48 conterminous states:

                        [Attached image: map of the matching grid points]

                        and here are the results from the descriptive stats:
                        Code:
                        . sum diff if one
                        
                            Variable |        Obs        Mean    Std. Dev.       Min        Max
                        -------------+---------------------------------------------------------
                                diff |         48    6.659204    9.395654   .0035655   42.27459
                        
                        .
                        . * overall
                        . sum alandkm2 if one
                        
                            Variable |        Obs        Mean    Std. Dev.       Min        Max
                        -------------+---------------------------------------------------------
                            alandkm2 |         48    159443.2      121115   2677.998   676641.9
                        
                        . dis %12.0gc r(sum)
                        7,653,274.2
                        
                        .
                        . sum grid_area
                        
                            Variable |        Obs        Mean    Std. Dev.       Min        Max
                        -------------+---------------------------------------------------------
                           grid_area |      3,266    2378.367    180.2581   2038.089   2795.741
                        
                        . dis %12.0gc r(sum)
                         7,767,746.4
                        The big difference (max 42.27%) when calculated by state is due to Delaware, a very small state. Overall, however, the total areas are pretty close.

                        We can now extract from the full grid dataset the data for these US grid points and calculate average precipitation per year:
                        Code:
                        use "us_cont_grid.dta", clear
                        merge 1:m lat lon using "precip_annual_grid_1900_to_2014.dta", ///
                        keep(match) nogen
                        
                        * use the grid area to weight the precip data
                        gen p_area = p * grid_area
                        
                        collapse (count) pN=p (sum) p_area grid_area, by(year)
                        
                        * mean precip in millimeters
                        gen p_mean = p_area / grid_area
                        
                        * mean precip in inches
                        gen p_mean_in = p_mean * 0.03937007874
                        
                        list if year >= 2010
                        and the results:
                        Code:
                        . list if year >= 2010
                        
                             +---------------------------------------------------------+
                             | year     pN     p_area   grid_a~a     p_mean   p_mean~n |
                             |---------------------------------------------------------|
                        111. | 2010   3263   6.02e+09    7760198   775.2827   30.52294 |
                        112. | 2011   3263   5.80e+09    7760198   746.9092   29.40587 |
                        113. | 2012   3263   5.26e+09    7760198   678.1751   26.69981 |
                        114. | 2013   3263   5.99e+09    7760198   772.0376   30.39518 |
                        115. | 2014   3263   5.92e+09    7760198   762.9057   30.03566 |
                             +---------------------------------------------------------+
                        The following are NOAA's estimates from the web page referenced at the top of this post:
                        Code:
                        201001 - 201012 31.37
                        201101 - 201112 30.10
                        201201 - 201212 27.53
                        201301 - 201312 31.06
                        201401 - 201412 30.85
                        Pretty close from my point of view.

                        Dear Robert,

                        Thanks a lot for your very detailed explanation and for the code. They were extremely useful!

                        I am working on a project for which I am constructing a dataset of monthly rainfall for districts in India, using the updated version of the database you use in your example (precip_2017.tar.gz), available from http://climate.geog.udel.edu/~climat...oad.html#T2017.

                        I follow your code step by step, and when I use the grid points to calculate average precipitation per year for India as a whole, weighting the value of each grid point by the grid_area variable, I get decent estimates compared to official sources on average yearly rainfall for the whole country. However, when I go down to the district level the estimates are not so accurate. I was wondering if it has to do with the way the spatial join is performed in your code using the geoinpoly command. If I understand correctly, you do the spatial join based on the centroids of the grid (that is, the grid points), so in my case each point gets assigned to one or more districts, based on whether it lies in the interior of a district or hits the boundary between two or more districts. However, this does not take into account that the area around the grid point could "contain" more than the district where the centroid is located, and as such an incorrect amount of rainfall may be assigned to a district, given your observation that grid locations are defined using raw geographic coordinates and that the area around each grid point is not constant because lines of longitude converge toward the poles whereas lines of latitude are parallel to each other.

                        My question is: is there a way to "draw" the area around the grid point that you calculate in the grid_area variable, such that when I do the spatial join I overlay not just the centroids but the whole polygon (area) around each grid point on the shapefile of the district boundaries?

                        Thanks!
                        [Attached image: Screenshot 2022-08-08 121909.jpg]


                        • #27
                          Dear Nicolas,

                          I'm trying to construct a monthly rainfall dataset at the district level in India. I'm also using the same data, but an earlier version, i.e., precip_2014.tar.gz.

                          After reading your post, I was wondering, did you find a solution to get more accurate estimates at the district level? Any advice would be much appreciated!

                          Thank you!
