  • #16
    Robert may have the makings of a Stata Journal article from a compilation of these recent posts in this and the parallel thread, if he is so inclined.

    In any event, he provides a neat counterexample to the aphorism (common in the USA, at least) that "everybody talks about the weather but nobody does anything about it". Or at least in this case, "does anything with it".


    • #17
      Dear Robert,
      1. Thank you so much for this comprehensive illustration of how to obtain annual rainfall data (average precipitation per year) in the USA.
      2. A related question: can we obtain the average precipitation per year for each state in the U.S.?
      3. It seems to me that, at the moment, there is no way to calculate the average precipitation per year for each country around the world. Suppose I can find the area of each country; can I divide the total annual rainfall data in #13 (collapse (sum) p, by(_ID year)) by that area to get the average precipitation per year for each country?
      Ho-Chuan (River) Huang
      Stata 17.0, MP(4)


      • #18
        Please read #15 again carefully. You can calculate the average rainfall for any geographical unit you want, provided that you weight the value of each grid point by the area around it. The grid points nearest the poles cover only 13.49 square kilometers, while the ones next to the equator cover 3091.05 square kilometers (see the results of sum grid_area in #15).

        Since most of the work is already done, here's a recap applied to finding the average precipitation per year for each country.

        Code:
        * datasets are from #10, #13, and #17
        use "world_grid.dta", clear
        merge 1:1 lat lon using "grid_area.dta", keep(match) nogen
        
        merge 1:m lat lon using "precip_annual_grid_1900_to_2014.dta", ///
            keep(match) nogen
        
        * use the grid area to weight the precip data
        gen p_area = p * grid_area
        
        collapse (count) pN=p (sum) p_area grid_area, by(year _ID)
        
        * mean precip in millimeters
        gen p_mean = p_area / grid_area
        
        merge m:1 _ID using "world_data.dta", ///
            keepusing(NAME ISO3) assert(match using) keep(match) nogen
        rename (NAME ISO3) (country country3)
        isid country year, sort


        • #19
          Dear Robert,
          1. Thank you so much. It works just fine.
          2. By the way, is it possible (and how) to construct annual rainfall data for each state in the U.S.?
          Ho-Chuan (River) Huang
          Stata 17.0, MP(4)


          • #20
            Yup, it's exactly the same procedure, since each state has a different _ID.
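
            A hypothetical sketch of that procedure (the file names us_state_grid.dta and us_state_data.dta are illustrative; you would build them from a Census states shapefile with shp2dta and geoinpoly, just as world_grid.dta and world_data.dta were built for countries):

            Code:
            * hypothetical sketch: same recipe as #18, with grid points matched
            * to a US states shapefile instead of country borders
            use "us_state_grid.dta", clear
            merge 1:1 lat lon using "grid_area.dta", keep(match) nogen
            merge 1:m lat lon using "precip_annual_grid_1900_to_2014.dta", ///
                keep(match) nogen
            
            * weight the precip data by the grid area
            gen p_area = p * grid_area
            
            * by() includes _ID, so each state gets its own weighted average
            collapse (count) pN=p (sum) p_area grid_area, by(year _ID)
            
            * mean precip in millimeters
            gen p_mean = p_area / grid_area
            
            merge m:1 _ID using "us_state_data.dta", ///
                keepusing(NAME) keep(match) nogen
            rename NAME state
            isid state year, sort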


            • #21
              Dear Robert, Thanks again. I will go through the whole procedure to make sure that I understand what's going on.

              Ho-Chuan (River) Huang
              Stata 17.0, MP(4)


              • #22
                Hi Robert,

                Thank you for sharing your code for extracting precipitation data. I am using precip_2017.tar available here http://climate.geog.udel.edu/~climat...oad.html#T2017.

                Code:
                clear all
                shp2dta using "TM_WORLD_BORDERS-0.3.shp", data("world_data.dta") coor("world_coor.dta") replace
                
                * import weather data one time period 
                infile lon lat p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 using "precip.2017", clear
                    
                * reduce to grid coordinates
                keep lat lon
                order lat lon
                
                keep if lat >= -90 & lat <= 90
                keep if lon >= -180 & lon <= 180
                
                * match grid coordinates to countries
                geoinpoly lat lon using "world_coor.dta"
                
                * drop a few grid points that did not match any country
                drop if mi(_ID)
                
                * show the matching grid points
                scatter lat lon, msize(vtiny)
                graph export "world_grid.png", width(800) replace
                save "world_grid.dta", replace

                I first got an error about the latitude and longitude being out of bounds, which is why I have restricted them to ±90 and ±180. Is this due to some modification of the original dataset?

                Next, the following works fine for me up until the runby command from your code in #13 above.

                Code:
                filelist , dir("<path>")
                
                * extract the year from the file name
                gen year = real(subinstr(filename,"precip.","",1))
                assert !mi(year)
                sum year
                
                * select a range of years to process
                keep if inrange(year, 2010, 2017)
                
                program input_each_year
                    // copy file path and year to local macros
                    local filename = dirname + "/" + filename
                    local thisyear = year
                    
                    dis "---------------------------- processing `filename'"
                    infile lon lat p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 using "`filename'", clear
                    gen int year = `thisyear'
                    
                    merge 1:1 lat lon using "world_grid.dta", ///
                        assert(master match) keep(match) nogen
                    
                    quietly reshape long p, i(lat lon) j(month)
                    order lat lon year month p
                    isid lat lon year month, sort
                  
                end
                runby input_each_year, by(year) verbose status

                This is the error I'm getting: "variables lat lon do not uniquely identify observations in the master data"

                Any help would be highly appreciated.


                • #23
                  The precipitation data uses runs of spaces to separate data points. These are plain text files that can be viewed using any text editor. You can also view the first few lines using Stata's type command:

                  Code:
                  . type "precip_2017/precip.1900", lines(10)
                  -179.750  71.250     0.0     2.3     1.1     3.9     4.9    15.6    11.5    37.0    17.1    29.4    23.1     7.1   153.0
                  -179.750  68.750     9.9     6.0     5.7     9.5     9.7    23.2    21.5    53.8    33.6    50.9    44.9    19.0   287.7
                  -179.750  68.250    10.7     7.5     6.0     9.8     9.0    24.6    22.1    56.0    33.1    52.2    44.1    19.4   294.5
                  -179.750  67.750    15.1    11.7     8.0    11.8     9.0    28.5    26.8    61.2    34.9    54.0    44.4    22.4   327.8
                  -179.750  67.250    21.2    16.8    10.7    14.4     9.1    32.7    32.0    66.9    37.2    56.8    46.3    27.0   371.1
                  -179.750  66.750    19.4    16.9    10.5    13.6    10.6    36.6    38.4    77.1    39.6    59.8    45.1    24.3   391.9
                  -179.750  66.250    14.7    15.8    10.0    12.9    14.1    42.6    50.0    93.2    43.8    64.5    43.6    18.7   423.9
                  -179.750  65.750    12.4    14.5     7.5     9.7     9.3    38.0    38.9    86.9    37.5    61.4    37.4    15.1   368.6
                  -179.750  65.250    13.5    13.1     8.2     9.2     7.6    35.7    32.0    81.9    31.6    63.6    36.7    15.6   348.7
                  -179.750 -16.750   381.8   342.0   400.4   210.4    40.4   187.6    64.2   243.3     0.0   106.1   200.7   124.9  2301.8
                  .
                  Stata's infile command reads "free-format" data. The 2017 version of the precipitation data contains an extra variable compared to the 2014 version. This extra variable stores the sum of the 12 monthly data points and as such is redundant and can be discarded, but you need to include it for the data to be correctly imported into Stata.

                  The following works with freshly downloaded data:

                  Code:
                  * the shapefile for the world was downloaded from
                  * http://thematicmapping.org/downloads/world_borders.php
                  * and clicking on "TM_WORLD_BORDERS-0.3.zip"
                  shp2dta using "TM_WORLD_BORDERS-0.3/TM_WORLD_BORDERS-0.3.shp", ///
                      data("world_data.dta") coor("world_coor.dta") replace
                  
                  * import weather data one time period 
                  infile lon lat p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 ptotal ///
                      using "precip_2017/precip.1900", clear
                      
                  * reduce to grid coordinates
                  keep lat lon
                  order lat lon
                  
                  * match grid coordinates to countries
                  geoinpoly lat lon using "world_coor.dta"
                  
                  * drop a few grid points that did not match any country
                  drop if mi(_ID)
                  
                  * show the matching grid points
                  scatter lat lon, msize(vtiny)
                  graph export "world_grid.png", width(800) replace
                  save "world_grid.dta", replace
                  The following extracts the desired sample:

                  Code:
                  clear all
                  
                  filelist , dir("precip_2017")
                  
                  * extract the year from the file name
                  gen year = real(subinstr(filename,"precip.","",1))
                  assert !mi(year)
                  sum year
                  
                  * select a range of years to process
                  keep if inrange(year, 2010, 2017)
                  
                  program input_each_year
                      // copy file path and year to local macros
                      local filename = dirname + "/" + filename
                      local thisyear = year
                      
                      dis "---------------------------- processing `filename'"
                      infile lon lat p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 ptotal using "`filename'", clear
                      gen int year = `thisyear'
                      drop ptotal
                  
                      merge 1:1 lat lon using "world_grid.dta", ///
                          assert(master match) keep(match) nogen
                      
                      quietly reshape long p, i(lat lon) j(month)
                      order lat lon year month p
                      isid lat lon year month, sort
                    
                  end
                  runby input_each_year, by(year) verbose status
                  
                  isid lat lon year month, sort
                  save "precip_monthly_grid_2010_to_2017.dta", replace


                  • #24
                    Dear Robert,

                    This is working perfectly now. Thank you very much for your help and guidance - really appreciate it!


                    • #25
                      Hi Robert,

                      Your code is amazing and it really helps me a lot. I'm using the same data, precip_2017.tar, available from http://climate.geog.udel.edu/~climat...oad.html#T2017. I'm working on weighting the data by population density, using the gridded population data gpw-v4-population-density-rev11_2010_30_sec_tif.zip available from https://sedac.ciesin.columbia.edu/da...lection/gpw-v4. Do you know how to get population-weighted precipitation using Stata?
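
                      What I have in mind, by analogy with the grid_area weighting in #18, is something like the following (pop_grid.dta with a population count per 0.5-degree grid point is a hypothetical name; aggregating the GPW raster to that grid is the part I am missing):

                      Code:
                      * hypothetical sketch: population-weighted precipitation by
                      * country-year, assuming pop_grid.dta (lat lon pop) has been
                      * built on the same 0.5-degree grid
                      use "world_grid.dta", clear
                      merge 1:1 lat lon using "pop_grid.dta", keep(match) nogen
                      merge 1:m lat lon using "precip_annual_grid_1900_to_2014.dta", ///
                          keep(match) nogen
                      
                      * weight each grid point's precip by its population
                      gen p_pop = p * pop
                      collapse (sum) p_pop pop, by(year _ID)
                      
                      * population-weighted mean precip in millimeters
                      gen p_popmean = p_pop / pop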

                      Any help would be highly appreciated.


                      • #26
                        Originally posted by Robert Picard View Post
                        I'll assume that you are asking this because you want to benchmark these data against data from other sources. I don't have experience with weather data, but I found a national time series for the contiguous US from NOAA on this page. Choose the following options:
                        Code:
                        Parameter: Precipitation
                        Time Scale: 12-Month
                        Month: December
                        Start Year: 1900
                        End Year: 2014
                        Options > Display Base Period 1900 - 2014
                        You can't simply add up the precipitation values for each grid point because the data was matched to grid locations defined using raw geographic coordinates. The problem is that the area around each grid point is not constant, due to the fact that lines of longitude converge toward the poles whereas lines of latitude are parallel to each other. The distance between 0.5 degrees of latitude is the same all over the world (the arguments for geodist are lat1 lon1 lat2 lon2):
                        Code:
                        . geodist 0 0 0.5 0, sphere
                        Great-circle distance (haversine formula, radius of 6371km) = 55.597463 km
                        
                        . geodist 0 99 0.5 99, sphere
                        Great-circle distance (haversine formula, radius of 6371km) = 55.597463 km
                        
                        . geodist 50 33 50.5 33, sphere
                        Great-circle distance (haversine formula, radius of 6371km) = 55.597463 km
                        The distance between 0.5 degrees of longitude depends on the latitude and becomes 0 at the poles:
                        Code:
                        . geodist 0 0 0 0.5, sphere
                        Great-circle distance (haversine formula, radius of 6371km) = 55.597463 km
                        
                        . geodist 45 0 45 0.5, sphere
                        Great-circle distance (haversine formula, radius of 6371km) = 39.313281 km
                        
                        . geodist 90 0 90 0.5, sphere
                        Great-circle distance (haversine formula, radius of 6371km) = 3.404e-15 km
                        So before you can calculate an overall statistic for the US from all these grid points, you must estimate the area around each grid point (the area of the grid point's Voronoi cell). Around the equator, this will be more or less a square box of 55.6 x 55.6 square km. On a sphere, the distance as you walk along a line of latitude is proportional to the cosine of the latitude, so the width of the box for a grid point at 45 degrees latitude would be (trig functions require angles in radians):
                        Code:
                        . scalar d2r = c(pi) / 180
                        
                        . dis 0.5 * d2r * cos(45 * d2r) * 6371
                        39.313343
                        which is very close to the "as the crow flies" distance calculated by geodist above.

                        Since this is an interesting exercise, here's how I would replicate NOAA's time series using the gridded dataset. First, you will need a shapefile that you can use to select grid points that fall within the 48 conterminous (contiguous) US states. You can get this from the Census using this link. Once unzipped, you can create the Stata datasets using:
                        Code:
                        * source: http://www2.census.gov/geo/tiger/GENZ2017/shp/cb_2017_us_state_500k.zip
                        shp2dta using "cb_2017_us_state_500k/cb_2017_us_state_500k.shp", ///
                        data("us_data.dta") coor("us_coor.dta") replace
                        
                        * create a subset that includes only the 48 contiguous us states
                        use "us_data.dta", clear
                        list NAME if inlist(_ID, 17,19,39,42,46,52,56,48)
                        drop if inlist(_ID, 17,19,39,42,46,52,56,48)
                        save "us_cont_data.dta", replace
                        
                        use "us_coor.dta", clear
                        gen long obs = _n
                        drop if inlist(_ID, 17,19,39,42,46,52,56,48)
                        save "us_cont_coor.dta", replace
                        The next step is to calculate the area around each grid point (its Voronoi cell). I don't know how to do this in spherical geometry, but you can do a planar approximation using rectangles. The height of each rectangle is the same and corresponds to the distance between two points at the same longitude but 0.5 degrees of latitude apart. The width of each rectangle is proportional to the cosine of the latitude. As before, the grid points are extracted from one of the time periods.
                        Code:
                        * import weather data one time period
                        infile lon lat p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 ///
                        using "precip_2014/precip.1900", clear
                        
                        * reduce to grid coordinates
                        keep lat lon
                        order lat lon
                        isid lat lon, sort
                        
                        * decimal degree to radians constant
                        scalar d2r = c(pi) / 180
                        
                        * distance between .5 degrees of latitude, constant longitude
                        geodist 0 0 .5 0, sphere
                        scalar ydist = r(distance)
                        
                        * distance between .5 degrees of longitude, constant latitude
                        gen xdist = 0.5 * d2r * cos(lat * d2r) * 6371
                        
                        * double-check using shortest path distance
                        gen lon1 = lon - .25
                        gen lon2 = lon + .25
                        geodist lat lon1 lat lon2, sphere gen(dcheck)
                        gen diff = abs(xdist-dcheck)
                        sum diff
                        
                        * the area of the grid point's Voronoi diagram
                        gen grid_area = xdist * ydist
                        sum grid_area
                        
                        keep lat lon grid_area
                        save "grid_area.dta", replace
                        Here are the descriptive statistics (grid_area is in square km):
                        Code:
                        . gen diff = abs(xdist-dcheck)
                        
                        . sum diff
                        
                            Variable |        Obs        Mean    Std. Dev.       Min        Max
                        -------------+---------------------------------------------------------
                                diff |     85,794    .0000378     .0000223   6.03e-08    .000069
                        
                        .
                        . * the area of the grid point's Voronoi diagram
                        . gen grid_area = xdist * ydist
                        
                        . sum grid_area
                        
                            Variable |        Obs        Mean    Std. Dev.       Min        Max
                        -------------+---------------------------------------------------------
                           grid_area |     85,794      1714.6    997.3284   13.48733   3091.049
                        Now you can match these grid points to the shapefile of the 48 conterminous states. Because the shapefile contains data on land area, we can compare the total area as calculated using grid_area with the area of each state from the shapefile.
                        Code:
                        use "grid_area.dta", clear
                        
                        * match grid coordinates to contiguous US states
                        geoinpoly lat lon using "us_cont_coor.dta"
                        
                        * drop grid points that did not match
                        drop if mi(_ID)
                        
                        * show the matching grid points
                        scatter lat lon, msize(vtiny)
                        graph export "us_cont_grid.png", width(800) replace
                        
                        save "us_cont_grid.dta", replace
                        
                        * get the state's area from the shapefile
                        merge m:1 _ID using "us_cont_data.dta", assert(match) keepusing(NAME ALAND) nogen
                        
                        * the land area is in square meters
                        gen alandkm2 = ALAND / 1000^2
                        
                        * calculate grid area per state
                        isid _ID lat lon, sort
                        by _ID: gen one = _n == 1
                        by _ID: egen grid_area_state = total(grid_area)
                        
                        * the difference in % by state
                        gen diff = abs(grid_area_state - alandkm2) / alandkm2 * 100
                        sum diff if one
                        
                        * overall
                        sum alandkm2 if one
                        dis %12.0gc r(sum)
                        
                        sum grid_area
                        dis %12.0gc r(sum)
                        The map of the grid points that matched one of the 48 conterminous states:

                        [Attached image: map of the matching grid points]

                        and here are the results from the descriptive stats:
                        Code:
                        . sum diff if one
                        
                            Variable |        Obs        Mean    Std. Dev.       Min        Max
                        -------------+---------------------------------------------------------
                                diff |         48    6.659204    9.395654   .0035655   42.27459
                        
                        .
                        . * overall
                        . sum alandkm2 if one
                        
                            Variable |        Obs        Mean    Std. Dev.       Min        Max
                        -------------+---------------------------------------------------------
                            alandkm2 |         48    159443.2      121115   2677.998   676641.9
                        
                        . dis %12.0gc r(sum)
                        7,653,274.2
                        
                        .
                        . sum grid_area
                        
                            Variable |        Obs        Mean    Std. Dev.       Min        Max
                        -------------+---------------------------------------------------------
                           grid_area |      3,266    2378.367    180.2581   2038.089   2795.741
                        
                        . dis %12.0gc r(sum)
                         7,767,746.4
                        The big difference (max 42.27%) when calculated by state is due to Delaware, a very small state. Overall, however, the total areas are pretty close.

                        We can now extract from the full grid dataset the data for these US grid points and calculate average precipitation per year:
                        Code:
                        use "us_cont_grid.dta", clear
                        merge 1:m lat lon using "precip_annual_grid_1900_to_2014.dta", ///
                        keep(match) nogen
                        
                        * use the grid area to weight the precip data
                        gen p_area = p * grid_area
                        
                        collapse (count) pN=p (sum) p_area grid_area, by(year)
                        
                        * mean precip in millimeters
                        gen p_mean = p_area / grid_area
                        
                        * mean precip in inches
                        gen p_mean_in = p_mean * 0.03937007874
                        
                        list if year >= 2010
                        and the results:
                        Code:
                        . list if year >= 2010
                        
                             +---------------------------------------------------------+
                             | year     pN     p_area   grid_a~a     p_mean   p_mean~n |
                             |---------------------------------------------------------|
                        111. | 2010   3263   6.02e+09    7760198   775.2827   30.52294 |
                        112. | 2011   3263   5.80e+09    7760198   746.9092   29.40587 |
                        113. | 2012   3263   5.26e+09    7760198   678.1751   26.69981 |
                        114. | 2013   3263   5.99e+09    7760198   772.0376   30.39518 |
                        115. | 2014   3263   5.92e+09    7760198   762.9057   30.03566 |
                             +---------------------------------------------------------+
                        The following are NOAA's estimates from the web page referenced at the top of this post:
                        Code:
                        201001 - 201012 31.37
                        201101 - 201112 30.10
                        201201 - 201212 27.53
                        201301 - 201312 31.06
                        201401 - 201412 30.85
                        Pretty close from my point of view.

                        Dear Robert,

                        Thanks a lot for your very detailed explanation and for the code. They were extremely useful!

                        I am working on a project for which I am constructing a dataset of monthly rainfall for districts in India, using the updated version of the database you use in your example (precip_2017.tar.gz), available from http://climate.geog.udel.edu/~climat...oad.html#T2017.

                        I follow your code step by step, and when I use the grid points to calculate average precipitation per year for India as a whole, weighting the value of each grid point by the grid_area variable, I get decent estimates compared to official sources on average yearly rainfall for the whole country. However, when I go down to the district level the estimates are not so accurate. I was wondering if it has to do with the way the spatial join is performed in your code using the geoinpoly command. If I understand correctly, you do the spatial join based on the centroids of the grid (that is, the grid points), so in my case each point gets assigned to one or more districts, based on whether it lies in the interior of a district or hits the boundary between two or more districts. However, this does not take into account that the area around the grid point could "contain" more than the district where the centroid is located, and as such an incorrect amount of rainfall may be assigned to a district, given your observation that grid locations are defined using raw geographic coordinates and that the area around each grid point is not constant because lines of longitude converge toward the poles whereas lines of latitude are parallel to each other.

                        My question is: is there a way to "draw" the area around the grid point that you calculate in the grid_area variable, such that when I do the spatial join I overlay not just the centroids but the whole polygon (area) around each grid point on the shapefile of the district boundaries?

                        Thanks!
                        [Attached image: Screenshot 2022-08-08 121909.jpg]


                        • #27
                          Dear Nicolas,

                          I'm trying to construct a monthly rainfall dataset at the district level in India. I'm also using the same data, but an earlier version, i.e., precip_2014.tar.gz.

                          After reading your post, I was wondering, did you find a solution to get more accurate estimates at the district level? Any advice would be much appreciated!

                          Thank you!
