  • Collecting historical climate data from NOAA

    Dear Statalist,

    In my research project, I need to collect monthly temperature and rainfall data from the University of Delaware's global gridded dataset. I am using ArcMap to match the global data with the shapefile of the country from which I want to extract climate data. However, the raw climate data run from 1900 to 2014, so I have to repeat this process year by year, which takes a very long time.


    I was wondering whether Stata provides any facilities for collecting such data. I hope that someone with experience working with environmental data can suggest a solution.

    Thank you!
    Anh

  • #2
    Dear Anh, I am also planning to construct annual rainfall panel data (across all countries and all years). How can ArcMap do that?

    Ho-Chuan (River) Huang
    Stata 17.0, MP(4)



    • #3
      You don't give the URL for the data, but I'll assume it is from this NOAA page. I downloaded the air temperature (Monthly Mean V4.01) dataset from that page.

      Once extracted, the data contain, for every month, the temperature (or rainfall), one observation for each land-based coordinate grid point. Here's a sample of the most recent time slice (Dec. 2014):
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float value str19 datetime float(latitude longitude)
      -29.8 "2014-12-01 00:00:00" 83.25 -43.25
      -29.9 "2014-12-01 00:00:00" 83.25 -42.75
      -32.3 "2014-12-01 00:00:00" 83.25 -39.75
      -29.5 "2014-12-01 00:00:00" 83.25 -38.75
      -29.7 "2014-12-01 00:00:00" 83.25 -38.25
      -29.4 "2014-12-01 00:00:00" 83.25 -37.75
      -29.6 "2014-12-01 00:00:00" 83.25 -37.25
      -29.7 "2014-12-01 00:00:00" 83.25 -36.75
        -30 "2014-12-01 00:00:00" 83.25 -36.25
      -30.4 "2014-12-01 00:00:00" 83.25 -35.75
      end
      and here are the descriptive statistics for the most recent time slice:
      Code:
      . sum 
      
          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
             value |     85,794   -5.637958    20.81772      -52.7       38.2
          datetime |          0
          latitude |     85,794    1.060109    57.84843     -89.75      83.25
         longitude |     85,794    18.43654    88.00834    -179.75     179.75
      
      .
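      For orientation, a full 0.5-degree global grid spans 360 latitude steps by 720 longitude steps; the 85,794 observations are the land-based subset, roughly a third of all possible cells:
      Code:
      * a 0.5-degree grid has 360 x 720 possible cells; only land-based
      * grid points carry data in this dataset
      display 360 * 720            // 259,200 possible cells
      display 85794 / (360 * 720)  // about .33 of cells are land-based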
      You can make a quick and dirty map of all grid points in this time slice using
      Code:
      scatter latitude longitude, msize(vtiny)
      and the results will look like:
      [Image: map.png — map of all grid points, Dec. 2014 slice]


      The first thing to understand is that the grid points are generated using 0.5-degree increments and represent raw geographic coordinates (latitude and longitude). These grid points do not change over time, so if you want to select data for a particular country, you need to do a spatial join of these grid coordinates with the shapefile of the desired country only once. You can do this in Stata using geoinpoly (from SSC). Assuming that you are interested only in grid points that fall within the USA, you'll need to find a shapefile that describes the boundaries of the country. You can find one here. You then need to convert this shapefile to Stata datasets using something like:
      Code:
      shp2dta using "cb_2016_us_nation_5m/cb_2016_us_nation_5m.shp", data("usa_data.dta") coor("usa_coor.dta")
      With the data for the most recent time slice in memory, you can now perform the spatial join in Stata using:
      Code:
      geoinpoly latitude longitude using "usa_coor.dta"
      The above runs in less than a second on my computer. Grid points not in the USA will have missing polygon identifiers. Again, we can do a quick map with grid points where the identifier is not missing using:
      Code:
      scatter latitude longitude if !mi(_ID), msize(vtiny)
      and the resulting map looks like:

      [Image: map_usa.png — grid points within the USA]


      You can then reduce the dataset to these USA coordinate grid points and save it as a USA matching dataset that you will use to merge with all time slices using something like:
      Code:
      keep if !mi(_ID)
      keep latitude longitude _ID
      save "usa_grid.dta", replace
      Now the hard part. The raw data is only available in netCDF format, and there are no Stata tools to read this type of file. I lost a fair amount of time trying to figure out the best way to wrangle a solution for this "simple" task. I managed to get something to work using NOAA's Weather and Climate Toolkit (WCT). I use a Mac, so the following instructions are Mac-specific. The first step is to download the WCT distribution and unzip it. The distribution includes a wct-export shell script that you can use to extract data. The process is a bit clumsy as each time slice has to be extracted separately. I used the following do-file to generate what they call a <listfile>, a file with a list of input/output/format for each extract:
      Code:
      * Create a list of export files, one per time slice, the format is
      * input file/dir/url,output file/dir/,config-xml-file/url
      clear
      set obs 1380
      gen rawdata = "air.mon.mean.v401.nc"
      gen outputfile = "results/temp_slice" + string(_n-1) + ".csv"
      gen config = "wctBatchConfig.xml:::SLICE=" + string(_n-1)
      export delim using "multiple_files.txt", novarnames replace
      There are 1,380 time slices in the "air.mon.mean.v401.nc" netCDF dataset (you can find this out by printing the metadata using the toolkit's wct-viewer tool). Time slices are numbered starting at 0. Here are the first 5 lines of "multiple_files.txt":
      Code:
      air.mon.mean.v401.nc,results/temp_slice0.csv,wctBatchConfig.xml:::SLICE=0
      air.mon.mean.v401.nc,results/temp_slice1.csv,wctBatchConfig.xml:::SLICE=1
      air.mon.mean.v401.nc,results/temp_slice2.csv,wctBatchConfig.xml:::SLICE=2
      air.mon.mean.v401.nc,results/temp_slice3.csv,wctBatchConfig.xml:::SLICE=3
      air.mon.mean.v401.nc,results/temp_slice4.csv,wctBatchConfig.xml:::SLICE=4
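      As a quick check, 1,380 is just the number of months in the series:
      Code:
      * 115 years (1900 through 2014) at 12 monthly slices per year
      display (2014 - 1900 + 1) * 12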
      The input file is always "air.mon.mean.v401.nc". The output files go into a subdirectory that I called "results", and each slice is saved in a separate csv-formatted file. Each extraction requires a configuration file ("wctBatchConfig.xml") that contains the parameters of the extraction. The ":::SLICE=" construct acts like a global macro to pass the time slice number to use. Before you can start the extraction, you need to adapt the "wctBatchConfig.xml" that is distributed with the tools to the specific needs at hand. You need to make the following three changes. The first:
      Code:
      <variable name="Total_precipitation"/>
      to
      Code:
      <variable name="air"/>
      The second change:
      Code:
      <time    lookupType="index" lookup="0"  />
      to
      Code:
      <time    lookupType="index" lookup="${SLICE}"  />
      The third change:
      Code:
               <minLat>NONE</minLat>
               <maxLat>NONE</maxLat>
               <minLon>NONE</minLon>
               <maxLon>NONE</maxLon>
      to
      Code:
               <minLat>-90.0</minLat>
               <maxLat>90.0</maxLat>
               <minLon>-360</minLon>
               <maxLon>360</maxLon>
      You then save this modified version in the directory that contains the dataset you want to extract from. Finally, you start the Terminal application and change the directory to the one that contains the "air.mon.mean.v401.nc" dataset, the "wctBatchConfig.xml" configuration file, and "multiple_files.txt". You must make sure that the "results" subdirectory exists and then run the following command:
      Code:
       /Users/robert/Documents/temp/wct-4.1.0/wct-export multiple_files.txt csv
      where /Users/robert/Documents/temp/wct-4.1.0/wct-export is the full path to the wct-export tool. The tool will do its thing and extract the full series of time slices; this takes around 3 minutes on my computer.

      To reduce to datasets with only USA grid points, you need a simple loop over the time slices. I'm putting the Stata datasets into a new subdirectory called "results_usa":
      Code:
      clear
      qui forvalues y = 0/1379 {
          import delimited using "results/temp_slice`y'.csv", clear
          merge 1:1 latitude longitude using "usa_grid.dta", assert(master match) keep(match) nogen
          gen date = daily(datetime, "YMDhms")
          format %td date
          drop datetime _ID
          compress
          save "results_usa/temp_slice`y'.dta", replace
      }
      Finally, you can combine all of these into a single Stata dataset using another loop:
      Code:
      clear
      qui forvalues y = 0/1379 {
          append using "results_usa/temp_slice`y'.dta"
      }
      save "temperature_usa_all.dta", replace
      This produces a dataset with 6,069,240 observations. If anyone has a better angle on how to extract this rectangular data from the raw netCDF dataset, please chime in.
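      As a consistency check, dividing the total by the 1,380 slices recovers the number of USA grid points retained in each slice:
      Code:
      * 6,069,240 rows over 1,380 monthly slices
      display 6069240 / 1380   // 4,398 USA grid points per slice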



      • #4
        Dear Robert, Many thanks for this helpful post. I am trying to follow your suggestion step by step. First, I downloaded the file `air.mon.mean.v401.nc' (236,148 KB). Next, I am having trouble importing this file into Stata. Could you kindly give some advice on this issue? Thanks.
        Ho-Chuan (River) Huang
        Stata 17.0, MP(4)



        • #5
          As I wrote in #3, that's the hard part, and I gave detailed instructions on how I did it. I do not know what you have tried or where you ran into trouble, so how am I supposed to give additional advice?



          • #6
            Hi Robert,

            Thank you for your suggestion in #3! I tried to replicate your example and found some issues:
            1.
            Code:
             shp2dta using "cb_2016_us_nation_5m/cb_2016_us_nation_5m.shp", data("usa_data.dta") coor("usa_coor.dta")
             geoinpoly latitude longitude using "usa_coor.dta"
            geoinpoly did not work; it returned the error "variable latitude not found". I then renamed _X and _Y in "usa_coor.dta" to longitude and latitude, respectively, but got another error: "polygon identifier _ID already exists".

            2. I have modified the "wctBatchConfig.xml" and ran the wct-export tool from the command prompt (I use Windows), but it did not work. Below is the command (similar to the one you used).
            Code:
             /Users/anh/Documents/Downloads/wct-4.1.0/wct-export multiple_files.txt csv
            I hope that you can give some more advice on the issues above.
            Thank you!

            Anh



            • #7
              You need to first load one of the time slices into memory (it contains the variables latitude and longitude) and then call geoinpoly. You should not alter the shapefile's coordinates dataset.
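              Concretely, the order of operations is (a minimal sketch using the first time slice):
              Code:
              * load one time slice (it provides latitude and longitude), then join
              import delimited "results/temp_slice0.csv", clear
              geoinpoly latitude longitude using "usa_coor.dta"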

              Sorry but I'm not a Windows user and I do not have access to a PC so I can't help with how to get the wct-export tool to work on a PC.



              • #8
                Dear Robert, My bad. I followed your instructions but encountered the same problem (1.) as Anh Trinh. I do not quite understand what you meant by "first loading one of the time slices into memory". What should I do for that? Thanks.

                Ho-Chuan (River) Huang
                Stata 17.0, MP(4)



                • #9
                  Load the weather data for one time period (any one, it does not matter). The data in memory will contain 85,794 observations, one for each grid point's coordinates (latitude and longitude). You only need to perform the spatial join once, since all other time periods have exactly the same number of observations: they report the temperature/rainfall for the same grid points.
                  Code:
                  . import delimited results/temp_slice0.csv, clear
                  (4 vars, 85,794 obs)
                  
                  . sum
                  
                      Variable |        Obs        Mean    Std. Dev.       Min        Max
                  -------------+---------------------------------------------------------
                         value |     85,794   -8.415361    22.19474      -60.3       37.3
                      datetime |          0
                      latitude |     85,794    1.060109    57.84843     -89.75      83.25
                     longitude |     85,794    18.43654    88.00834    -179.75     179.75
                  
                  . tab datetime
                  
                             datetime |      Freq.     Percent        Cum.
                  --------------------+-----------------------------------
                  1900-01-01 00:00:00 |     85,794      100.00      100.00
                  --------------------+-----------------------------------
                                Total |     85,794      100.00
                  
                  . dataex in 1/5
                  
                  ----------------------- copy starting from the next line -----------------------
                  [CODE]
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input float value str19 datetime float(latitude longitude)
                  -35.4 "1900-01-01 00:00:00" 83.25 -43.25
                  -35.5 "1900-01-01 00:00:00" 83.25 -42.75
                  -37.8 "1900-01-01 00:00:00" 83.25 -39.75
                  -34.9 "1900-01-01 00:00:00" 83.25 -38.75
                  -35.1 "1900-01-01 00:00:00" 83.25 -38.25
                  end
                  [/CODE]
                  ------------------ copy up to and including the previous line ------------------
                  
                  Listed 5 out of 85794 observations
                  
                  . geoinpoly latitude longitude using "usa_coor.dta"
                  
                  . keep if !mi(_ID)
                  (81,396 observations deleted)
                  
                  . scatter latitude longitude, msize(vtiny)
                  
                  . graph export "us_grid.png", width(800) replace
                  (file us_grid.png written in PNG format)
                  
                  . keep latitude longitude _ID
                  
                  
                  . save "usa_grid.dta", replace
                  file usa_grid.dta saved
                  
                  .
                  and the graph is the same as the one I posted before:
                  [Image: us_grid.png]



                  • #10
                    Since the NOAA dataset I found in #3 is in a format that is difficult to work with, you are better off getting the data from http://climate.geog.udel.edu/~climate/ as suggested by William Lisowski in this parallel thread. I just posted a solution that can be used to match the full gridded dataset to all countries in the world for all time periods (115 years * 12 months). Since this would produce a very large dataset, and since the OP wanted to reduce the data to grid points that fall within the boundaries of a single country, I thought I would post a solution that does that for just a few countries.

                    The first step is to get a shapefile of country boundaries. I'm using one from http://thematicmapping.org/downloads/world_borders.php (by clicking on the "TM_WORLD_BORDERS-0.3.zip" link on that page). I use shp2dta (from SSC) to convert it into Stata datasets. I then create a subset that includes only boundaries for France, Spain, and Germany. I use the precipitation dataset (precip_2014.tar.gz, downloaded from the site mentioned in the first paragraph). Once expanded, there's one file per year from 1900 to 2014; each has one observation per grid point, and the precipitation data are in a wide layout, one column per month. Since grid points are the same in each annual dataset, you only need to perform the spatial join once; in the example, I used the coordinates from the first data year and then matched them to country boundaries using geoinpoly (from SSC).
                    Code:
                    * the shapefile for the world was downloaded from
                    * http://thematicmapping.org/downloads/world_borders.php
                    * and clicking on "TM_WORLD_BORDERS-0.3.zip"
                    shp2dta using "TM_WORLD_BORDERS-0.3/TM_WORLD_BORDERS-0.3.shp", ///
                        data("world_data.dta") coor("world_coor.dta") replace
                        
                    * create a subset that includes only France, Spain, and Germany
                    use "world_data.dta", clear
                    keep if inlist(NAME,"France", "Spain", "Germany")
                    list
                    save "FSG_data.dta", replace
                    use "world_coor.dta", clear
                    gen long obs = _n
                    merge m:1 _ID using "FSG_data.dta", keep(match) keepusing(_ID) nogen
                    sort obs
                    save "FSG_coor.dta", replace
                    
                    
                    * import weather data for one time period
                    infile lon lat p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 ///
                        using "precip_2014/precip.1900", clear
                        
                    * reduce to grid coordinates
                    keep lat lon
                    order lat lon
                    
                    * match grid coordinates to countries
                    geoinpoly lat lon using "FSG_coor.dta"
                    
                    * drop grid points that did not match the selected countries
                    drop if mi(_ID)
                    
                    * show the matching grid points
                    scatter lat lon if _ID == 65 , msize(tiny) mcolor(red) ///
                    || ///
                    scatter lat lon if _ID != 65 , msize(tiny) mcolor(blue)
                    
                    graph export "FSG_grid.png", width(800) replace
                    save "FSG_grid.dta", replace
                    The results from the list command above:
                    Code:
                    . list
                    
                         +---------------------------------------------------------------------------------------------------+
                         | FIPS   ISO2   ISO3    UN      NAME    AREA    POP2005   REGION   SUBREG~N      LON      LAT   _ID |
                         |---------------------------------------------------------------------------------------------------|
                      1. |   FR     FR    FRA   250    France   55010   60990544      150        155     2.55   46.565    65 |
                      2. |   GM     DE    DEU   276   Germany   34895   82652369      150        155    9.851    51.11    72 |
                      3. |   SP     ES    ESP   724     Spain   49904   43397491      150         39   -3.649   40.227   188 |
                         +---------------------------------------------------------------------------------------------------+
                    and the map of the matched grid points (France in red):

                    [Image: FSG_grid.png — matched grid points, France in red]



                    Now that we have a dataset of grid points ("FSG_grid.dta") that fall within the boundaries of our 3 selected countries, we can create a dataset of precipitation data in a long layout (one observation per year-month for each grid point) for these countries. The following uses filelist (from SSC) to create a dataset listing the annual raw precipitation data files. Each annual dataset is processed by runby (from SSC): the input_each_year program reads in the raw annual data and uses merge to reduce the data to grid points in the selected countries. The data is then reshaped to a long layout. The input_each_year program is called for each year in the sample (in this case, 1900 to 1910). Each time it terminates, whatever is left in memory is considered results and accumulates. When runby finishes its run, all accumulated data are left in memory.

                    Code:
                    clear all
                    
                    * get a list of all the files in the "precip_2014" subdirectory
                    filelist , dir("precip_2014")
                    
                    * extract the year from the file name
                    gen year = real(subinstr(filename,"precip.","",1))
                    assert !mi(year)
                    sum year
                    
                    * select a range of years to process
                    keep if inrange(year, 1900, 1910)
                    
                    program input_each_year
                        // copy file path and year to local macros
                        local filename = dirname + "/" + filename
                        local thisyear = year
                        
                        dis "---------------------------- processing `filename'"
                        infile lon lat p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 ///
                            using "`filename'", clear
                        gen int year = `thisyear'
                        
                        // merge each grid point to get the _ID of the matched country
                        merge 1:1 lat lon using "FSG_grid.dta", ///
                            assert(master match) keep(match) nogen
                        
                        // convert to a long layout
                        quietly reshape long p, i(lat lon) j(month)
                        order lat lon year month p
                        isid lat lon year month, sort
                        
                        // what's left at this point is considered results and accumulates
                    end
                    runby input_each_year, by(year) verbose status
                    
                    isid lat lon year month, sort
                    save "precip_FSG_monthly_grid_1900_to_1910.dta", replace



                    • #11
                      Hi Robert,
                      Thanks for your detailed instructions! I can now create a dataset of rainfall/temperature for a particular country.



                      • #12
                        Robert Picard

                        Dear Robert,

                        Just a clarification regarding the procedure above. As far as I understand, the first code block lets me identify specific countries, while the second gives me the monthly (or, with small changes, annual) data for each grid point forming a country. Then, if I want just one observation per year/country, can I simply take the average of the data reported for each grid point? Am I correct?

                        Thanks for your helpful suggestions.

                        Dario



                        • #13
                          Robert Picard

                          No, I think I am wrong... How can I aggregate the data in "precip_FSG_monthly_grid_1900_to_1910.dta" to get a single file with country names and month/year precipitation?

                          Thanks again

                          Dario



                          • #14
                            You need to take the area around each grid point into consideration and properly weight the data before using collapse on year month _ID. The procedure is the same as the one described in #15 and #18 of this parallel thread.
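                            As a rough illustration of the idea only (not the exact procedure from the linked posts): on a regular latitude-longitude grid, the area of each cell shrinks with the cosine of latitude, so a cosine weight approximates the area weighting before the collapse:
                            Code:
                            * illustrative sketch, assuming the long-layout dataset from #10;
                            * cell area on a regular grid is proportional to cos(latitude)
                            use "precip_FSG_monthly_grid_1900_to_1910.dta", clear
                            gen double w = cos(lat * _pi / 180)
                            * area-weighted average precipitation by country, year, and month
                            collapse (mean) p [aweight = w], by(_ID year month)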



                            • #15
                              So it means that I need to download a shapefile for each country in my list. Am I right?

