Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Transforming raw data based on coordinates into a wide/long format in Stata

    Dear Stata users,

    I have a raw database of climatic data at a regional level that can be found here:

    https://crudata.uea.ac.uk/cru/projec.../recon/#paul05

    I would like to have this data in a Stata wide/long format. I will provide a very basic example of how the data look like and what I would like to accomplish.

    Say the data in raw format look like this:

    Code:
    clear all
    set more off
    
    input ///
    var1 var2 var3 var4
    2000    1    . .    
    0.09    0.40    0.53    0.96
    0.91    0.88    0.74    0.50
    0.68    0.14    0.44    0.29
    0.45    0.65    0.20    0.68
    2001    1 . .        
    0.48    0.08    0.66    0.72
    0.04    0.66    0.91    0.66
    0.79    0.49    0.32    0.26
    0.49    0.81    0.54    0.31
    end
    In this, the first line has information about the year (cell 1, year 2000) and the season (cell 2, season 1). The lines from 2 to 5 have information about specific coordinates. That is, each cell is a region based on coordinates. For example, the second line is for latitude 55.75, the third line for 55.25, the fourth line for 54.75, and the fifth line for 54.25. Furthermore, each column represents a longitude point. Say the first column is for 30.75, the second for 30.25, the third for 29.75, and the fourth for 29.25. Hence, the cell in line 2 and column 1 is the region with coordinates (55.75, 30.75) in the year 2000 (season 1) with the value 0.09.

    The whole process repeats itself. Thus, line 6 introduces another year (year 2001 and again season 1) and the coordinates are the same as before. With coordinates included the data would look like this:

    Code:
    clear all
    set more off
    
    input ///
    A B C D E
    .    30.75    30.25    29.75    29.25
    .    2000    1 . .        
    55.75    0.09    0.40    0.53    0.96
    55.25    0.91    0.88    0.74    0.50
    54.75    0.68    0.14    0.44    0.29
    54.25    0.45    0.65    0.20    0.68
    .    2001    1 . .        
    55.75    0.48    0.08    0.66    0.72
    55.25    0.04    0.66    0.91    0.66
    54.75    0.79    0.49    0.32    0.26
    54.25    0.49    0.81    0.54    0.31
    end
    The aim is to transform the raw data in a friendlier format (such as a wide, or long format, so that they can be used as an input in a software such as ArcGis. I was thinking something like this.

    Code:
    clear all
    set more off
    
    input ///
    Latitude    Longitude    year2000    year2001
    55.75    30.75    0.09    0.48
    55.75    30.25    0.40    0.08
    55.75    29.75    0.53    0.66
    55.75    29.25    0.96    0.72
    55.25    30.75    0.91    0.04
    55.25    30.25    0.88    0.66
    55.25    29.75    0.74    0.91
    55.25    29.25    0.50    0.66
    54.75    30.75    0.68    0.79
    54.75    30.25    0.14    0.49
    54.75    29.75    0.44    0.32
    54.75    29.25    0.29    0.26
    54.25    30.75    0.45    0.49
    54.25    30.25    0.65    0.81
    54.25    29.75    0.20    0.54
    54.25    29.25    0.68    0.31
    
    end
    There is a pattern in regions repeating themselves, so I would assume that one could obtain a friendlier version of the data.
    I would be very grateful if people who have used this data before, or have dealt with a similar problem in the past have some advice to offer.

    Thank you for your time.
    Last edited by Pantelis Kazakis; 05 Apr 2017, 03:50.

  • #2
    So this is typical messy data from a spreadsheet. To get what you want takes several steps. None of them is complicated, but they do involve a few simple tricks. I also note that the final result you are looking for seems to discard the season variable. I'll assume you didn't really mean that. If you did, you can always -drop season- at the end.

    Code:
    clear all
    input ///
    A B C D E
    .    30.75    30.25    29.75    29.25
    .    2000    1 . .        
    55.75    0.09    0.40    0.53    0.96
    55.25    0.91    0.88    0.74    0.50
    54.75    0.68    0.14    0.44    0.29
    54.25    0.45    0.65    0.20    0.68
    .    2001    1 . .        
    55.75    0.48    0.08    0.66    0.72
    55.25    0.04    0.66    0.91    0.66
    54.75    0.79    0.49    0.32    0.26
    54.25    0.49    0.81    0.54    0.31
    end
    
    //    CAPTURE CURRENT SORT ORDER
    gen long obs_no = _n
    
    //    RENAME VARIABLES
    rename A latitude
    foreach v of varlist B C D E  {
        local longitude = string(`v'[1], "%3.2f")
        local longitude: subinstr local longitude "." "_"
        rename `v' var_`longitude'
    }
    drop in 1
    
    //    CREATE SEPARATE VARIABLES FOR YEAR & SEASON
    //    BY COPYING FROM THEIR CURRENT LOCATIONS
    //    AND SPREADING DOWN
    gen year = var_30_75 if mod(_n, 5) == 1
    gen season = var_30_25 if mod(_n, 5) == 1
    sort obs_no
    replace year = year[_n-1] if missing(year)
    replace season = season[_n-1] if missing(season)
    
    //    ELIMINATE OBSERVATIONS CONTAINING THE ORIGINAL
    //    YEAR AND SEASON IN THE WRONG VARIABLES
    drop if mod(_n, 5) == 1
    
    //    RESHAPE LONG
    reshape long var_, i(obs_no) j(longitude) string
    
    //    FIX UP THE LONGITUDE VARIABLE
    replace longitude = subinstr(longitude, "_", ".", .)
    destring longitude, replace
    
    //    FINAL CLEAN-UP
    order year season, first
    drop obs_no
    
    //    RESHAPE WIDE BY YEAR IF YOU REALLY MUST
    reshape wide var_, i(season longitude latitude) j(year)
    That done, I wonder whether you really should do the final -reshape wide-. Most of Stata's commands work with data in fully long layout, and wide layouts like this generally present an obstacle to analysis. Now, it may be that for working with ArcGIS you need it this way--I'm not an ArcGIS user and I don't know. But if you plan further analysis in Stata, the final -reshape- is something you will probably regret.

    Comment


    • #3
      I went to the web site and picked up the winter dataset. Here's how I would convert the data:

      Code:
      * this is the winter data, with season == 13
      import delimited using prec_pauling_wi.txt, clear
      
      * no data in last variable
      assert mi(v141)
      drop v141
      
      * carry over the data year for season 13
      gen int year = v1 if v2 == 13 & mi(v3)
      replace year = year[_n-1] if mi(year)
      drop if v2 == 13 & mi(v3)
      
      * rows represent latitudes, from 70.75N down to 30.25N in .5 degree increments
      gen long obs = _n
      bysort year (obs): gen latitude = 70.75 - (_n - 1) * .5
      by year: assert _N == 82
      
      * reshape to long, use a faster method than -reshape-
      save "master.dta", replace
      
      forvalues j=1/140 {
          use year obs lat v`j' using "master.dta", clear
          rename v`j' measure
          gen j = `j'
          tempfile hold`j'
          save "`hold`j''"
      }
      clear
      forvalues j=1/140 {
          append using "`hold`j''"
      }
      sort year obs j
      
      * longitudes range from 29.75W-39.75E in .5 degree increments
      gen longitude = -29.75 + (j-1) * .5
      drop j
      
      sum
      You just reuse the code to input the other seasons, just change the name of the input file and the occurrences of 13 to 14, 15, and 16 (spring, summer, autumn respectively).
      Last edited by Robert Picard; 05 Apr 2017, 09:25.

      Comment


      • #4
        Thanks Clyde, your code works perfectly once you suppose that the first line and first column have info about coordinates. In fact they do not have (I just made an example before to show you guys what I mean). What I have is only this (the data can be found in the link I provided in the first post):

        Code:
        clear all
        set more off  
        input ///
        var1 var2 var3 var4
        2000    1    . .    
        0.09    0.40    0.53    0.96
        0.91    0.88    0.74    0.50
        0.68    0.14    0.44    0.29
        0.45    0.65    0.20    0.68
        2001    1 . .        
        0.48    0.08    0.66    0.72
        0.04    0.66    0.91    0.66
        0.79    0.49    0.32    0.26
        0.49    0.81    0.54    0.31
        end
        I would prefer the final sample to be in long format, not wide as I want to get some statistics, such as average and standard deviation. I was just asking for a better way for the data to be presented, so I could do the rest of the job myself.

        Once I have the data in a wide or long format, I could add the coordinates myself and then do the rest of the analysis.

        In fact the season is not needed, as there are different files that have the respective information.
        Last edited by Pantelis Kazakis; 05 Apr 2017, 09:26.

        Comment


        • #5
          Dear Robert, much obliged!

          Comment

          Working...
          X