  • Problem With Monthly Lag Variables Not Lagging Across New Years

    Dear Statalist Masters,

    My first post here at Statalist, after having been a passive reader for several years! Glad to now be a member of the forum flock.

    To the issue: I have a problem creating lag variables. I am working with a panel dataset of around 800,000 observations covering satellite data on weather and greenness in Ethiopia for each month of the years 2000-2017. In long format, an individual observation is a pixel (identified by the variable id) in a specific month of a specific year (identified by the variable yearandmonth). A subset looks like this:


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long(id yearandmonth) float(ndvi_an chirps_an lst_day_an lst_night_an)
    10869164 200001          .  .14173031          .          .
    10869164 200002  168.16663  -.3859761          .          .
    10869164 200003  211.55554  -4.910643    110.667  14.166992
    10869164 200004  238.55554   56.76266 -113.44434 -25.722656
    10869164 200005   445.1666 -10.590395  -28.88867  -38.83301
    10869164 200006  213.16663 -19.063148  155.88867  11.444336
    10869164 200007 -289.05566  13.267944   3.833008  -46.94434
    10869164 200008  303.38867  -42.79756   2.666992  -62.66699
    10869164 200009  148.88867  -13.21985   6.764648  -56.41211
    10869164 200010  530.66675   21.98996 -145.29395  -102.5293
    10869164 200011   489.7058  .48916245  26.706055  -38.29395
    10869164 200012  261.23535   .3373172  10.941406  -49.88281
    10869164 200101  179.35303 -.50386167   8.235352  -71.41211
    10869164 200102  110.16663  -.4940401  32.529297  -52.41211
    10869164 200103   52.55554 -1.1464376   20.66699  -92.83301
    10869164 200104   91.55554 -12.991203  110.55566   54.27734
    10869164 200105 -36.833374 -17.386755  134.11133 -16.833008
    10869164 200106  -49.83337 -10.651672  -44.11133  31.444336
    10869164 200107   385.9443  20.109344  -64.16699  -62.94434
    10869164 200108  1091.3887   29.89163  -61.33301  -59.66699
    10869164 200109   870.8887 -36.556335  -7.235352   30.58789
    10869164 200110  434.66675 -11.603274 -33.293945 -23.529297
    10869164 200111   268.7058  -5.093596  -6.293945 -14.293945
    10869164 200112  277.23535  -.5618868  14.941406   25.11719
    10869164 200201  257.35303    .670423  -29.76465  -34.41211
    10869164 200202  203.16663 -.42307615   -44.4707  -20.41211
    10869164 200203  170.55554   .6062927  -76.33301  -29.83301
    10869164 200204  142.55554 -18.338364  70.555664  21.277344
    10869164 200205  176.16663 -17.451864   67.11133  14.166992
    10869164 200206  112.16663 -23.617104   84.88867  74.444336
    end
    format %tm yearandmonth
    I want to run a regression testing the effect of temperature (lst_day_an and lst_night_an) and precipitation (chirps_an) in the 6 months leading up to and including the month of observation, on greenness (ndvi_an). To do this, I want to create lag variables for temperature and precipitation for each of the previous 6 months. Here is an example of the code I use to do this:

    Code:
    /* Generate lagged anomaly variables */
    
    foreach y in 1 2 3 4 5 6 {
        gen chirps_an_lag`y' = l`y'.chirps_an
    }
    However (and here comes the issue), when I create the lag variables, they somehow don’t cross new years; that is, the lag variable does not, for instance, recognize December 2005 as a 2-month lag for February 2006. Instead it generates a missing value. This means that I only have a full set of 1-6 month lag variables for the last 6 months of every year for each pixel, which isn’t ideal. Any ideas on how to fix this?

    As always – many thanks for assistance,
    Lars

  • #2
    Welcome to Statalist.

    Your basic problem is that applying a %tm format to your yearandmonth variable does not have the effect of converting the variable to a Stata Internal Format monthly variable. You need a different approach.
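    To make the point concrete, here is a small sketch (an editorial illustration, not part of the original reply) of why the lags break at each new year, assuming the data were xtset on yearandmonth with its raw values such as 200012 and 200101:

    ```stata
    * As plain integers, December 2000 and January 2001 are 89 units apart,
    * so a one-period lag of 200101 looks for 200100, which does not exist.
    display 200101 - 200012

    * As Stata monthly dates (months since 1960m1) they are exactly 1 apart.
    display ym(2001, 1) - ym(2000, 12)

    * A %tm format only controls how such a monthly count is displayed.
    display %tm ym(2000, 12)
    ```

    The three commands should display 89, 1, and 2000m12. Within a calendar year the raw integers do differ by 1 per month, which is why the lags only went missing across year boundaries.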

    The following advice comes, with slight modification, from an earlier post by Clyde Schechter. I couldn't say it better so I'll post it in full.
    To some extent, we all struggle with date and times in Stata.

    At a very fundamental level, dates are inherently problematic because there are so many different ways of representing them in general use. We get our data sets from a variety of sources, and different sources often use different ways of representing dates. Even the same source often is inconsistent: I've seen plenty of data sets that mix-and-match dates in formats like 2jan2017, 1/2/17, and 20170102 all in the same dataset (and sometimes even in the same string variable)!

    When there are so many different ways of writing dates, and when computations with dates require a regularized, uniform approach, then necessarily the apparatus needed to navigate among the various representations is complicated. It has to be in order to have sufficient flexibility for the task. That's why it seems like Stata has a million different functions for going between different types of date representations. But that makes it hard to remember which function does what, and exactly what the syntax for each one is, even though Stata has taken a pretty systematic approach to the names and syntax it gives these functions.

    I think the fundamental thing that has to be remembered is that any calculation with dates in Stata requires the use of Stata internal format (SIF) dates, and that these SIF dates are counts of the number of time units from 1 Jan 1960 to the given date. (That is, a daily SIF date is the number of days from 1 Jan 1960; a quarterly SIF date is the number of quarters from 1 Jan 1960. A clock SIF datetime is the number of milliseconds from 1 Jan 1960 00:00:00.) So if you have a numeric variable that looks like a date to the human eye, you know right away it can't be right. It has to be a number that generates no immediate brain recognition as a date, and it must be of the right magnitude given the dates being represented and the unit of time involved.
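    As a small illustration of the SIF idea (an editorial sketch, not part of the quoted post), the three spellings of 2 January 2017 mentioned above can be translated to the same daily count with the date() function. The masks are assumptions about how each string should be read; in particular "1/2/17" is taken as month/day/year in the 2000s:

    ```stata
    * Three human-readable spellings of the same day, one daily SIF value.
    display date("2jan2017", "DMY")
    display date("1/2/17", "MD20Y")
    display date("20170102", "YMD")

    * The count starts at zero on 1 Jan 1960.
    display mdy(1, 1, 1960)

    * A %td format turns the count back into something readable.
    display %td 20821
    ```

    Each of the first three commands should display 20821, a number that looks nothing like a date, which is exactly the point being made above.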

    Chapter 24 (Working with dates and times) of the Stata User's Guide PDF is well written and has lots of examples. But it is, of course, impossible to remember all the details for long. Everyone who uses Stata with any regularity needs to read this chapter, and probably re-read it periodically as well. Fortunately, help datetime is also very well organized and has lots of internal links to help you quickly track down the right function. So if you are familiar with the general concept and have read Chapter 24 a few times, most of the time you can find what you need in the help file without too much difficulty. But I don't think even the most experienced among us can consistently handle dates without going back to the help files, and sometimes to the manual: we may get really good at handling a few specific types of date representations that come up most often in our work. When we encounter something infrequently, memory just isn't adequate.

    All of that said, Nick Cox authored a program numdate which can be obtained from SSC. It's pretty good at "looking" at both string and human-readable-numeric dates and then figuring out the appropriate transformations for you. It's not full blown artificial intelligence, but it certainly handles a wide variety of cases with relatively little effort.
    So your first step will be to install the numdate command from SSC.
    Code:
    ssc install numdate
    The following example shows how to use it with your data to create a SIF date for subsequent use.
    Code:
    . format %9.0f yearandmonth 
    
    . numdate monthly yrmo = yearandmonth, pattern(YM)
    
    . xtset id yrmo
           panel variable:  id (strongly balanced)
            time variable:  yrmo, 2000m1 to 2002m6
                    delta:  1 month
    
    . gen chirps_an_lag = l.chirps_an
    (1 missing value generated)
    
    . list yearandmonth yrmo chirps_an chirps_an_lag in 7/18, clean abbreviate(20)
    
           yearandmonth      yrmo   chirps_an   chirps_an_lag  
      7.         200007    2000m7    13.26794       -19.06315  
      8.         200008    2000m8   -42.79756        13.26794  
      9.         200009    2000m9   -13.21985       -42.79756  
     10.         200010   2000m10    21.98996       -13.21985  
     11.         200011   2000m11    .4891624        21.98996  
     12.         200012   2000m12    .3373172        .4891624  
     13.         200101    2001m1   -.5038617        .3373172  
     14.         200102    2001m2   -.4940401       -.5038617  
     15.         200103    2001m3   -1.146438       -.4940401  
     16.         200104    2001m4    -12.9912       -1.146438  
     17.         200105    2001m5   -17.38675        -12.9912  
     18.         200106    2001m6   -10.65167       -17.38675
    However, in point of fact, you do not need to create lagged variables.
    Code:
    regress ndvi_an L(0/6).chirps_an ...
    will include seven variables without creating them separately: the month of the observation and the six months leading up to it. This is the preferred method; see the output of help tsvarlist for more details.
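
    For readers who would rather not install anything, the same steps can be sketched with built-in functions only. This is an editorial illustration under stated assumptions: yearandmonth holds integers of the form year*100 + month (as in the dataex listing), ym_sif is a hypothetical variable name, and the three-variable regression is only a guess at the intended specification, meant for the full dataset rather than the 30-row excerpt.

    ```stata
    * Build a true monthly SIF date from the year*100 + month integers.
    * ym_sif is a hypothetical name, not one used in the thread.
    generate ym_sif = ym(floor(yearandmonth/100), mod(yearandmonth, 100))
    format ym_sif %tm
    xtset id ym_sif

    * Explicit lag variables, if they are wanted for other purposes.
    * (Drop any chirps_an_lag1-chirps_an_lag6 created earlier first.)
    forvalues k = 1/6 {
        generate chirps_an_lag`k' = L`k'.chirps_an
    }

    * Or let the regression build the current value and six lags on the fly.
    regress ndvi_an L(0/6).chirps_an L(0/6).lst_day_an L(0/6).lst_night_an
    ```

    With the time variable stored as a monthly count, L1 of 2001m1 is 2000m12, so the lags no longer go missing at the turn of the year.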


    • #3
      Hi William,

      I never got back to you about this. Thank you very much for the advice and the code. It worked perfectly.

      Yours,
      Lars
