Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • A "rolling" merge

    I have a variable measured at different times and locations, and some events for which I want the most recent observation of this variable:

    Code:
    ** geo vars
    
    clear
    input /*start*/ str30 loc int value int year
        n           3       2013
        n           4       2014
        m           7       2013
        m           8       2014
    end
    
    save "geo_vars.dta", replace
    
    ** events
    
    clear
    input /*start*/ str30 e_id str30 loc int year
        A           n           2011
        B           m           2013
        C           m           2015
    end
    
    save "event.dta", replace
    I want to find, for each event, the most recent measurement in its location at its time:

    Code:
    clear
    input /*start*/ str30 e_id int value
        A           .
        B           7
        C           8
    end
    Event A has missing data because there is no measurement in 2011 or before; B has its value taken from the corresponding row of geo_vars; and C has the most recent measurement to 2015 (from 2014).

    In R, I can do this with a "rolling join". The author of the R package with that functionality made it sound routine in time series work, so I figure maybe it is available in Stata already but I just haven't found it...? Anyway, for illustration, here's the example in R:

    Code:
    ## note: this is R code, not Stata
    
    library(data.table)
    geo_vars = fread("loc value year
        n           3       2013
        n           4       2014
        m           7       2013
        m           8       2014")
    
    event = fread("e_id loc year
        A           n           2011
        B           m           2013
        C           m           2015")
    
    geo_vars[event, on=c("loc", "year"), roll=TRUE]
    # result:
    #    loc value year e_id
    # 1:   n    NA 2011    A
    # 2:   m     7 2013    B
    # 3:   m     8 2015    C
    My backup plan is to take advantage of the fact that my measurements are in years and just `expand` the data to fill years for which I don't have measurements before merging. Annoyingly, I have multiple variables with different years for their most recent observation:

    Code:
    clear
    input /*start*/ str30 loc str30 varname int value int year
        n           inc         1       2011
        n           inc         2       2012
        n           pop         3       2013
        n           pop         4       2014
        m           inc         5       2011
        m           inc         6       2012
        m           pop         7       2013
        m           pop         8       2014
    end
    This plan for expanding the data just for merging seems kludgy and will be memory-intensive if my data is measured at the day level or more finely. Thanks for any pointers.

  • #2
    Well, until I saw the part about multiple measurements in different years, it looked rather easy:

    Code:
    clear
    input /*start*/ str30 loc int value int year
        n           3       2013
        n           4       2014
        m           7       2013
        m           8       2014
    end
    tempfile geo_vars
    rename year geo_vars_year
    save `geo_vars', replace
    
    ** events
    
    clear
    input /*start*/ str30 e_id str30 loc int year
        A           n           2011
        B           m           2013
        C           m           2015
    end
    
    tempfile event
    save `event', replace
    
    use `event', clear
    joinby loc using `geo_vars', unmatched(master)
    drop if geo_vars_year > year & !missing(geo_vars_year)  // ELIMINATE ANY MEASUREMENTS AFTER EVENT
    by e_id (geo_vars_year), sort: keep if _n == _N // KEEP MOST RECENT
    The fact that you have multiple measures to deal with makes it more complicated. But I think the simplest way to deal with that is to break that measurement data set up into separate files for each measure and just repeat the above separately for each. That is, instead of a single geo_vars file, have one file for each measurement, and do the same thing with it as I showed above.

    Depending on the size of the data sets involved, the -joinby- command can be slow and chew up a lot of memory. A faster alternative is the new -rangejoin- command written by Robert Picard (available from SSC). But to use it, you would have to set some lower bound for the measurement years that could be paired with a given event's year. If that can be done efficiently, -joinby- will be much faster and use a lot less memory. You could, of course, choose that lower bound to be the earliest year instantiated in the entire measurements data set--but if that data set goes a long way back, it won't help that much.


    Comment


    • #3
      Thanks for the pointer to rangejoin, Clyde Schechter . I think I can safely set a lower bound (year-10 below) and so apply it here, even with multiple measures. In the example, it is:

      Code:
      clear
      input /*start*/ str30 loc str30 varname int value int year
          n           inc         1       2011
          n           inc         2       2012
          n           pop         3       2013
          n           pop         4       2014
          m           inc         5       2011
          m           inc         6       2012
          m           pop         7       2013
          m           pop         8       2014
      end
      save "geo_vars.dta", replace
      
      clear
      input /*start*/ str30 e_id str30 loc int year
          A           n           2011
          B           m           2013
          C           m           2015
      end
      save "event.dta", replace
      
      use "geo_vars.dta", clear
      keep varname
      duplicates drop
      cross using "event.dta"
      
      gen lo_year = year - 10
      rangejoin year lo_year year using "geo_vars.dta", by(loc varname)
      
      bys e_id varname (year_U): keep if _n == _N
      drop lo_year
      which gives

      Code:
           +----------------------------------------------+
           | varname   e_id   loc   year   value   year_U |
           |----------------------------------------------|
        1. |     inc      A     n   2011       1     2011 |
        2. |     pop      A     n   2011       .        . |
        3. |     inc      B     m   2013       6     2012 |
        4. |     pop      B     m   2013       7     2013 |
        5. |     inc      C     m   2015       6     2012 |
           |----------------------------------------------|
        6. |     pop      C     m   2015       8     2014 |
           +----------------------------------------------+

      Comment

      Working...
      X