A "rolling" merge

Frank Erickson

Join Date: Jan 2015
Posts: 57

14 Jul 2016, 13:15

I have a variable measured at different times and locations, and some events for which I want the most recent observation of this variable:

Code:

** geo vars

clear
input /*start*/ str30 loc int value int year
    n           3       2013
    n           4       2014
    m           7       2013
    m           8       2014
end

save "geo_vars.dta", replace

** events

clear
input /*start*/ str30 e_id str30 loc int year
    A           n           2011
    B           m           2013
    C           m           2015
end

save "event.dta", replace

I want to find, for each event, the most recent measurement in its location at its time:

Code:

clear
input /*start*/ str30 e_id int value
    A           .
    B           7
    C           8
end

Event A has missing data because there is no measurement in 2011 or before; B has its value taken from the corresponding row of geo_vars; and C has the most recent measurement to 2015 (from 2014).

In R, I can do this with a "rolling join". The author of the R package with that functionality made it sound routine in time series work, so I figure maybe it is available in Stata already but I just haven't found it...? Anyway, for illustration, here's the example in R:

Code:

## note: this is R code, not Stata

library(data.table)
geo_vars = fread("loc value year
    n           3       2013
    n           4       2014
    m           7       2013
    m           8       2014")

event = fread("e_id loc year
    A           n           2011
    B           m           2013
    C           m           2015")

geo_vars[event, on=c("loc", "year"), roll=TRUE]
# result:
#    loc value year e_id
# 1:   n    NA 2011    A
# 2:   m     7 2013    B
# 3:   m     8 2015    C

My backup plan is to take advantage of the fact that my measurements are in years and just `expand` the data to fill years for which I don't have measurements before merging. Annoyingly, I have multiple variables with different years for their most recent observation:

Code:

clear
input /*start*/ str30 loc str30 varname int value int year
    n           inc         1       2011
    n           inc         2       2012
    n           pop         3       2013
    n           pop         4       2014
    m           inc         5       2011
    m           inc         6       2012
    m           pop         7       2013
    m           pop         8       2014
end

This plan for expanding the data just for merging seems kludgy and will be memory-intensive if my data is measured at the day level or more finely. Thanks for any pointers.

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#2

14 Jul 2016, 13:54

Well, until I saw the part about multiple measurements in different years, it looked rather easy:

Code:

clear input /*start*/ str30 loc int value int year n 3 2013 n 4 2014 m 7 2013 m 8 2014 end tempfile geo_vars rename year geo_vars_year save `geo_vars', replace ** events clear input /*start*/ str30 e_id str30 loc int year A n 2011 B m 2013 C m 2015 end tempfile event save `event', replace use `event', clear joinby loc using `geo_vars', unmatched(master) drop if geo_vars_year > year & !missing(geo_vars_year) // ELIMINATE ANY MEASUREMENTS AFTER EVENT by e_id (geo_vars_year), sort: keep if _n == _N // KEEP MOST RECENT

The fact that you have multiple measures to deal with makes it more complicated. But I think the simplest way to deal with that is to break that measurement data set up into separate files for each measure and just repeat the above separately for each. That is, instead of a single geo_vars file, have one file for each measurement, and do the same thing with it as I showed above.

Depending on the size of the data sets involved, the -joinby- command can be slow and chew up a lot of memory. A faster alternative is the new -rangejoin- command written by Robert Picard (available from SSC). But to use it, you would have to set some lower bound for the measurement years that could be paired with a given event's year. If that can be done efficiently, -joinby- will be much faster and use a lot less memory. You could, of course, choose that lower bound to be the earliest year instantiated in the entire measurements data set--but if that data set goes a long way back, it won't help that much.
1 like
Comment

Frank Erickson

Join Date: Jan 2015
Posts: 57

14 Jul 2016, 15:14

Thanks for the pointer to rangejoin, Clyde Schechter . I think I can safely set a lower bound (year-10 below) and so apply it here, even with multiple measures. In the example, it is:

Code:

clear
input /*start*/ str30 loc str30 varname int value int year
    n           inc         1       2011
    n           inc         2       2012
    n           pop         3       2013
    n           pop         4       2014
    m           inc         5       2011
    m           inc         6       2012
    m           pop         7       2013
    m           pop         8       2014
end
save "geo_vars.dta", replace

clear
input /*start*/ str30 e_id str30 loc int year
    A           n           2011
    B           m           2013
    C           m           2015
end
save "event.dta", replace

use "geo_vars.dta", clear
keep varname
duplicates drop
cross using "event.dta"

gen lo_year = year - 10
rangejoin year lo_year year using "geo_vars.dta", by(loc varname)

bys e_id varname (year_U): keep if _n == _N
drop lo_year

which gives

Code:

     +----------------------------------------------+
     | varname   e_id   loc   year   value   year_U |
     |----------------------------------------------|
  1. |     inc      A     n   2011       1     2011 |
  2. |     pop      A     n   2011       .        . |
  3. |     inc      B     m   2013       6     2012 |
  4. |     pop      B     m   2013       7     2013 |
  5. |     inc      C     m   2015       6     2012 |
     |----------------------------------------------|
  6. |     pop      C     m   2015       8     2014 |
     +----------------------------------------------+

Announcement

A "rolling" merge

Comment

Comment