Accessing the group currently active from "by"

Robert Picard

Join Date: Mar 2014

Posts: 1536
#16

20 Feb 2017, 11:05

If the location of each school is determined by geographic coordinates (latitude and longitude), then geonear (from SSC) can quickly find out, in long form, the list of schools within a specific radius. It seems to me that schools do not physically move so I do not see why the list of nearby school would have to be recalculated yearly. Once you have a list of nearby schools, all you need is to merge the annual enrollment for each of these nearby schools and calculate the total enrollment of nearby schools directly, no loops required.

Another approach to a nearest neighbor problem is to use the brute force approach. You form all pairwise combination of schools (per year) and then compute distance and then the desired statistics by group. Per year, this can be done using cross; if the problem is not too large, you can use joinby to do it for all years at once.

The post referenced in #10 includes a data example where the distances have already been computed. The problem is that the distances are recorded in wide form and I assume that there are as many distance variables as there are schools. The example can be morphed into the brute force approach explained in the previous paragraph by reshaping the data to long form. Unfortunately, the reshape command can be quite slow, particularly when reshaping to long.

The following solution uses a more efficient technique to reshape to long. Once you have all pairwise combination of schools per year in long form, you merge enrollment data for nearby schools and then add up enrollment for schools within 25km. A running sum() is more efficient than egen total(). The enrollment of the current school is removed at the end.

Code:

* example from http://www.statalist.org/forums/forum/general-stata-discussion/general/1308490-complex-matching-on-rows-and-collumms clear input float(school_code year kids dist_to_1 dist_to_2 dist_to_3 kids_in_nearby_schools) 1 1 34 0 20 40 . 1 2 42 0 20 40 . 1 3 21 0 20 40 . 2 1 11 20 0 20 . 2 2 23 20 0 20 . 2 3 31 20 0 20 . 3 1 17 40 20 0 . 3 2 19 40 20 0 . 3 3 36 40 20 0 . end isid year school_code, sort list, sepby(year) tempfile master save "`master'" * variables to reshape to long unab vlong: dist_to_* * data on number of kids per school for merging later on keep school_code year kids rename (school_code kids) (school_near kids_near) tempfile kids save "`kids'" * reshape to long; use a more efficient method than -reshape- local n 0 foreach v of local vlong { use school_code year kids `v' using "`master'", clear rename `v' dist_near local id = subinstr("`v'", "dist_to_","", 1) gen school_near = `id' local n = `n' + 1 tempfile hold`n' save "`hold`n''" } clear forvalues i = 1 / `n' { dis "hold`i'" append using "`hold`i''" } * merge kid counts for nearby schools merge m:1 year school_near using "`kids'", assert(match) nogen * add up enrollment for neighbor schools using a running sum * and keep the last observation isid year school_code school_near, sort by year school_code: gen wanted = sum(kids_near / (dist_near <= 25)) by year school_code: keep if _n == _N replace wanted = wanted - kids isid school_code year, sort list, sepby(school_code)
1 like
Comment
Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#17

20 Feb 2017, 11:06

Nick - 10 years, ~3000 cities
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#18

20 Feb 2017, 14:30

If you have about 3000 schools and 10 years of data, you can still compute within a reasonable period of time the solution using a brute force approach. Here's an example using fake data. To make it realistic, I assume that some schools do not exist at the start of the data and some schools close before the end of the data window. School location does not change. I use geodist (from SSC) to compute distances. The following example runs in about 5 minutes on my computer.

Code:

clear all
set seed 31231
set obs 3000
gen long school = _n
gen double lat = runiformint(0,5) + runiform()
gen double lon = runiformint(0,5) + runiform()
save "school_locations.dta", replace

* 10 years of school enrollment data, some schools have shorter spans
use "school_locations.dta"
gen enroll = runiformint(50,1000)
gen year1 = runiformint(1900, 2005)
replace year1 = 2001 if year1 < 2001
gen year2 = cond(runiform() < .01, runiformint(2006, 2009), 2010)
gen span = year2 - year1 + 1
tab span
expand span
bysort school: gen year = year1 + _n - 1
bysort school: replace enroll = int(enroll + .1 * runiform() * enroll)
keep school enroll year lat lon
isid year school, sort
save "enroll.dta", replace

timer on 1

* form all pairwise combinations of schools per year
rename (school lat lon enroll) (school0 lat0 lon0 enroll0) 
joinby year using "enroll.dta"

* calculate the distance and reduce to schools within 25km
geodist lat0 lon0 lat lon, gen(d) sphere
drop if d > 25

replace enroll = . if school == school0
bysort year school0 (d school): egen enroll_near = total(enroll)
by year school0: keep if _n == 1
keep school0 lat0 lon0 enroll0 year enroll_near
rename (school0 lat0 lon0 enroll0) (school lat lon enroll)
save "joinby.dta", replace

timer off 1
timer list

If you have more schools, the time to perform the joinby will become unpalatable. In that case, you can use geonear (from SSC) to find the neighbor schools within 25km. The only difficulty is that the current version of geonear does not support a by() option, which means that you have to perform the computations per year. That requires a bit more data management gymnastics but since geonear is super fast, you get there much faster and you can tackle problems that are significantly bigger than the one at hand. Here's how to get exactly the same results as above using geonear. The following runs in a little over 7 seconds on my computer.

Code:

clear
set seed 31231
set obs 3000
gen long school = _n
gen double lat = runiformint(0,5) + runiform()
gen double lon = runiformint(0,5) + runiform()
save "school_locations.dta", replace

* 10 years of school enrollment data, some schools have shorter spans
use "school_locations.dta"
gen enroll = runiformint(50,1000)
gen year1 = runiformint(1900, 2005)
replace year1 = 2001 if year1 < 2001
gen year2 = cond(runiform() < .01, runiformint(2006, 2009), 2010)
gen span = year2 - year1 + 1
tab span
expand span
bysort school: gen year = year1 + _n - 1
bysort school: replace enroll = int(enroll + .1 * runiform() * enroll)
keep school enroll year lat lon
isid year school, sort
save "enroll.dta", replace

clear all
timer on 1

* find all schools within 25km per year
forvalues i = 2001/2010 {
    use "enroll.dta" if year == `i'
    tempfile annual
    save "`annual'"
    rename school school0
    geonear school0 lat lon using "`annual'", ///
        neighbors(school lat lon) within(25) long
    gen year = `i'
    merge m:1 year school using "`annual'", assert(match using) ///
        keep(match) keepusing(enroll)
    replace enroll = . if school0 == school
    bysort school0 (km_to_school school): egen enroll_near = total(enroll)
    by school0: keep if _n == 1
    keep year school0 enroll_near
    rename school0 school
    save "enroll_near_`i'.dta", replace
}

clear
forvalues i = 2001/2010 {
    append using "enroll_near_`i'.dta"
}

merge 1:1 school year using "enroll.dta", assert(match) nogen
isid year school, sort
save "geonear.dta", replace

timer off 1
timer list

cf _all using "joinby.dta", all

Comment

Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#19

21 Feb 2017, 06:02

Thanks Robert! though as I mentioned earlier, splitting the data by year worked just as well, and all the calculations were done in about 1.5 hours.
A few questions regarding geodist and geonear though:
since distances (rather than locations) are already given, is there anyway to use geodist or geonear?
if using geodist or geonear in a panel data (for example city-year) the distance variable wouldn't be "km_to_" making it "wide" thus an end result similar to the example I posted in the previous thread with the schools?
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#20

21 Feb 2017, 08:05

I thought from your referenced post in #8 that you were trying to sum up enrollment in nearby schools. I showed you several ways to get there, including one that starts from your already computed distances in wide form. The most efficient approach is to use geonear because it is orders of magnitude faster at solving nearest neighbor problems than any other approach. If you are looking for nearby schools to a school in New York, there's no need to compute distances to schools in California. That's what geonear does and that's why it's so fast. I would think that 7 seconds is better than 1.5 hours but that's your time to waste.

Sorry but I do not understand your additional questions.
Comment

Announcement

Comment

Comment

Comment

Comment

Comment