Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    If the location of each school is determined by geographic coordinates (latitude and longitude), then geonear (from SSC) can quickly find out, in long form, the list of schools within a specific radius. It seems to me that schools do not physically move so I do not see why the list of nearby school would have to be recalculated yearly. Once you have a list of nearby schools, all you need is to merge the annual enrollment for each of these nearby schools and calculate the total enrollment of nearby schools directly, no loops required.

    Another approach to a nearest neighbor problem is to use the brute force approach. You form all pairwise combination of schools (per year) and then compute distance and then the desired statistics by group. Per year, this can be done using cross; if the problem is not too large, you can use joinby to do it for all years at once.

    The post referenced in #10 includes a data example where the distances have already been computed. The problem is that the distances are recorded in wide form and I assume that there are as many distance variables as there are schools. The example can be morphed into the brute force approach explained in the previous paragraph by reshaping the data to long form. Unfortunately, the reshape command can be quite slow, particularly when reshaping to long.

    The following solution uses a more efficient technique to reshape to long. Once you have all pairwise combination of schools per year in long form, you merge enrollment data for nearby schools and then add up enrollment for schools within 25km. A running sum() is more efficient than egen total(). The enrollment of the current school is removed at the end.

    Code:
    * example from http://www.statalist.org/forums/forum/general-stata-discussion/general/1308490-complex-matching-on-rows-and-collumms
    clear
    input float(school_code year kids dist_to_1 dist_to_2 dist_to_3 kids_in_nearby_schools)
    1 1 34 0 20 40 . 
    1 2 42 0 20 40 . 
    1 3 21 0 20 40 . 
    2 1 11 20 0 20 . 
    2 2 23 20 0 20 . 
    2 3 31 20 0 20 . 
    3 1 17 40 20 0 . 
    3 2 19 40 20 0 . 
    3 3 36 40 20 0 . 
    end
    isid year school_code, sort
    list, sepby(year)
    tempfile master
    save "`master'"
    
    * variables to reshape to long
    unab vlong: dist_to_*
    
    * data on number of kids per school for merging later on
    keep school_code year kids
    rename (school_code kids) (school_near kids_near)
    tempfile kids
    save "`kids'"
    
    * reshape to long; use a more efficient method than -reshape-
    local n 0
    foreach v of local vlong {
        use school_code year kids `v' using "`master'", clear
        rename `v' dist_near
        local id = subinstr("`v'", "dist_to_","", 1)
        gen school_near = `id'
        local n = `n' + 1
        tempfile hold`n'
        save "`hold`n''"
    }
    clear
    forvalues i = 1 / `n' {
        dis "hold`i'"
        append using "`hold`i''"
    }
    
    * merge kid counts for nearby schools
    merge m:1 year school_near using "`kids'", assert(match) nogen
    
    * add up enrollment for neighbor schools using a running sum
    * and keep the last observation
    isid year school_code school_near, sort
    by year school_code: gen wanted = sum(kids_near / (dist_near <= 25))
    by year school_code: keep if _n == _N
    replace wanted = wanted - kids
    
    isid school_code year, sort
    list, sepby(school_code)

    Comment


    • #17
      Nick - 10 years, ~3000 cities

      Comment


      • #18
        If you have about 3000 schools and 10 years of data, you can still compute within a reasonable period of time the solution using a brute force approach. Here's an example using fake data. To make it realistic, I assume that some schools do not exist at the start of the data and some schools close before the end of the data window. School location does not change. I use geodist (from SSC) to compute distances. The following example runs in about 5 minutes on my computer.

        Code:
        clear all
        set seed 31231
        set obs 3000
        gen long school = _n
        gen double lat = runiformint(0,5) + runiform()
        gen double lon = runiformint(0,5) + runiform()
        save "school_locations.dta", replace
        
        * 10 years of school enrollment data, some schools have shorter spans
        use "school_locations.dta"
        gen enroll = runiformint(50,1000)
        gen year1 = runiformint(1900, 2005)
        replace year1 = 2001 if year1 < 2001
        gen year2 = cond(runiform() < .01, runiformint(2006, 2009), 2010)
        gen span = year2 - year1 + 1
        tab span
        expand span
        bysort school: gen year = year1 + _n - 1
        bysort school: replace enroll = int(enroll + .1 * runiform() * enroll)
        keep school enroll year lat lon
        isid year school, sort
        save "enroll.dta", replace
        
        timer on 1
        
        * form all pairwise combinations of schools per year
        rename (school lat lon enroll) (school0 lat0 lon0 enroll0) 
        joinby year using "enroll.dta"
        
        * calculate the distance and reduce to schools within 25km
        geodist lat0 lon0 lat lon, gen(d) sphere
        drop if d > 25
        
        replace enroll = . if school == school0
        bysort year school0 (d school): egen enroll_near = total(enroll)
        by year school0: keep if _n == 1
        keep school0 lat0 lon0 enroll0 year enroll_near
        rename (school0 lat0 lon0 enroll0) (school lat lon enroll)
        save "joinby.dta", replace
        
        timer off 1
        timer list
        If you have more schools, the time to perform the joinby will become unpalatable. In that case, you can use geonear (from SSC) to find the neighbor schools within 25km. The only difficulty is that the current version of geonear does not support a by() option, which means that you have to perform the computations per year. That requires a bit more data management gymnastics but since geonear is super fast, you get there much faster and you can tackle problems that are significantly bigger than the one at hand. Here's how to get exactly the same results as above using geonear. The following runs in a little over 7 seconds on my computer.

        Code:
        clear
        set seed 31231
        set obs 3000
        gen long school = _n
        gen double lat = runiformint(0,5) + runiform()
        gen double lon = runiformint(0,5) + runiform()
        save "school_locations.dta", replace
        
        * 10 years of school enrollment data, some schools have shorter spans
        use "school_locations.dta"
        gen enroll = runiformint(50,1000)
        gen year1 = runiformint(1900, 2005)
        replace year1 = 2001 if year1 < 2001
        gen year2 = cond(runiform() < .01, runiformint(2006, 2009), 2010)
        gen span = year2 - year1 + 1
        tab span
        expand span
        bysort school: gen year = year1 + _n - 1
        bysort school: replace enroll = int(enroll + .1 * runiform() * enroll)
        keep school enroll year lat lon
        isid year school, sort
        save "enroll.dta", replace
        
        clear all
        timer on 1
        
        * find all schools within 25km per year
        forvalues i = 2001/2010 {
            use "enroll.dta" if year == `i'
            tempfile annual
            save "`annual'"
            rename school school0
            geonear school0 lat lon using "`annual'", ///
                neighbors(school lat lon) within(25) long
            gen year = `i'
            merge m:1 year school using "`annual'", assert(match using) ///
                keep(match) keepusing(enroll)
            replace enroll = . if school0 == school
            bysort school0 (km_to_school school): egen enroll_near = total(enroll)
            by school0: keep if _n == 1
            keep year school0 enroll_near
            rename school0 school
            save "enroll_near_`i'.dta", replace
        }
        
        clear
        forvalues i = 2001/2010 {
            append using "enroll_near_`i'.dta"
        }
        
        merge 1:1 school year using "enroll.dta", assert(match) nogen
        isid year school, sort
        save "geonear.dta", replace
        
        timer off 1
        timer list
        
        cf _all using "joinby.dta", all

        Comment


        • #19
          Thanks Robert! though as I mentioned earlier, splitting the data by year worked just as well, and all the calculations were done in about 1.5 hours.
          A few questions regarding geodist and geonear though:
          since distances (rather than locations) are already given, is there anyway to use geodist or geonear?
          if using geodist or geonear in a panel data (for example city-year) the distance variable wouldn't be "km_to_" making it "wide" thus an end result similar to the example I posted in the previous thread with the schools?

          Comment


          • #20
            I thought from your referenced post in #8 that you were trying to sum up enrollment in nearby schools. I showed you several ways to get there, including one that starts from your already computed distances in wide form. The most efficient approach is to use geonear because it is orders of magnitude faster at solving nearest neighbor problems than any other approach. If you are looking for nearby schools to a school in New York, there's no need to compute distances to schools in California. That's what geonear does and that's why it's so fast. I would think that 7 seconds is better than 1.5 hours but that's your time to waste.

            Sorry but I do not understand your additional questions.

            Comment

            Working...
            X