  • New version of geoinpoly on SSC

    Thanks to Kit Baum, a new version of geoinpoly is now available on SSC.

    geoinpoly is a new command on SSC that was recently announced here. geoinpoly finds, for geographic locations in memory, polygons from a shapefile that spatially overlay the points. This is a point-in-polygon operation, also known as a spatial join in GIS (geographic information system) terminology.

    geoinpoly incorporates the same divide-and-conquer strategy used in geonear (from SSC) to significantly reduce the number of polygon segments that need to be counted. However, I wasn't completely happy with the sluggish performance of geoinpoly when a small number of points were matched to a very detailed shapefile containing lots of polygons (see this recent example using U.S. Census tracts), so I decided to completely rewrite it in Mata. I'm happy to report a 10-15x increase in performance! It can now match 1,000,000 random points to the Michigan Census Tract shapefile in the example above (made up of more than 500,000 line segments) in 12 seconds on my computer.

    There are no other changes and the new version replicates exactly the results produced by the original version (the exact same algorithm is used).

    To update to the new version, type in Stata's command window

    Code:
    adoupdate geoinpoly, update
    or use

    Code:
    ssc install geoinpoly, replace
    If you are a first-time user, you can install it using

    Code:
    ssc install geoinpoly
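
    For reference, a typical first workflow looks something like this, assuming the shapefile has already been converted with shp2dta (from SSC); the file and variable names below are placeholders, and help geoinpoly has the exact syntax:

    Code:
    * convert the shapefile into an attribute database and a coordinates dataset;
    * genid() adds a polygon identifier to the database that matches _ID in
    * the coordinates dataset
    shp2dta using "tracts.shp", database(tracts_db) coordinates(tracts_xy) genid(_ID)
    
    * with point coordinates lat/lon in memory, find the polygon that spatially
    * overlays each point; geoinpoly stores the polygon identifier in _ID
    use "points.dta", clear
    geoinpoly lat lon using "tracts_xy.dta"
    
    * attach polygon attributes via the matched identifier
    merge m:1 _ID using "tracts_db.dta", keep(master match) nogenerate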

  • #2
    First of all, thanks for creating this command; I've used it many times already and it has been very helpful.

    That said, I have a few questions/suggestions:
    - It would be great to have a progress tracker. The one in geonear (thank you for that command too!) is rather awesome, and once you get used to it, you start missing one in geoinpoly, especially when approaching millions of points and polygons.
    - Is there a limit to how many polygons a point can belong to? At some point I ran into a "subscript invalid" error in the geoinpoly_main() call, which was solved when I reduced the size of my polygons (and thus each point belonged to fewer polygons).
    - My interest at the moment is calculating how many other restaurants are within a 5km radius of each restaurant. The way I calculate that now is to use geocircles (another thank you!) to create the 5km-radius polygons for each restaurant, and then use geoinpoly to determine which polygons each restaurant belongs to. Given that I have 770k US restaurants, many of them close to one another, this step seems to be taking a very long time. In the end I am not terribly interested in which restaurants are in whose vicinity, just the actual count. Do you know of a more computationally efficient way to get this count?

    Thanks!
    Jesse



    • #3
      I must admit that this is an unanticipated use of geoinpoly (from SSC), but it should be able to handle overlapping polygons without throwing an error. Can you send me an example that replicates the error?

      On the other hand, if all you want is to count the number of restaurants within a 5km radius, you should be able to get there directly using geonear (from SSC). There was a similar question recently here. It is possible there will be too many matches to do it all in a single call but you could easily break down the problem into smaller groups, using zip codes for example. The following creates a dataset with 200 restaurants in two zip codes (100 in each zip code):

      Code:
      clear all
      set seed 123456
      input str5 zip double(zlat zlon)
      "48104" 42.2 -83.7
      "48105" 42.3 -83.6
      end
      
      expand 100
      gen long resto = _n
      gen double lat = zlat + runiform(-.05, .05)
      gen double lon = zlon + runiform(-.05, .05)
      save "statalist_ex.dta", replace
      Since this is a small problem, you can find the number of restaurants within 5km of each restaurant using a brute-force approach: form all pairwise combinations of restaurants (using the cross command), calculate the distance between each pair using geodist (from SSC), prune the list down to pairs within 5km, and finally count them:

      Code:
      use "statalist_ex.dta", clear
      keep resto lat lon
      rename (*) =0
      cross using "statalist_ex.dta"
      geodist lat0 lon0 lat lon, gen(d) sphere
      keep if d < 5
      sort resto0 d resto
      by resto0: gen wanted = _N
      by resto0: keep if _n == 1
      keep *0 wanted
      list in 1/10
      
      keep resto0 wanted
      save "statalist_res.dta", replace
      Here are the results for the first 10 restaurants:
      Code:
      . list in 1/10
      
           +------------------------------------------+
           | resto0        lat0         lon0   wanted |
           |------------------------------------------|
        1. |      1   42.167996   -83.730916       41 |
        2. |      2    42.32752   -83.649335       28 |
        3. |      3   42.237668   -83.735044       34 |
        4. |      4   42.215702   -83.746309       41 |
        5. |      5   42.167292   -83.739201       37 |
           |------------------------------------------|
        6. |      6   42.207267   -83.729848       63 |
        7. |      7   42.185725   -83.657358       46 |
        8. |      8   42.240323   -83.731942       34 |
        9. |      9   42.166812    -83.66838       45 |
       10. |     10   42.195759   -83.664808       50 |
           +------------------------------------------+
      
      .
      For larger data, the number of pairwise combinations required by the brute-force approach can easily exceed the memory available even on the most powerful computers, but you can get there using geonear (from SSC). Here's how to do the same as above using geonear:
      Code:
      use "statalist_ex.dta", clear
      keep resto lat lon
      rename (*) =0
      geonear resto0 lat0 lon0 using "statalist_ex.dta", neighbor(resto lat lon) long within(5)
      sort resto0 km_to_resto resto
      by resto0: gen wanted = _N
      by resto0: keep if _n == 1
      keep resto0 wanted
      list in 1/10
      
      cf _all using "statalist_res.dta", all
      The last instruction verifies that the results match those of the brute force approach above. Here are the results for the whole run:
      Code:
      . use "statalist_ex.dta", clear
      
      . keep resto lat lon
      
      . rename (*) =0
      
      . geonear resto0 lat0 lon0 using "statalist_ex.dta", neighbor(resto lat lon) long within(5)
      
      -------------------------------------------------------------------------------
      Unique base locations   = 200            Unique neighbor locations = 200       
      Bases * Neighbors       = 40,000         Number of regions         = 4         
      Computed distances      = 22,410         Total run time (seconds)  = .026
      -------------------------------------------------------------------------------
      
      . sort resto0 km_to_resto resto
      
      . by resto0: gen wanted = _N
      
      . by resto0: keep if _n == 1
      (9,800 observations deleted)
      
      . keep resto0 wanted
      
      . list in 1/10
      
           +-----------------+
           | resto0   wanted |
           |-----------------|
        1. |      1       41 |
        2. |      2       28 |
        3. |      3       34 |
        4. |      4       41 |
        5. |      5       37 |
           |-----------------|
        6. |      6       63 |
        7. |      7       46 |
        8. |      8       34 |
        9. |      9       45 |
       10. |     10       50 |
           +-----------------+
      
      . 
      . cf _all using "statalist_res.dta", all
                resto0:  match
                wanted:  match
      
      .
      Depending on the desired distance threshold, it is possible that urban areas with a high concentration of restaurants will generate so many matches that the geonear output becomes unmanageable. In that case, you can use runby (from SSC) to break the problem down into smaller groups. In the following example, I repeat what was done above, handling each zip code separately. Note that neighboring restaurants can come from any zip code:
      Code:
      clear all
      use "statalist_ex.dta"
      keep zip resto lat lon
      rename (*) =0
      
      program do_a_zip
          geonear resto0 lat0 lon0 using "statalist_ex.dta", neighbor(resto lat lon) long within(5)
          sort resto0 km_to_resto resto
          by resto0: gen wanted = _N
          by resto0: keep if _n == 1
          keep resto0 wanted
      end
      
      runby do_a_zip, by(zip0) verbose
      isid resto0, sort
      list in 1/10
      cf _all using "statalist_res.dta", all
      and here are the results:
      Code:
      . runby do_a_zip, by(zip0) verbose
      
      -------------------------------------------------------------------------------
      Unique base locations   = 100            Unique neighbor locations = 200       
      Bases * Neighbors       = 20,000         Number of regions         = 2         
      Computed distances      = 11,200         Total run time (seconds)  = .012
      -------------------------------------------------------------------------------
      (4,966 observations deleted)
      
      -------------------------------------------------------------------------------
      Unique base locations   = 100            Unique neighbor locations = 200       
      Bases * Neighbors       = 20,000         Number of regions         = 2         
      Computed distances      = 11,210         Total run time (seconds)  = .013
      -------------------------------------------------------------------------------
      (4,834 observations deleted)
      
      --------------------------------------
      Number of by-groups    =             2
      by-groups with errors  =             0
      by-groups with no data =             0
      Observations processed =           200
      Observations saved     =           200
      --------------------------------------
      
      . isid resto0, sort
      (data now sorted by resto0)
      
      . list in 1/10
      
           +-----------------+
           | resto0   wanted |
           |-----------------|
        1. |      1       41 |
        2. |      2       28 |
        3. |      3       34 |
        4. |      4       41 |
        5. |      5       37 |
           |-----------------|
        6. |      6       63 |
        7. |      7       46 |
        8. |      8       34 |
        9. |      9       45 |
       10. |     10       50 |
           +-----------------+
      
      . cf _all using "statalist_res.dta", all
                resto0:  match
                wanted:  match
      
      .
      With runby, whatever is left in memory when the program terminates is considered results and accumulates across by-groups. Once you are sure that you have the problem set up correctly (using a small subset of your data), you should remove the verbose option and use the status option when you run this on the whole sample. It will track elapsed time and also provide a very good approximation of the time remaining.
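
      For example, once everything checks out on the subset, the full run could be launched with the same program as above, swapping verbose for status:

      Code:
      runby do_a_zip, by(zip0) status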



      • #4
        For some reason I cannot manage to reproduce the "subscript invalid" error, even though it reliably popped up yesterday. The geonear solution works like a charm, only three hours to go! Am I correct in understanding that the runby version is an elegant way to circumvent memory restrictions? In this context that is not an issue for me, but it sounds like the kind of tool that will at some point come in very handy.



        • #5
          No problem if you can't replicate the issue; I've made a note to look into it if I ever revisit the code. As I said, I never anticipated a situation where each point would have its own polygon, which means that in dense areas a single restaurant could match thousands of polygons.

          With nearest-neighbor problems, the easiest and most obvious solution is to calculate distances between all pairs of points. That works for a sample of 1,000 (1,000,000 distances), becomes burdensome with a sample of 10,000 (100,000,000 distances), and is downright infeasible for large samples like yours (592,900,000,000 distances). Obviously, if you are looking for close neighbors of a New York restaurant, there's no reason to calculate distances to restaurants in California. That is the strategy geonear implements: it splits the data into small geographic areas and, for the purposes of calculating distances, only considers points that are near each area. This works very well if you need the nearest neighbor or just a few close neighbors. However, when the data are clustered like yours (restaurants in urban areas), the radius you choose may end up matching lots of restaurants in total. My intuition is that if you tried to do this in a single go with geonear and a radius of 5km, you would end up with results in excess of 100M observations (and perhaps many more). Collapsing such a large dataset to one observation per restaurant requires sorting the data, and that is very slow on datasets of that size.
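
          As a quick sanity check on that last figure, the pairwise count is just the sample size squared; in Stata:

          Code:
          * 770,000 restaurants imply 770,000^2 = 592,900,000,000 pairwise distances
          display %16.0fc 770000 * 770000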

          The advantage of using runby in this situation is that, since you are only interested in the neighbor count, you can find all the neighbors for restaurants in a specific zip code and immediately collapse the data to one observation per restaurant in that zip code. You then move on to the next zip code and do the same thing. Results from each zip code accumulate, and in the end you get your 770K observations. With this strategy, geonear returns exactly the same set of neighbors, but the collapse is handled zip code by zip code, and therefore on a much smaller subset.



          • #6
            I ended up with 400M observations. Luckily gcollapse (thank you, Mauricio Caceres) can handle files of that size with relative ease; I think the collapse took only a handful of minutes. geonear finished running in an hour or two, which is also really impressive: the divide-and-conquer strategy really does work very well!
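
            For reference, here is roughly what my counting step with gcollapse looked like, assuming the long geonear output (one observation per base-neighbor pair) is in memory, with the variable names from the examples above:

            Code:
            * one observation per pair within 5km; count neighbors per base restaurant
            gcollapse (count) wanted = resto, by(resto0)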
