Geonear command

Priver JM

Join Date: Feb 2019

Posts: 30
#1

09 Aug 2021, 14:29

First off, I cannot use dataex to show my current dataset because of a data permission issue.
However, I believe the data format is easy to understand.

I have two separate data files on which to run the "geonear" command. One has a total of 1643 observations, each with a unique id (variable named id1), a latitude, and a longitude (e.g., lat: 29.631 long: -81.7397). The other has 1920 observations, each with a unique id (variable named id2), a latitude, and a longitude. The ids in the two files do not overlap. Below is the code I used to identify each id2 within 10 km of each id1.

Code:

geonear id1 lat lon using "houses.dta", n(id2 lat lon) within(10) long

After I ran this code, for whatever reason, only one id2 was identified per id1, regardless of the specified distance (10 km). To be more specific, each of the 1643 observations in id1 was matched to just one of the 1920 observations in id2, and all of the distances between them were around 9000 km, which is not plausible. I do not understand this result. I expected to see every possible pair of id1 and id2 within 10 km so that I could count the number of locations per id1. Could a limit on dataset size be causing this? Thank you.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#2

09 Aug 2021, 15:44

First of all, note that -geonear- is a community-contributed program available at SSC. As an occasional user and an on-the-record admirer of it, I simulated some data and dug into the help file for -geonear-, in which I found: "In long form mode (see long option), geonear returns one observation per neighbor found. [my emphasis] The nearcount(#) option can be used to request a specific number of neighbors per baseid. The within(#) option can be used to request all neighbors within a distance of # from each baseid. You can combine within(#) with nearcount(#) and geonear will return any nborid that satisfies either condition."

So, -geonear- will return one neighbor even if that neighbor does not meet the within(10) specification. Including -nearcount(0)- should exclude the neighbors you mention at a distance of 9000 km. However, I wonder if there is an issue with your data rather than the program. In your situation, I would troubleshoot by creating files with just a few of your original data points, selected so that you *know* that e.g. there will be 3 neighbors within 10 km of (say) id1 == 1, and seeing if -geonear- properly finds the neighbors. And, perhaps first, I would look in my original file (-sort lat lon- would be helpful) to see if there are any neighbors you can find by eye that -geonear- is not finding. Maybe such close neighbors don't commonly exist in your data files? Maybe there is some problem with the lat/lon variables?

Here's a simulation (middle U.S. lat/lon values) that works well enough, I think, for within(10) and within(100). Without nearcount(0), it returns some neighbors that do not meet within(10).

Code:

clear
// simulate data
set seed 35530
set obs 1920
gen int id2 = _n
gen double lat = 40 + runiform() * 10
gen double lon = -100 + 20 * runiform()
tempfile houses
save `houses'
clear
set obs 1643
gen int id1 = _n
gen double lat = 40 + runiform() * 10
gen double lon = -100 + 20 * runiform()
//
geonear id1 lat lon using `houses', n(id2 lat lon) within(10) nearcount(0) long
bysort id1: gen int naycount = _N
tab naycount
summ km_to_id2
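
The small known-answer test suggested in #2 could be sketched roughly as follows. This is only an illustration: all coordinates are invented, the variable names follow #1, and it assumes -geonear- is installed from SSC and that nearcount(0) behaves as described in #2. One degree of latitude is roughly 111 km, so houses 1 to 3 sit about 1.1, 2.2 and 5.6 km north of the single base point, while houses 4 and 5 are well over 100 km away.

```stata
* Known-answer test for geonear (requires: ssc install geonear)
* All ids and coordinates below are invented for illustration.
clear
input int id2 double(lat lon)
1 40.01 -100
2 40.02 -100
3 40.05 -100
4 41.00 -100
5 42.00 -100
end
* basic coordinate checks on the neighbor file
assert inrange(lat, -90, 90) & inrange(lon, -180, 180)
tempfile houses
save `houses'

clear
input int id1 double(lat lon)
1 40 -100
end
* the same checks on the base file
assert inrange(lat, -90, 90) & inrange(lon, -180, 180)

* nearcount(0) is the option suggested in #2 to suppress the
* default nearest neighbor when it lies outside within(10)
geonear id1 lat lon using `houses', n(id2 lat lon) within(10) nearcount(0) long
list, sepby(id1)
```

If the listing shows houses 1, 2 and 3 and nothing else, the command is doing what the help file describes and attention can shift to the real data. Note that the range checks catch impossible values but would not catch latitude and longitude stored in each other's variables when both happen to fall inside [-90, 90].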
Priver JM

Join Date: Feb 2019

Posts: 30
#3

09 Aug 2021, 19:40

I appreciate your help. In fact, the problem turned out to be a matter of sorting. After sorting on latitude and longitude, it works well. Thank you!

Last edited by Priver JM; 09 Aug 2021, 19:47.
Nick Cox

Join Date: Mar 2014

Posts: 35697
#4

09 Aug 2021, 23:41

Good that you got such a helpful answer. For future reference, please note that the question of data permission is already addressed in the FAQ Advice:

Code:

If your dataset is confidential, then provide a fake example instead.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#5

10 Aug 2021, 05:35

First, following on Priver JM's last posting: I wonder if there's a misunderstanding here regarding -sort-. I can't think of any reason why sorting would change the behavior of -geonear-, and there's no mention of the use of -sort- in its help file. My use of -bysort- was just as a means to count the number of neighbors found for each id1. I'd encourage Priver JM to post some code or otherwise explain how they used -sort- in a way that appeared to make -geonear- work. If there's a bug in -geonear-, tracking it down would be a service. If not, some other issue prevails, perhaps detrimental to the work here.

Second, regarding Nick's comment: I've had the idea that adding an automatic randomizing option to -dataex- could be useful for confidential datasets, since many people asking questions that require its use don't know how to do this themselves. Another possibility would be to include some demonstration code in the -dataex- help file to show how to create randomized fake data.
Nick Cox

Join Date: Mar 2014

Posts: 35697
#6

10 Aug 2021, 09:55

On the last point in #5: We (Robert Picard and I) did think of something similar, but decided that it would complicate the code and, quite often, create more problems than it solves. The idea that posters should know what is sensitive and needs obscuring and that they should take responsibility for doing that still seems best to me. For one, sometimes it is the particular variables -- not their values -- that people don't want to disclose because they have an idea they want to keep quiet about until they know it's good or they have published.

Besides, the code for dataex is completely public. Nothing on our side stops anyone from writing their own variant. The point is not that you use dataex but that you thereby present code that others can use to input a data example without struggling to see what someone else's data look like.

I think it is usually easier for people to use random number functions or to type in fictitious data! If people want help, the onus is on them to create examples.
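
As a minimal sketch of the random-number route, assuming a confidential file shaped like the one in #1: keep the real variable names and storage types but generate every value. The seed, the number of observations and the coordinate ranges here are arbitrary choices for illustration, not taken from anyone's data.

```stata
* Fake stand-in for a confidential file: real names and types, invented values
clear
set seed 12345
set obs 10
gen int id1 = _n
* runiform() draws from (0,1), so lat falls in (29,30) and lon in (-82,-81)
gen double lat = 29 + runiform()
gen double lon = -82 + runiform()
* dataex then writes the example as input code that others can paste and run
dataex id1 lat lon
```

Because nothing in the example derives from the original records, it can be posted freely while still letting others reproduce the commands.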

Priver JM

Join Date: Feb 2019
Posts: 30

10 Aug 2021, 11:39

Following on Mike Lacy's posting, here is a simple example of how I used -sort- on each dataset. I also wonder why -sort- changed the outcome. The following code produced the same result with simulated data.
Again, all I did was use -sort- before saving each dataset.

Code:

* toy school data
clear
set seed 3214132
set obs 5
gen school = _n
gen double lat = runiform()
gen double lon = runiform()
sort lat lon
save "schools.dta", replace

* toy house data
clear
set obs 50
gen house = _n
gen double lat = runiform()
gen double lon = runiform()
sort lat lon
save "houses.dta", replace

* find houses within 15km of each school
* this returns at least one house per school even if it's not within 15km
use "schools.dta", clear
geonear school lat lon using "houses.dta", n(house lat lon) within(15) long

* number of houses
bysort school (house): egen within15 = total(km_to_house <= 15)

list, sepby(school)

* return to one obs per school
by school: keep if _n == 1
drop house km_to_house
list


Mike Lacy

Join Date: Apr 2014

Posts: 2416
#8

10 Aug 2021, 14:43

For what it's worth, for the example just posted, I get the same result file (listed right after the -geonear- command) whether or not the school and house files are sorted before saving. I guess I don't understand what the problem is.
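
One way to make that comparison explicit, offered as a sketch rather than a diagnosis, is to run -geonear- once against an unsorted neighbor file and once against a sorted copy, then compare the two result sets with the official -cf- command. The seed, tempfile names and sizes below are illustrative and loosely follow #7; -geonear- must be installed from SSC.

```stata
* Compare geonear results for unsorted vs sorted neighbor files
clear
set seed 3214132
set obs 50
gen house = _n
gen double lat = runiform()
gen double lon = runiform()
tempfile houses_unsorted houses_sorted
save `houses_unsorted'
sort lat lon
save `houses_sorted'

clear
set obs 5
gen school = _n
gen double lat = runiform()
gen double lon = runiform()
tempfile schools
save `schools'

* run 1: unsorted neighbor file
geonear school lat lon using `houses_unsorted', n(house lat lon) within(15) long
sort school house
tempfile res_unsorted
save `res_unsorted'

* run 2: sorted neighbor file
use `schools', clear
geonear school lat lon using `houses_sorted', n(house lat lon) within(15) long
sort school house

* cf stops with an error if any variable differs between the two results
cf _all using `res_unsorted'
```

If -cf- completes without error, the sort order of the input files made no difference in this example, which would point back toward the data or the original workflow.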