distance computation using geodist package

George Kariuki

Join Date: Jul 2015
Posts: 93

distance computation using geodist package

22 Mar 2017, 14:09

I have a simple data set with longitudes and latitudes for every observation. I wish to compute distances so I can get distances between households of less than 1 KM neighborhood for a simple analysis on spill over effects of a program intervention. Here is a subsample of the data.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(household_id longitude latitude age) byte gender float(treat spillover)
1377812168   34.37838    -.0773 56 2 0 0
2380826616   34.33859   -.16009 44 1 0 0
1565866906   34.34427   -.18608 49 2 0 0
1380843870   34.32025   -.17002 35 2 0 0
2477843882 34.2900575 -.3020293 21 2 0 0
2471770455 34.3183856 -.1710468 53 1 0 0
2677826567   34.32582    -.1656 32 2 0 0
2780838123 34.2978333 -.2856966 50 1 0 0
2477874129 34.3183612 -.2013775 64 2 0 0
2682855367 34.3398756   -.29708 25 2 0 0
2973803553 34.2818906 -.2580746 34 2 0 0
1287876988   34.29492   -.16718 36 1 0 0
2386773330 34.3095594 -.1632476 28 2 0 0
2778865478  34.323366   -.13397 69 1 0 0
1768766098 34.3550887 -.1689325 27 2 0 0
1082861160 34.3528961 -.3369288 25 2 0 0
2366810741   34.29499   -.18036 34 1 0 0
1376872680 34.2881262 -.3954781 27 2 0 0
1766897177 34.2652793 -.3464681 21 1 0 0
1382836658   34.29842   -.28795 44 1 0 0
end
label values longitude dkrf
label values latitude dkrf
label values age dkrf
label values gender gender
label def gender 1 "1. male", modify
label def gender 2 "2. female", modify

I don't seem to know how to generate this variable containing shortest distances from HH i to j which are obviously neighbors. I am using the

Code:

geodist

package due to Robert Picard (Highly appreciated). Could someone assist please?

Tags: None

Robert Picard

Join Date: Mar 2014
Posts: 1536

22 Mar 2017, 14:41

If the dataset is not too large, then the easiest is to form all pairwise combinations of households and calculate the distances between them. You the reduce the data to those that are within 1km. Note that the distance to self is zero. If there are more than a few thousand households, then use geonear (also from SSC).

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(household_id longitude latitude age) byte gender float(treat spillover)
1377812168   34.37838    -.0773 56 2 0 0
2380826616   34.33859   -.16009 44 1 0 0
1565866906   34.34427   -.18608 49 2 0 0
1380843870   34.32025   -.17002 35 2 0 0
2477843882 34.2900575 -.3020293 21 2 0 0
2471770455 34.3183856 -.1710468 53 1 0 0
2677826567   34.32582    -.1656 32 2 0 0
2780838123 34.2978333 -.2856966 50 1 0 0
2477874129 34.3183612 -.2013775 64 2 0 0
2682855367 34.3398756   -.29708 25 2 0 0
2973803553 34.2818906 -.2580746 34 2 0 0
1287876988   34.29492   -.16718 36 1 0 0
2386773330 34.3095594 -.1632476 28 2 0 0
2778865478  34.323366   -.13397 69 1 0 0
1768766098 34.3550887 -.1689325 27 2 0 0
1082861160 34.3528961 -.3369288 25 2 0 0
2366810741   34.29499   -.18036 34 1 0 0
1376872680 34.2881262 -.3954781 27 2 0 0
1766897177 34.2652793 -.3464681 21 1 0 0
1382836658   34.29842   -.28795 44 1 0 0
end
label values longitude dkrf
label values latitude dkrf
label values age dkrf
label values gender gender
label def gender 1 "1. male", modify
label def gender 2 "2. female", modify

save "statalist_data.dta", replace

* form all pairwise combinations of household
rename * =0
cross using "statalist_data.dta"

* distance between each combination
geodist latitude longitude latitude0 longitude0, gen(d)

* reduce to combinations that are within 1km
bysort household_id (d) household_id: keep if d < 1

format %12.0f house*
list household_id d household_id0, sepby(household_id)

Comment

George Kariuki

Join Date: Jul 2015

Posts: 93
#3

22 Mar 2017, 15:49

Many thanks to you Robert. Most appreciated. My sample had only 956 observations and thus the matrix of distances was quickly converging. Regards.
Comment

Ryan Kim

Join Date: Aug 2018
Posts: 26

10 Sep 2018, 16:45

Originally posted by Robert Picard View Post

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(household_id longitude latitude age) byte gender float(treat spillover)
1377812168 34.37838 -.0773 56 2 0 0
2380826616 34.33859 -.16009 44 1 0 0
1565866906 34.34427 -.18608 49 2 0 0
1380843870 34.32025 -.17002 35 2 0 0
2477843882 34.2900575 -.3020293 21 2 0 0
2471770455 34.3183856 -.1710468 53 1 0 0
2677826567 34.32582 -.1656 32 2 0 0
2780838123 34.2978333 -.2856966 50 1 0 0
2477874129 34.3183612 -.2013775 64 2 0 0
2682855367 34.3398756 -.29708 25 2 0 0
2973803553 34.2818906 -.2580746 34 2 0 0
1287876988 34.29492 -.16718 36 1 0 0
2386773330 34.3095594 -.1632476 28 2 0 0
2778865478 34.323366 -.13397 69 1 0 0
1768766098 34.3550887 -.1689325 27 2 0 0
1082861160 34.3528961 -.3369288 25 2 0 0
2366810741 34.29499 -.18036 34 1 0 0
1376872680 34.2881262 -.3954781 27 2 0 0
1766897177 34.2652793 -.3464681 21 1 0 0
1382836658 34.29842 -.28795 44 1 0 0
end
label values longitude dkrf
label values latitude dkrf
label values age dkrf
label values gender gender
label def gender 1 "1. male", modify
label def gender 2 "2. female", modify

save "statalist_data.dta", replace

* form all pairwise combinations of household
rename * =0
cross using "statalist_data.dta"

* distance between each combination
geodist latitude longitude latitude0 longitude0, gen(d)

* reduce to combinations that are within 1km
bysort household_id (d) household_id: keep if d < 1

format %12.0f house*
list household_id d household_id0, sepby(household_id)

Hi Robert,
Thank you for this post. I had a similar issue with George but I have panel data.
I tried running your code:

Code:

* form all pairwise combinations of household rename * =0 cross using "statalist_data.dta"

and it seems that my data expand to (i*i*t=The number of household*The number of household*The number of time periods) number of observations so Stata couldn't handle this since the number of observations becomes too large.
How do I overcome this issue when I have panel data?
Specifically, all I have to do is to calculate the distance between each id-pair like what George wanted to have.
But how do I create all unit-pairs without considering time dimension?

Thank you in advance.

Best,
Ryan

Comment

Robert Picard

Join Date: Mar 2014

Posts: 1536
#5

10 Sep 2018, 18:24

Hard to give advice without an example of your data and additional details on what you want. Do the locations change through time?
Comment

Ryan Kim

Join Date: Aug 2018
Posts: 26

11 Sep 2018, 18:28

Dear Robert,

I'm sorry for the vagueness.
No, the locations are the same over the time periods (time-invariant).

If I use an example data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(id year) float(lat lng) int price byte treatment
100 2000 45.49153 -73.583595  57 1
100 2001 45.49153 -73.583595  53 1
100 2002 45.49153 -73.583595  22 1
101 2000 45.50452 -73.563095 193 0
101 2001 45.50452 -73.563095 111 0
101 2002 45.50452 -73.563095 133 0
102 2001 45.47295 -73.617516 237 1
102 2002 45.47295 -73.617516 256 1
103 2002   45.493  -73.55708 122 0
103 2003   45.493  -73.55708 333 0
end

and then run the below codes,

Code:

rename * =0
cross using "test.dta"
geodist lat lng lat0 lng0, gen(d)

dataex in 1/30

The output looks like this (Stata generates all the possible ID*Year pairs so that the number of observations enormously increase in panel data setting):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(id0 year0) float(lat0 lng0) int price0 byte treatment0 int(id year) float(lat lng) int price byte treatment double d
100 2000 45.49153 -73.583595  57 1 100 2000 45.49153 -73.583595  57 1                  0
100 2001 45.49153 -73.583595  53 1 100 2000 45.49153 -73.583595  57 1                  0
100 2002 45.49153 -73.583595  22 1 100 2000 45.49153 -73.583595  57 1                  0
101 2000 45.50452 -73.563095 193 0 100 2000 45.49153 -73.583595  57 1 2.1570043536081713
101 2001 45.50452 -73.563095 111 0 100 2000 45.49153 -73.583595  57 1 2.1570043536081713
101 2002 45.50452 -73.563095 133 0 100 2000 45.49153 -73.583595  57 1 2.1570043536081713
102 2001 45.47295 -73.617516 237 1 100 2000 45.49153 -73.583595  57 1  3.361489722404731
102 2002 45.47295 -73.617516 256 1 100 2000 45.49153 -73.583595  57 1  3.361489722404731
103 2002   45.493  -73.55708 122 0 100 2000 45.49153 -73.583595  57 1  2.078806691172746
103 2003   45.493  -73.55708 333 0 100 2000 45.49153 -73.583595  57 1  2.078806691172746
100 2000 45.49153 -73.583595  57 1 100 2001 45.49153 -73.583595  53 1                  0
100 2000 45.49153 -73.583595  57 1 100 2002 45.49153 -73.583595  22 1                  0
100 2000 45.49153 -73.583595  57 1 101 2000 45.50452 -73.563095 193 0 2.1570043536084014
100 2000 45.49153 -73.583595  57 1 101 2001 45.50452 -73.563095 111 0 2.1570043536084014
100 2000 45.49153 -73.583595  57 1 101 2002 45.50452 -73.563095 133 0 2.1570043536084014
100 2000 45.49153 -73.583595  57 1 102 2001 45.47295 -73.617516 237 1  3.361489722404724
100 2000 45.49153 -73.583595  57 1 102 2002 45.47295 -73.617516 256 1  3.361489722404724
100 2000 45.49153 -73.583595  57 1 103 2002   45.493  -73.55708 122 0 2.0788066911727476
100 2000 45.49153 -73.583595  57 1 103 2003   45.493  -73.55708 333 0 2.0788066911727476
100 2001 45.49153 -73.583595  53 1 100 2001 45.49153 -73.583595  53 1                  0
100 2001 45.49153 -73.583595  53 1 100 2002 45.49153 -73.583595  22 1                  0
100 2001 45.49153 -73.583595  53 1 101 2000 45.50452 -73.563095 193 0 2.1570043536084014
100 2001 45.49153 -73.583595  53 1 101 2001 45.50452 -73.563095 111 0 2.1570043536084014
100 2001 45.49153 -73.583595  53 1 101 2002 45.50452 -73.563095 133 0 2.1570043536084014
100 2001 45.49153 -73.583595  53 1 102 2001 45.47295 -73.617516 237 1  3.361489722404724
100 2001 45.49153 -73.583595  53 1 102 2002 45.47295 -73.617516 256 1  3.361489722404724
100 2001 45.49153 -73.583595  53 1 103 2002   45.493  -73.55708 122 0 2.0788066911727476
100 2001 45.49153 -73.583595  53 1 103 2003   45.493  -73.55708 333 0 2.0788066911727476
100 2002 45.49153 -73.583595  22 1 100 2001 45.49153 -73.583595  53 1                  0
100 2002 45.49153 -73.583595  22 1 100 2002 45.49153 -73.583595  22 1                  0
end

If I try to run the codes using my actual data set, Stata gave me 'op. sys. refuses to provide memory'.
I assume that the number of observations of expanded data exceed the number of observations that Stata can handle.
I tried to allocate more memory to Stata, reboot the server, close other programs to increase memory capacity but still had the same issue (The server I use has 32GB memory & Stata MP).
For the brief description of my data set (just in case), it's panel data with 5,062 units, T=51, and the number of observations is 234,844.

So my question is:
Is there any way to form ID pairs, not ID*Time pairs, in panel data setting?

Please let me know if you're confused with any part.

Best,
Ryan

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

12 Sep 2018, 09:20

Thanks for the data example (and kudos for using dataex). Since the locations do not change over time, you can reduce the data to one observation per id and form all pairwise combinations using cross. With 5K units, that's 25M distances; it's doable but you are starting to push the limit of what can done with the brute force approach.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(id year) float(lat lng) int price byte treatment
100 2000 45.49153 -73.583595  57 1
100 2001 45.49153 -73.583595  53 1
100 2002 45.49153 -73.583595  22 1
101 2000 45.50452 -73.563095 193 0
101 2001 45.50452 -73.563095 111 0
101 2002 45.50452 -73.563095 133 0
102 2001 45.47295 -73.617516 237 1
102 2002 45.47295 -73.617516 256 1
103 2002   45.493  -73.55708 122 0
103 2003   45.493  -73.55708 333 0
end
save "dataex.dta", replace

* reduce to one observation per id, make sure that lat/lon does not change
bysort id lat lng: keep if _n == 1
by id: assert _n == 1
keep id lat lng treatment
save "dataex1.dta", replace

* form all pairwise combinations and calculate distances
rename * =0
cross using "dataex1.dta"
drop if id == id0
geodist lat lng lat0 lng0, gen(d) sphere

sort id0 d id
list id0 treatment0 d id treatment, sepby(id0)

and the results:

Code:

. list id0 treatment0 d id treatment, sepby(id0)

     +---------------------------------------------+
     | id0   treatm~0           d    id   treatm~t |
     |---------------------------------------------|
  1. | 100          1   2.0730181   103          0 |
  2. | 100          1   2.1538295   101          0 |
  3. | 100          1   3.3559457   102          1 |
     |---------------------------------------------|
  4. | 101          0   1.3640156   103          0 |
  5. | 101          0   2.1538295   100          1 |
  6. | 101          0   5.5063664   102          1 |
     |---------------------------------------------|
  7. | 102          1   3.3559457   100          1 |
  8. | 102          1   5.2122556   103          0 |
  9. | 102          1   5.5063664   101          0 |
     |---------------------------------------------|
 10. | 103          0   1.3640156   101          0 |
 11. | 103          0   2.0730181   100          1 |
 12. | 103          0   5.2122556   102          1 |
     +---------------------------------------------+

.

You do not give a hint of where you are going with this so it's hard to give further advice but assuming that you just want the nearest unit, you can use geonear (from SSC) as follows:

Code:

use "dataex1.dta", clear
rename id id0
geonear id0 lat lng using "dataex1.dta", n(id lat lng) ignoreself
list

and the results:

Code:

. list

     +---------------------------------------------------------+
     | id0        lat         lng   treatm~t   nid   km_to_nid |
     |---------------------------------------------------------|
  1. | 100   45.49153    -73.5836          1   103   2.0730181 |
  2. | 101   45.50452    -73.5631          0   103   1.3640156 |
  3. | 102   45.47295   -73.61752          1   100   3.3559457 |
  4. | 103     45.493   -73.55708          0   101   1.3640156 |
     +---------------------------------------------------------+

.

Once you have the information you want at the unit-level, you can merge it back with the original panel data.

Comment

Ryan Kim

Join Date: Aug 2018
Posts: 26

12 Sep 2018, 18:46

I appreciate your reply.

What I want to have in the end is just the distance var and the value for every ID pair (I'm going to use the distance var as one of my vars of interest; the distance between the focal unit i and other unit(s) j) .

Based on your comments, Below is what I've done.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(id year) float(lat lng) int price byte treatment
100 2000 45.49153 -73.583595  57 1
100 2001 45.49153 -73.583595  53 1
100 2002 45.49153 -73.583595  22 1
101 2000 45.50452 -73.563095 193 0
101 2001 45.50452 -73.563095 111 0
101 2002 45.50452 -73.563095 133 0
102 2001 45.47295 -73.617516 237 1
102 2002 45.47295 -73.617516 256 1
103 2002   45.493  -73.55708 122 0
103 2003   45.493  -73.55708 333 0
end

Code:

save "dataex.dta", replace

bysort id lat lng: keep if _n == 1
by id: assert _n == 1
keep id lat lng treatment

save "dataex1.dta", replace

rename * =0
cross using "dataex1.dta"
geodist lat lng lat0 lng0, gen(d)
sort id0 id

and then I'll get the result:

Code:

 list

     +---------------------------------------------------------------------------------------+
     | id0      lat0       lng0   treatm~0    id       lat        lng   treatm~t           d |
     |---------------------------------------------------------------------------------------|
  1. | 100   45.4915   -73.5836          1   100   45.4915   -73.5836          1           0 |
  2. | 100   45.4915   -73.5836          1   101   45.5045   -73.5631          0   2.1570044 |
  3. | 100   45.4915   -73.5836          1   102   45.4729   -73.6175          1   3.3614897 |
  4. | 100   45.4915   -73.5836          1   103    45.493   -73.5571          0   2.0788067 |
  5. | 101   45.5045   -73.5631          0   100   45.4915   -73.5836          1   2.1570044 |
     |---------------------------------------------------------------------------------------|
  6. | 101   45.5045   -73.5631          0   101   45.5045   -73.5631          0           0 |
  7. | 101   45.5045   -73.5631          0   102   45.4729   -73.6175          1   5.5150794 |
  8. | 101   45.5045   -73.5631          0   103    45.493   -73.5571          0   1.3646901 |
  9. | 102   45.4729   -73.6175          1   100   45.4915   -73.5836          1   3.3614897 |
 10. | 102   45.4729   -73.6175          1   101   45.5045   -73.5631          0   5.5150794 |
     |---------------------------------------------------------------------------------------|
 11. | 102   45.4729   -73.6175          1   102   45.4729   -73.6175          1           0 |
 12. | 102   45.4729   -73.6175          1   103    45.493   -73.5571          0   5.2238443 |
 13. | 103    45.493   -73.5571          0   100   45.4915   -73.5836          1   2.0788067 |
 14. | 103    45.493   -73.5571          0   101   45.5045   -73.5631          0   1.3646901 |
 15. | 103    45.493   -73.5571          0   102   45.4729   -73.6175          1   5.2238443 |
     |---------------------------------------------------------------------------------------|
 16. | 103    45.493   -73.5571          0   103    45.493   -73.5571          0           0 |
     +---------------------------------------------------------------------------------------+

save this file in a different name and then load the original data to merge each other.

Code:

save "dataex1_crossed.dta"
use "dataex.dta", clear
merge m:m id using "dataex1_crossed.dta"

sort id year id0

The result is:

Code:

list

     +--------------------------------------------------------------------------------------------------------------------+
     |  id   year       lat        lng   price   treatm~t   id0      lat0       lng0   treatm~0           d        _merge |
     |--------------------------------------------------------------------------------------------------------------------|
  1. | 100   2000   45.4915   -73.5836      57          1   103    45.493   -73.5571          0   2.0788067   matched (3) |
  2. | 100   2001   45.4915   -73.5836      53          1   100   45.4915   -73.5836          1           0   matched (3) |
  3. | 100   2002   45.4915   -73.5836      22          1   101   45.5045   -73.5631          0   2.1570044   matched (3) |
  4. | 100   2002   45.4915   -73.5836      22          1   102   45.4729   -73.6175          1   3.3614897   matched (3) |
  5. | 101   2000   45.5045   -73.5631     193          0   102   45.4729   -73.6175          1   5.5150794   matched (3) |
     |--------------------------------------------------------------------------------------------------------------------|
  6. | 101   2001   45.5045   -73.5631     111          0   100   45.4915   -73.5836          1   2.1570044   matched (3) |
  7. | 101   2002   45.5045   -73.5631     133          0   101   45.5045   -73.5631          0           0   matched (3) |
  8. | 101   2002   45.5045   -73.5631     133          0   103    45.493   -73.5571          0   1.3646901   matched (3) |
  9. | 102   2001   45.4729   -73.6175     237          1   102   45.4729   -73.6175          1           0   matched (3) |
 10. | 102   2002   45.4729   -73.6175     256          1   100   45.4915   -73.5836          1   3.3614897   matched (3) |
     |--------------------------------------------------------------------------------------------------------------------|
 11. | 102   2002   45.4729   -73.6175     256          1   101   45.5045   -73.5631          0   5.5150794   matched (3) |
 12. | 102   2002   45.4729   -73.6175     256          1   103    45.493   -73.5571          0   5.2238443   matched (3) |
 13. | 103   2002    45.493   -73.5571     122          0   102   45.4729   -73.6175          1   5.2238443   matched (3) |
 14. | 103   2003    45.493   -73.5571     333          0   100   45.4915   -73.5836          1   2.0788067   matched (3) |
 15. | 103   2003    45.493   -73.5571     333          0   101   45.5045   -73.5631          0   1.3646901   matched (3) |
     |--------------------------------------------------------------------------------------------------------------------|
 16. | 103   2003    45.493   -73.5571     333          0   103    45.493   -73.5571          0           0   matched (3) |
     +--------------------------------------------------------------------------------------------------------------------+

After all the steps, I see that there are all the possible ID pairs and I didn't lose any original ID*Year observations.
However, I noticed that I couldn't declare this data set as panel in Stata by using

Code:

xtset id year

since there exist duplicates.
If I drop the duplicates to set the data as panel, I will lose some pairwise combinations (some observations).
Is there any way to overcome this issue?

Best,
Ryan

Last edited by Ryan Kim; 12 Sep 2018, 18:55.

Announcement

distance computation using geodist package

Comment

Comment

Comment

Comment

Comment

Comment

Comment