I am trying to calculate pairwise distances similar to those in the post here:
https://www.statalist.org/forums/for...ating-distance
The code for that post is pasted here for ease.
My problem is that when I run the joinby step (quoted below), the resulting pairwise dataset has more than 2 billion observations.
Is there a way to split up and speed up the job?
Is it possible to:
i) do a loop by group to calculate the total and average distance for each group (year division)
ii) then collapse the data at the group level
iii) save the file
iv) append all the group files once all groups are complete
Or, if there is an easier way to do this, that would be fine too. I don't know how to tackle this; it is a very unusual problem for me.
Thanks in advance.
Code:
clear
input str10 id double year byte division float(latitude longitude)
"000361105" 2006 3 41.9663 -87.98085
"000361105" 2007 3 41.9663 -87.98085
"000361105" 2008 3 41.9663 -87.98085
"000886309" 2006 4 44.86384 -93.43001
"000886309" 2007 4 44.86384 -93.43001
"000886309" 2008 4 44.86384 -93.43001
"000957100" 2006 2 40.70963 -74.01297
"000957100" 2007 2 40.70963 -74.01297
"000957100" 2008 2 40.70963 -74.01297
"001084102" 2006 5 33.9743 -84.14539
"001084102" 2007 5 33.9743 -84.14539
"001084102" 2008 5 33.9743 -84.14539
"00130H105" 2006 5 38.87373 -77.11727
"00130H105" 2007 5 38.87373 -77.11727
"00130H105" 2008 5 38.87373 -77.11727
"001547108" 2008 3 39.34669 -84.41348
"001744101" 2006 9 32.945763 -117.21492
"001744101" 2007 9 32.945763 -117.21492
"001744101" 2008 9 32.945763 -117.21492
"00206R102" 2006 7 32.77925 -96.80477
"00206R102" 2007 7 32.77925 -96.80477
"00206R102" 2008 7 32.77925 -96.80477
"002567105" 2008 9 37.603153 -122.01864
"002824100" 2006 3 42.32403 -87.8567
"002824100" 2007 3 42.32403 -87.8567
"002824100" 2008 3 42.32403 -87.8567
"002896207" 2006 3 40.08307 -82.7989
"002896207" 2007 3 40.08307 -82.7989
"002896207" 2008 3 40.08307 -82.7989
"004498101" 2006 5 26.190046 -81.76478
"004498101" 2007 5 26.190046 -81.76478
"004498101" 2008 5 26.190046 -81.76478
"00508Y102" 2006 5 33.799896 -84.3858
"00508Y102" 2007 5 33.799896 -84.3858
"00508Y102" 2008 5 33.799896 -84.3858
"00724F101" 2006 9 37.34652 -121.90983
"00724F101" 2007 9 37.34652 -121.90983
"00724F101" 2008 9 37.34652 -121.90983
"00738A106" 2006 6 34.7608 -86.68575
"00738A106" 2007 6 34.7608 -86.68575
"00738A106" 2008 6 34.7608 -86.68575
"00751Y106" 2006 5 35.822914 -78.56207
"00751Y106" 2007 5 35.822914 -78.56207
"00751Y106" 2008 5 35.822914 -78.56207
"00770F104" 2006 4 38.64421 -90.65141
"00770F104" 2007 4 38.64421 -90.65141
"00770F104" 2008 4 38.64421 -90.65141
"007800105" 2006 9 33.91714 -118.40427
"007800105" 2007 9 33.91714 -118.40427
"007800105" 2008 9 33.91714 -118.40427
"007903107" 2006 9 37.39349 -121.96467
"007903107" 2007 9 37.39349 -121.96467
"007903107" 2008 9 37.39349 -121.96467
"007973100" 2006 8 40.52986 -105.03664
"007973100" 2007 8 40.52986 -105.03664
"007973100" 2008 8 40.52986 -105.03664
"008190100" 2006 2 40.77724 -74.392944
"008190100" 2007 2 40.77724 -74.392944
"008190100" 2008 2 40.77724 -74.392944
"00846U101" 2006 9 37.34864 -121.98438
"00846U101" 2007 9 37.34864 -121.98438
"00846U101" 2008 9 37.34864 -121.98438
"009158106" 2006 2 40.58427 -75.62479
"009158106" 2007 2 40.58427 -75.62479
"009158106" 2008 2 40.58427 -75.62479
"009363102" 2006 2 40.06187 -75.40246
"009363102" 2007 2 40.06187 -75.40246
"009363102" 2008 2 40.06187 -75.40246
"00971T101" 2007 1 42.362 -71.08159
"00971T101" 2008 1 42.362 -71.08159
"011659109" 2006 9 47.44657 -122.2721
"011659109" 2007 9 47.44657 -122.2721
"011659109" 2008 9 47.44657 -122.2721
"012348108" 2006 1 43.30168 -70.99177
"012348108" 2007 1 43.30168 -70.99177
"012348108" 2008 1 43.30168 -70.99177
"012653101" 2006 5 35.178715 -80.85399
"012653101" 2007 5 35.178715 -80.85399
"012653101" 2008 5 35.178715 -80.85399
"013078100" 2008 3 41.90414 -87.86066
end

egen group = group(year division), label
preserve
rename (*) (*2)
rename group2 group
tempfile file2
save `file2'
restore
joinby group using `file2'
bys group: egen count = count(id)
drop if id==id2 & count>1
geodist latitude longitude latitude2 longitude2, gen(dist)
bys id year group: egen tot_dist = total(dist)
bys id year group: egen pairs = count(id)
bys id year group: gen avg_dist = tot_dist/pairs
keep id year group tot_dist avg_dist division
duplicates drop id year group, force
Code:
joinby group using `file2'
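For steps (i)-(iv), something like the following is what I have in mind, though I have not been able to test it at scale. It reuses the logic from the code above but restricts the joinby to one (year, division) group at a time; the tempfile names and the final results.dta file name are placeholders, and it assumes the community-contributed geodist command (ssc install geodist) is installed:

Code:
* Sketch only: loop over each (year, division) group, join within the
* group, compute distances, collapse, save, then append the pieces.
egen group = group(year division), label
levelsof group, local(groups)

tempfile master
save `master'

foreach g of local groups {
    use if group == `g' using `master', clear

    * duplicate this group's observations for the within-group join
    preserve
    rename (*) (*2)
    rename group2 group
    tempfile pairs
    save `pairs'
    restore

    joinby group using `pairs'
    bys group: egen count = count(id)
    drop if id == id2 & count > 1      // drop self-pairs, keep singletons

    geodist latitude longitude latitude2 longitude2, gen(dist)

    * total and average distance per firm-year within the group
    collapse (sum) tot_dist = dist (mean) avg_dist = dist, ///
        by(id year division group)

    tempfile part`g'
    save `part`g''
}

* append all the group files once all groups are complete
clear
foreach g of local groups {
    append using `part`g''
}
save results, replace

Since each joinby now only squares one group's observations instead of the whole dataset, the intermediate files should stay far smaller, but I would welcome corrections or a better approach.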