
  • geodist: too many observations when calculating pairwise distances

    I am trying to calculate pairwise distances similar to those in the post here:

    https://www.statalist.org/forums/for...ating-distance

    The code from that post is pasted here for convenience.

    Code:
    clear
    input str10 id double year byte division float(latitude longitude)
    "000361105" 2006 3   41.9663  -87.98085
    "000361105" 2007 3   41.9663  -87.98085
    "000361105" 2008 3   41.9663  -87.98085
    "000886309" 2006 4  44.86384  -93.43001
    "000886309" 2007 4  44.86384  -93.43001
    "000886309" 2008 4  44.86384  -93.43001
    "000957100" 2006 2  40.70963  -74.01297
    "000957100" 2007 2  40.70963  -74.01297
    "000957100" 2008 2  40.70963  -74.01297
    "001084102" 2006 5   33.9743  -84.14539
    "001084102" 2007 5   33.9743  -84.14539
    "001084102" 2008 5   33.9743  -84.14539
    "00130H105" 2006 5  38.87373  -77.11727
    "00130H105" 2007 5  38.87373  -77.11727
    "00130H105" 2008 5  38.87373  -77.11727
    "001547108" 2008 3  39.34669  -84.41348
    "001744101" 2006 9 32.945763 -117.21492
    "001744101" 2007 9 32.945763 -117.21492
    "001744101" 2008 9 32.945763 -117.21492
    "00206R102" 2006 7  32.77925  -96.80477
    "00206R102" 2007 7  32.77925  -96.80477
    "00206R102" 2008 7  32.77925  -96.80477
    "002567105" 2008 9 37.603153 -122.01864
    "002824100" 2006 3  42.32403   -87.8567
    "002824100" 2007 3  42.32403   -87.8567
    "002824100" 2008 3  42.32403   -87.8567
    "002896207" 2006 3  40.08307   -82.7989
    "002896207" 2007 3  40.08307   -82.7989
    "002896207" 2008 3  40.08307   -82.7989
    "004498101" 2006 5 26.190046  -81.76478
    "004498101" 2007 5 26.190046  -81.76478
    "004498101" 2008 5 26.190046  -81.76478
    "00508Y102" 2006 5 33.799896   -84.3858
    "00508Y102" 2007 5 33.799896   -84.3858
    "00508Y102" 2008 5 33.799896   -84.3858
    "00724F101" 2006 9  37.34652 -121.90983
    "00724F101" 2007 9  37.34652 -121.90983
    "00724F101" 2008 9  37.34652 -121.90983
    "00738A106" 2006 6   34.7608  -86.68575
    "00738A106" 2007 6   34.7608  -86.68575
    "00738A106" 2008 6   34.7608  -86.68575
    "00751Y106" 2006 5 35.822914  -78.56207
    "00751Y106" 2007 5 35.822914  -78.56207
    "00751Y106" 2008 5 35.822914  -78.56207
    "00770F104" 2006 4  38.64421  -90.65141
    "00770F104" 2007 4  38.64421  -90.65141
    "00770F104" 2008 4  38.64421  -90.65141
    "007800105" 2006 9  33.91714 -118.40427
    "007800105" 2007 9  33.91714 -118.40427
    "007800105" 2008 9  33.91714 -118.40427
    "007903107" 2006 9  37.39349 -121.96467
    "007903107" 2007 9  37.39349 -121.96467
    "007903107" 2008 9  37.39349 -121.96467
    "007973100" 2006 8  40.52986 -105.03664
    "007973100" 2007 8  40.52986 -105.03664
    "007973100" 2008 8  40.52986 -105.03664
    "008190100" 2006 2  40.77724 -74.392944
    "008190100" 2007 2  40.77724 -74.392944
    "008190100" 2008 2  40.77724 -74.392944
    "00846U101" 2006 9  37.34864 -121.98438
    "00846U101" 2007 9  37.34864 -121.98438
    "00846U101" 2008 9  37.34864 -121.98438
    "009158106" 2006 2  40.58427  -75.62479
    "009158106" 2007 2  40.58427  -75.62479
    "009158106" 2008 2  40.58427  -75.62479
    "009363102" 2006 2  40.06187  -75.40246
    "009363102" 2007 2  40.06187  -75.40246
    "009363102" 2008 2  40.06187  -75.40246
    "00971T101" 2007 1    42.362  -71.08159
    "00971T101" 2008 1    42.362  -71.08159
    "011659109" 2006 9  47.44657  -122.2721
    "011659109" 2007 9  47.44657  -122.2721
    "011659109" 2008 9  47.44657  -122.2721
    "012348108" 2006 1  43.30168  -70.99177
    "012348108" 2007 1  43.30168  -70.99177
    "012348108" 2008 1  43.30168  -70.99177
    "012653101" 2006 5 35.178715  -80.85399
    "012653101" 2007 5 35.178715  -80.85399
    "012653101" 2008 5 35.178715  -80.85399
    "013078100" 2008 3  41.90414  -87.86066
    end
    
    egen group= group(year division), label
    preserve
    rename (*) (*2)
    rename group2 group
    tempfile file2
    save `file2'
    restore
    joinby group using `file2'
    bys group: gen count = _N  // egen's count() needs a numeric argument; _N is the group size
    drop if id==id2 & count>1
    
    geodist latitude longitude latitude2 longitude2, gen(dist)
    bys id year group: egen tot_dist= total(dist)
    bys id year group: gen pairs = _N  // again, _N instead of egen count() on a string
    bys id year group: gen avg_dist= tot_dist/pairs
    
    keep id year group tot_dist avg_dist division
    duplicates drop id year group, force
    My problem is that when I run

    Code:
    joinby group using `file2'
    the result has more than 2 billion observations.

    Is there a way to split up and speed up the job?

    Is it possible to:
    i) loop over the groups, calculating the total and average distance within each group (year division),
    ii) collapse the data to the group level,
    iii) save the file, and
    iv) append all the group files once every group is done?
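
    To illustrate, here is a rough, untested sketch of what I mean, where "mydata.dta" is a placeholder name for my full dataset:

    Code:
    * (i)-(iv): loop over (year, division) groups, compute distances
    * within each group, collapse, save, and append the results
    use "mydata.dta", clear
    egen group = group(year division)
    summarize group, meanonly
    local G = r(max)
    tempfile base results
    save "`base'"
    forvalues g = 1/`G' {
        * load only this group's observations
        use if group == `g' using "`base'", clear
        preserve
        rename (id latitude longitude) (id2 latitude2 longitude2)
        keep id2 latitude2 longitude2 group
        tempfile file2
        save "`file2'"
        restore
        * pair every id in the group with every other id
        joinby group using "`file2'"
        geodist latitude longitude latitude2 longitude2 if id != id2, gen(dist)
        * one observation per id within the group
        collapse (mean) avg_dist = dist (sum) tot_dist = dist, by(id year division)
        * accumulate results across groups
        if `g' > 1 append using "`results'"
        save "`results'", replace
    }
    use "`results'", clear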

    Or, if there is an easier way to do this, that is fine too. I don't know how to tackle this; it is a very unusual problem for me.

    Thanks in advance.



  • #2
    Is your problem the memory demands of the data file, or the duration of the run time? The former might be improved; the latter likely not. And is the 2e9 the size of the original data file, or the number of pairs after -joinby-? And are the distances between any two ids constant across years?



    • #3
      Mike,

      The 2e9 size is after the joinby. If the run time cannot be improved, then maybe this doesn't matter.

      I am not sure whether the file size (after joinby) will exceed the RAM of the computer. Is that a problem?

      I am not sure what you are asking in your last question: "do the distances between any two ids every vary across years?" What do you mean by "every"? Do you mean "ever"? If so, that is a smart question, and the answer is no. I am not sure how to use that fact, though.

      Thanks.



      • #4
        Yes, sorry about the typo. I fixed it, but not before the error bit you. My point was that the locations of the ids (cities? census tracts?) are presumably the same year after year. If so, you might cut the number of pairs drastically by using a data file of ids without years.

        Yes, 2e9 observations might get pretty big. Your id variable takes 10 bytes, so just the two id variables of your joined data set would require 40e9 bytes of memory. But you apparently have only something like 44e3 original observations (sqrt(2e9)), so a long variable would be sufficient to identify them. See -help datatypes-.
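
        For example, a minimal sketch of both ideas, assuming the original (pre-joinby) panel is in memory:

        Code:
        * replace the 10-byte string id with a 4-byte long
        egen long idnum = group(id)
        drop id
        * locations are constant across years, so keep one row per id;
        * pairwise distances can then be computed once per division
        * and merged back to the id-year panel afterward
        bysort idnum (year): keep if _n == 1
        drop year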



        • #5
          The joinby solution is easy to code, but if it becomes unwieldy, you can break up the problem and process each group separately. Normally this would entail a fair amount of data-management gymnastics, but you can use runby (from SSC) to handle most of the details:

          Code:
          clear all
          
          program do1group
          
              * save the current group's observations
              tempfile hold
              save "`hold'"
          
              * form all within-group pairs
              rename (id latitude longitude) (id2 latitude2 longitude2)
              drop year division
              cross using "`hold'"
          
              * distance for each pair, excluding self-pairs
              geodist latitude longitude latitude2 longitude2 if id != id2, gen(d)
              bysort id: egen avg_dist = mean(d)
          
              * keep one observation per id
              drop id2 latitude2 longitude2 d
              by id: keep if _n == 1
          
          end
          
          
          use "statalist_ex.dta", clear
          runby do1group, by(year division) status
          See help runby for more details.
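
          Note that do1group returns only avg_dist. If tot_dist is also wanted, as in the original code, adding one line right after the avg_dist line should do it:

          Code:
          bysort id: egen tot_dist = total(d)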



          • #6
            Your idea is a good one.

            I get this error (perhaps not surprisingly):

            Code:
            op. sys. refuses to provide memory
            r(909);


            Any idea how to break the job up into batches (in a loop)?



            • #7
              Sorry Robert, I posted at the same time as you and did not see your post before mine.

              Your code worked like a charm. Thanks a lot.

