
  • geodist: too many observations when calculating pairwise distances

    I am trying to calculate pairwise distances similar to those in the post here:

    https://www.statalist.org/forums/for...ating-distance

    The code from that post is pasted here for convenience.

    Code:
    clear
    input str10 id double year byte division float(latitude longitude)
    "000361105" 2006 3   41.9663  -87.98085
    "000361105" 2007 3   41.9663  -87.98085
    "000361105" 2008 3   41.9663  -87.98085
    "000886309" 2006 4  44.86384  -93.43001
    "000886309" 2007 4  44.86384  -93.43001
    "000886309" 2008 4  44.86384  -93.43001
    "000957100" 2006 2  40.70963  -74.01297
    "000957100" 2007 2  40.70963  -74.01297
    "000957100" 2008 2  40.70963  -74.01297
    "001084102" 2006 5   33.9743  -84.14539
    "001084102" 2007 5   33.9743  -84.14539
    "001084102" 2008 5   33.9743  -84.14539
    "00130H105" 2006 5  38.87373  -77.11727
    "00130H105" 2007 5  38.87373  -77.11727
    "00130H105" 2008 5  38.87373  -77.11727
    "001547108" 2008 3  39.34669  -84.41348
    "001744101" 2006 9 32.945763 -117.21492
    "001744101" 2007 9 32.945763 -117.21492
    "001744101" 2008 9 32.945763 -117.21492
    "00206R102" 2006 7  32.77925  -96.80477
    "00206R102" 2007 7  32.77925  -96.80477
    "00206R102" 2008 7  32.77925  -96.80477
    "002567105" 2008 9 37.603153 -122.01864
    "002824100" 2006 3  42.32403   -87.8567
    "002824100" 2007 3  42.32403   -87.8567
    "002824100" 2008 3  42.32403   -87.8567
    "002896207" 2006 3  40.08307   -82.7989
    "002896207" 2007 3  40.08307   -82.7989
    "002896207" 2008 3  40.08307   -82.7989
    "004498101" 2006 5 26.190046  -81.76478
    "004498101" 2007 5 26.190046  -81.76478
    "004498101" 2008 5 26.190046  -81.76478
    "00508Y102" 2006 5 33.799896   -84.3858
    "00508Y102" 2007 5 33.799896   -84.3858
    "00508Y102" 2008 5 33.799896   -84.3858
    "00724F101" 2006 9  37.34652 -121.90983
    "00724F101" 2007 9  37.34652 -121.90983
    "00724F101" 2008 9  37.34652 -121.90983
    "00738A106" 2006 6   34.7608  -86.68575
    "00738A106" 2007 6   34.7608  -86.68575
    "00738A106" 2008 6   34.7608  -86.68575
    "00751Y106" 2006 5 35.822914  -78.56207
    "00751Y106" 2007 5 35.822914  -78.56207
    "00751Y106" 2008 5 35.822914  -78.56207
    "00770F104" 2006 4  38.64421  -90.65141
    "00770F104" 2007 4  38.64421  -90.65141
    "00770F104" 2008 4  38.64421  -90.65141
    "007800105" 2006 9  33.91714 -118.40427
    "007800105" 2007 9  33.91714 -118.40427
    "007800105" 2008 9  33.91714 -118.40427
    "007903107" 2006 9  37.39349 -121.96467
    "007903107" 2007 9  37.39349 -121.96467
    "007903107" 2008 9  37.39349 -121.96467
    "007973100" 2006 8  40.52986 -105.03664
    "007973100" 2007 8  40.52986 -105.03664
    "007973100" 2008 8  40.52986 -105.03664
    "008190100" 2006 2  40.77724 -74.392944
    "008190100" 2007 2  40.77724 -74.392944
    "008190100" 2008 2  40.77724 -74.392944
    "00846U101" 2006 9  37.34864 -121.98438
    "00846U101" 2007 9  37.34864 -121.98438
    "00846U101" 2008 9  37.34864 -121.98438
    "009158106" 2006 2  40.58427  -75.62479
    "009158106" 2007 2  40.58427  -75.62479
    "009158106" 2008 2  40.58427  -75.62479
    "009363102" 2006 2  40.06187  -75.40246
    "009363102" 2007 2  40.06187  -75.40246
    "009363102" 2008 2  40.06187  -75.40246
    "00971T101" 2007 1    42.362  -71.08159
    "00971T101" 2008 1    42.362  -71.08159
    "011659109" 2006 9  47.44657  -122.2721
    "011659109" 2007 9  47.44657  -122.2721
    "011659109" 2008 9  47.44657  -122.2721
    "012348108" 2006 1  43.30168  -70.99177
    "012348108" 2007 1  43.30168  -70.99177
    "012348108" 2008 1  43.30168  -70.99177
    "012653101" 2006 5 35.178715  -80.85399
    "012653101" 2007 5 35.178715  -80.85399
    "012653101" 2008 5 35.178715  -80.85399
    "013078100" 2008 3  41.90414  -87.86066
    end
    
    egen group= group(year division), label
    preserve
    rename (*) (*2)
    rename group2 group
    tempfile file2
    save `file2'
    restore
    joinby group using `file2'
    bys group: gen count = _N  // egen's count() needs a numeric argument; _N is the group size
    drop if id==id2 & count>1
    
    geodist latitude longitude latitude2 longitude2, gen(dist)
    bys id year group: egen tot_dist= total(dist)
    bys id year group: gen pairs = _N  // again, _N instead of egen count() on a string
    bys id year group: gen avg_dist= tot_dist/pairs
    
    keep id year group tot_dist avg_dist division
    duplicates drop id year group, force
    My problem is that when I run

    Code:
    joinby group using `file2'
    the result has more than 2 billion observations.

    Is there a way to split up and speed up the job?

    Is it possible to:
    i) loop over the groups, calculating the total and average distance within each group (year division),
    ii) collapse the data to the group level,
    iii) save the file, and
    iv) append all the group files once every group is done?
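
    To illustrate, here is a rough, untested sketch of what I mean, where "mydata.dta" is a placeholder name for my full dataset:

    Code:
    * (i)-(iv): loop over (year, division) groups, compute distances
    * within each group, collapse, save, and append the results
    use "mydata.dta", clear
    egen group = group(year division)
    summarize group, meanonly
    local G = r(max)
    tempfile base results
    save "`base'"
    forvalues g = 1/`G' {
        * load only this group's observations
        use if group == `g' using "`base'", clear
        preserve
        rename (id latitude longitude) (id2 latitude2 longitude2)
        keep id2 latitude2 longitude2 group
        tempfile file2
        save "`file2'"
        restore
        * pair every id in the group with every other id
        joinby group using "`file2'"
        geodist latitude longitude latitude2 longitude2 if id != id2, gen(dist)
        * one observation per id within the group
        collapse (mean) avg_dist = dist (sum) tot_dist = dist, by(id year division)
        * accumulate results across groups
        if `g' > 1 append using "`results'"
        save "`results'", replace
    }
    use "`results'", clear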

    Or, if there is an easier way to do this, that is fine too. I don't know how to tackle this; it is a very unusual problem for me.

    Thanks in advance.



  • #2
    Is your problem the memory demands of the data file, or the duration of the run time? The former might be improved; the latter likely not. And is the 2e9 the size of the original data file, or the number of pairs after -joinby-? And are the distances between any two ids constant across years?



    • #3
      Mike,

      The 2e9 size is after the joinby. If the run time cannot be improved, then maybe this doesn't matter.

      I am not sure whether the file size (after joinby) will exceed the RAM of the computer. Is that a problem?

      I am not sure what you are asking in your last question: "do the distances between any two ids every vary across years?" What do you mean by "every"? Do you mean "ever"? If so, that is a smart question, and the answer is no. I am not sure how to use that fact, though.

      Thanks.



      • #4
        Yes, sorry about the typo. I fixed it, but not before the error bit you. My point was that the locations of the ids (cities? census tracts?) are presumably the same year after year. If so, you might cut the number of pairs drastically by using a data file of ids without years.

        Yes, 2e9 observations might get pretty big. Your id variable takes 10 bytes, so just the two id variables of your joined data set would require 40e9 bytes of memory. But you apparently have only something like 44e3 original observations (sqrt(2e9)), so a long variable would be sufficient to identify them. See -help datatypes-.
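
        For example, a minimal sketch of both ideas, assuming the original (pre-joinby) panel is in memory:

        Code:
        * replace the 10-byte string id with a 4-byte long
        egen long idnum = group(id)
        drop id
        * locations are constant across years, so keep one row per id;
        * pairwise distances can then be computed once per division
        * and merged back to the id-year panel afterward
        bysort idnum (year): keep if _n == 1
        drop year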



        • #5
          The joinby solution is easy to code, but if it becomes unwieldy, you can break up the problem and process each group separately. Normally this would entail a fair amount of data-management gymnastics, but you can use runby (from SSC) to handle most of the details:

          Code:
          clear all
          
          program do1group
          
              * save the current group's observations
              tempfile hold
              save "`hold'"
          
              * form all within-group pairs
              rename (id latitude longitude) (id2 latitude2 longitude2)
              drop year division
              cross using "`hold'"
          
              * distance for each pair, excluding self-pairs
              geodist latitude longitude latitude2 longitude2 if id != id2, gen(d)
              bysort id: egen avg_dist = mean(d)
          
              * keep one observation per id
              drop id2 latitude2 longitude2 d
              by id: keep if _n == 1
          
          end
          
          
          use "statalist_ex.dta", clear
          runby do1group, by(year division) status
          See help runby for more details.
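
          Note that do1group returns only avg_dist. If tot_dist is also wanted, as in the original code, adding one line right after the avg_dist line should do it:

          Code:
          bysort id: egen tot_dist = total(d)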



          • #6
            Your idea is a good one.

            I get this error (perhaps not surprisingly):

            Code:
            op. sys. refuses to provide memory
            r(909);


            Any idea how to break the job up into batches (in a loop)?



            • #7
              Sorry Robert, I posted at the same time as you and did not see your post before mine.

              Your code worked like a charm. Thanks a lot.

