Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • distance computation using geodist package

    I have a simple data set with longitudes and latitudes for every observation. I wish to compute distances so I can get distances between households of less than 1 KM neighborhood for a simple analysis on spill over effects of a program intervention. Here is a subsample of the data.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double(household_id longitude latitude age) byte gender float(treat spillover)
    1377812168   34.37838    -.0773 56 2 0 0
    2380826616   34.33859   -.16009 44 1 0 0
    1565866906   34.34427   -.18608 49 2 0 0
    1380843870   34.32025   -.17002 35 2 0 0
    2477843882 34.2900575 -.3020293 21 2 0 0
    2471770455 34.3183856 -.1710468 53 1 0 0
    2677826567   34.32582    -.1656 32 2 0 0
    2780838123 34.2978333 -.2856966 50 1 0 0
    2477874129 34.3183612 -.2013775 64 2 0 0
    2682855367 34.3398756   -.29708 25 2 0 0
    2973803553 34.2818906 -.2580746 34 2 0 0
    1287876988   34.29492   -.16718 36 1 0 0
    2386773330 34.3095594 -.1632476 28 2 0 0
    2778865478  34.323366   -.13397 69 1 0 0
    1768766098 34.3550887 -.1689325 27 2 0 0
    1082861160 34.3528961 -.3369288 25 2 0 0
    2366810741   34.29499   -.18036 34 1 0 0
    1376872680 34.2881262 -.3954781 27 2 0 0
    1766897177 34.2652793 -.3464681 21 1 0 0
    1382836658   34.29842   -.28795 44 1 0 0
    end
    label values longitude dkrf
    label values latitude dkrf
    label values age dkrf
    label values gender gender
    label def gender 1 "1. male", modify
    label def gender 2 "2. female", modify
    I don't seem to know how to generate this variable containing shortest distances from HH i to j which are obviously neighbors. I am using the
    Code:
    geodist
    package due to Robert Picard (Highly appreciated). Could someone assist please?

  • #2
    If the dataset is not too large, then the easiest is to form all pairwise combinations of households and calculate the distances between them. You the reduce the data to those that are within 1km. Note that the distance to self is zero. If there are more than a few thousand households, then use geonear (also from SSC).

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double(household_id longitude latitude age) byte gender float(treat spillover)
    1377812168   34.37838    -.0773 56 2 0 0
    2380826616   34.33859   -.16009 44 1 0 0
    1565866906   34.34427   -.18608 49 2 0 0
    1380843870   34.32025   -.17002 35 2 0 0
    2477843882 34.2900575 -.3020293 21 2 0 0
    2471770455 34.3183856 -.1710468 53 1 0 0
    2677826567   34.32582    -.1656 32 2 0 0
    2780838123 34.2978333 -.2856966 50 1 0 0
    2477874129 34.3183612 -.2013775 64 2 0 0
    2682855367 34.3398756   -.29708 25 2 0 0
    2973803553 34.2818906 -.2580746 34 2 0 0
    1287876988   34.29492   -.16718 36 1 0 0
    2386773330 34.3095594 -.1632476 28 2 0 0
    2778865478  34.323366   -.13397 69 1 0 0
    1768766098 34.3550887 -.1689325 27 2 0 0
    1082861160 34.3528961 -.3369288 25 2 0 0
    2366810741   34.29499   -.18036 34 1 0 0
    1376872680 34.2881262 -.3954781 27 2 0 0
    1766897177 34.2652793 -.3464681 21 1 0 0
    1382836658   34.29842   -.28795 44 1 0 0
    end
    label values longitude dkrf
    label values latitude dkrf
    label values age dkrf
    label values gender gender
    label def gender 1 "1. male", modify
    label def gender 2 "2. female", modify
    
    save "statalist_data.dta", replace
    
    * form all pairwise combinations of household
    rename * =0
    cross using "statalist_data.dta"
    
    * distance between each combination
    geodist latitude longitude latitude0 longitude0, gen(d)
    
    * reduce to combinations that are within 1km
    bysort household_id (d) household_id: keep if d < 1
    
    format %12.0f house*
    list household_id d household_id0, sepby(household_id)


    Comment


    • #3
      Many thanks to you Robert. Most appreciated. My sample had only 956 observations and thus the matrix of distances was quickly converging. Regards.

      Comment


      • #4
        Originally posted by Robert Picard View Post
        If the dataset is not too large, then the easiest is to form all pairwise combinations of households and calculate the distances between them. You the reduce the data to those that are within 1km. Note that the distance to self is zero. If there are more than a few thousand households, then use geonear (also from SSC).

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input double(household_id longitude latitude age) byte gender float(treat spillover)
        1377812168 34.37838 -.0773 56 2 0 0
        2380826616 34.33859 -.16009 44 1 0 0
        1565866906 34.34427 -.18608 49 2 0 0
        1380843870 34.32025 -.17002 35 2 0 0
        2477843882 34.2900575 -.3020293 21 2 0 0
        2471770455 34.3183856 -.1710468 53 1 0 0
        2677826567 34.32582 -.1656 32 2 0 0
        2780838123 34.2978333 -.2856966 50 1 0 0
        2477874129 34.3183612 -.2013775 64 2 0 0
        2682855367 34.3398756 -.29708 25 2 0 0
        2973803553 34.2818906 -.2580746 34 2 0 0
        1287876988 34.29492 -.16718 36 1 0 0
        2386773330 34.3095594 -.1632476 28 2 0 0
        2778865478 34.323366 -.13397 69 1 0 0
        1768766098 34.3550887 -.1689325 27 2 0 0
        1082861160 34.3528961 -.3369288 25 2 0 0
        2366810741 34.29499 -.18036 34 1 0 0
        1376872680 34.2881262 -.3954781 27 2 0 0
        1766897177 34.2652793 -.3464681 21 1 0 0
        1382836658 34.29842 -.28795 44 1 0 0
        end
        label values longitude dkrf
        label values latitude dkrf
        label values age dkrf
        label values gender gender
        label def gender 1 "1. male", modify
        label def gender 2 "2. female", modify
        
        save "statalist_data.dta", replace
        
        * form all pairwise combinations of household
        rename * =0
        cross using "statalist_data.dta"
        
        * distance between each combination
        geodist latitude longitude latitude0 longitude0, gen(d)
        
        * reduce to combinations that are within 1km
        bysort household_id (d) household_id: keep if d < 1
        
        format %12.0f house*
        list household_id d household_id0, sepby(household_id)

        Hi Robert,
        Thank you for this post. I had a similar issue with George but I have panel data.
        I tried running your code:

        Code:
        * form all pairwise combinations of household rename * =0 cross using "statalist_data.dta"
        and it seems that my data expand to (i*i*t=The number of household*The number of household*The number of time periods) number of observations so Stata couldn't handle this since the number of observations becomes too large.
        How do I overcome this issue when I have panel data?
        Specifically, all I have to do is to calculate the distance between each id-pair like what George wanted to have.
        But how do I create all unit-pairs without considering time dimension?


        Thank you in advance.

        Best,
        Ryan


        Comment


        • #5
          Hard to give advice without an example of your data and additional details on what you want. Do the locations change through time?

          Comment


          • #6
            Dear Robert,

            I'm sorry for the vagueness.
            No, the locations are the same over the time periods (time-invariant).

            If I use an example data:

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input int(id year) float(lat lng) int price byte treatment
            100 2000 45.49153 -73.583595  57 1
            100 2001 45.49153 -73.583595  53 1
            100 2002 45.49153 -73.583595  22 1
            101 2000 45.50452 -73.563095 193 0
            101 2001 45.50452 -73.563095 111 0
            101 2002 45.50452 -73.563095 133 0
            102 2001 45.47295 -73.617516 237 1
            102 2002 45.47295 -73.617516 256 1
            103 2002   45.493  -73.55708 122 0
            103 2003   45.493  -73.55708 333 0
            end
            and then run the below codes,

            Code:
            rename * =0
            cross using "test.dta"
            geodist lat lng lat0 lng0, gen(d)
            
            dataex in 1/30

            The output looks like this (Stata generates all the possible ID*Year pairs so that the number of observations enormously increase in panel data setting):

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input int(id0 year0) float(lat0 lng0) int price0 byte treatment0 int(id year) float(lat lng) int price byte treatment double d
            100 2000 45.49153 -73.583595  57 1 100 2000 45.49153 -73.583595  57 1                  0
            100 2001 45.49153 -73.583595  53 1 100 2000 45.49153 -73.583595  57 1                  0
            100 2002 45.49153 -73.583595  22 1 100 2000 45.49153 -73.583595  57 1                  0
            101 2000 45.50452 -73.563095 193 0 100 2000 45.49153 -73.583595  57 1 2.1570043536081713
            101 2001 45.50452 -73.563095 111 0 100 2000 45.49153 -73.583595  57 1 2.1570043536081713
            101 2002 45.50452 -73.563095 133 0 100 2000 45.49153 -73.583595  57 1 2.1570043536081713
            102 2001 45.47295 -73.617516 237 1 100 2000 45.49153 -73.583595  57 1  3.361489722404731
            102 2002 45.47295 -73.617516 256 1 100 2000 45.49153 -73.583595  57 1  3.361489722404731
            103 2002   45.493  -73.55708 122 0 100 2000 45.49153 -73.583595  57 1  2.078806691172746
            103 2003   45.493  -73.55708 333 0 100 2000 45.49153 -73.583595  57 1  2.078806691172746
            100 2000 45.49153 -73.583595  57 1 100 2001 45.49153 -73.583595  53 1                  0
            100 2000 45.49153 -73.583595  57 1 100 2002 45.49153 -73.583595  22 1                  0
            100 2000 45.49153 -73.583595  57 1 101 2000 45.50452 -73.563095 193 0 2.1570043536084014
            100 2000 45.49153 -73.583595  57 1 101 2001 45.50452 -73.563095 111 0 2.1570043536084014
            100 2000 45.49153 -73.583595  57 1 101 2002 45.50452 -73.563095 133 0 2.1570043536084014
            100 2000 45.49153 -73.583595  57 1 102 2001 45.47295 -73.617516 237 1  3.361489722404724
            100 2000 45.49153 -73.583595  57 1 102 2002 45.47295 -73.617516 256 1  3.361489722404724
            100 2000 45.49153 -73.583595  57 1 103 2002   45.493  -73.55708 122 0 2.0788066911727476
            100 2000 45.49153 -73.583595  57 1 103 2003   45.493  -73.55708 333 0 2.0788066911727476
            100 2001 45.49153 -73.583595  53 1 100 2001 45.49153 -73.583595  53 1                  0
            100 2001 45.49153 -73.583595  53 1 100 2002 45.49153 -73.583595  22 1                  0
            100 2001 45.49153 -73.583595  53 1 101 2000 45.50452 -73.563095 193 0 2.1570043536084014
            100 2001 45.49153 -73.583595  53 1 101 2001 45.50452 -73.563095 111 0 2.1570043536084014
            100 2001 45.49153 -73.583595  53 1 101 2002 45.50452 -73.563095 133 0 2.1570043536084014
            100 2001 45.49153 -73.583595  53 1 102 2001 45.47295 -73.617516 237 1  3.361489722404724
            100 2001 45.49153 -73.583595  53 1 102 2002 45.47295 -73.617516 256 1  3.361489722404724
            100 2001 45.49153 -73.583595  53 1 103 2002   45.493  -73.55708 122 0 2.0788066911727476
            100 2001 45.49153 -73.583595  53 1 103 2003   45.493  -73.55708 333 0 2.0788066911727476
            100 2002 45.49153 -73.583595  22 1 100 2001 45.49153 -73.583595  53 1                  0
            100 2002 45.49153 -73.583595  22 1 100 2002 45.49153 -73.583595  22 1                  0
            end
            If I try to run the codes using my actual data set, Stata gave me 'op. sys. refuses to provide memory'.
            I assume that the number of observations of expanded data exceed the number of observations that Stata can handle.
            I tried to allocate more memory to Stata, reboot the server, close other programs to increase memory capacity but still had the same issue (The server I use has 32GB memory & Stata MP).
            For the brief description of my data set (just in case), it's panel data with 5,062 units, T=51, and the number of observations is 234,844.

            So my question is:
            Is there any way to form ID pairs, not ID*Time pairs, in panel data setting?

            Please let me know if you're confused with any part.

            Best,
            Ryan

            Comment


            • #7
              Thanks for the data example (and kudos for using dataex). Since the locations do not change over time, you can reduce the data to one observation per id and form all pairwise combinations using cross. With 5K units, that's 25M distances; it's doable but you are starting to push the limit of what can done with the brute force approach.

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input int(id year) float(lat lng) int price byte treatment
              100 2000 45.49153 -73.583595  57 1
              100 2001 45.49153 -73.583595  53 1
              100 2002 45.49153 -73.583595  22 1
              101 2000 45.50452 -73.563095 193 0
              101 2001 45.50452 -73.563095 111 0
              101 2002 45.50452 -73.563095 133 0
              102 2001 45.47295 -73.617516 237 1
              102 2002 45.47295 -73.617516 256 1
              103 2002   45.493  -73.55708 122 0
              103 2003   45.493  -73.55708 333 0
              end
              save "dataex.dta", replace
              
              * reduce to one observation per id, make sure that lat/lon does not change
              bysort id lat lng: keep if _n == 1
              by id: assert _n == 1
              keep id lat lng treatment
              save "dataex1.dta", replace
              
              * form all pairwise combinations and calculate distances
              rename * =0
              cross using "dataex1.dta"
              drop if id == id0
              geodist lat lng lat0 lng0, gen(d) sphere
              
              sort id0 d id
              list id0 treatment0 d id treatment, sepby(id0)
              and the results:
              Code:
              . list id0 treatment0 d id treatment, sepby(id0)
              
                   +---------------------------------------------+
                   | id0   treatm~0           d    id   treatm~t |
                   |---------------------------------------------|
                1. | 100          1   2.0730181   103          0 |
                2. | 100          1   2.1538295   101          0 |
                3. | 100          1   3.3559457   102          1 |
                   |---------------------------------------------|
                4. | 101          0   1.3640156   103          0 |
                5. | 101          0   2.1538295   100          1 |
                6. | 101          0   5.5063664   102          1 |
                   |---------------------------------------------|
                7. | 102          1   3.3559457   100          1 |
                8. | 102          1   5.2122556   103          0 |
                9. | 102          1   5.5063664   101          0 |
                   |---------------------------------------------|
               10. | 103          0   1.3640156   101          0 |
               11. | 103          0   2.0730181   100          1 |
               12. | 103          0   5.2122556   102          1 |
                   +---------------------------------------------+
              
              .
              You do not give a hint of where you are going with this so it's hard to give further advice but assuming that you just want the nearest unit, you can use geonear (from SSC) as follows:
              Code:
              use "dataex1.dta", clear
              rename id id0
              geonear id0 lat lng using "dataex1.dta", n(id lat lng) ignoreself
              list
              and the results:
              Code:
              . list
              
                   +---------------------------------------------------------+
                   | id0        lat         lng   treatm~t   nid   km_to_nid |
                   |---------------------------------------------------------|
                1. | 100   45.49153    -73.5836          1   103   2.0730181 |
                2. | 101   45.50452    -73.5631          0   103   1.3640156 |
                3. | 102   45.47295   -73.61752          1   100   3.3559457 |
                4. | 103     45.493   -73.55708          0   101   1.3640156 |
                   +---------------------------------------------------------+
              
              .
              Once you have the information you want at the unit-level, you can merge it back with the original panel data.

              Comment


              • #8
                I appreciate your reply.

                What I want to have in the end is just the distance var and the value for every ID pair (I'm going to use the distance var as one of my vars of interest; the distance between the focal unit i and other unit(s) j) .

                Based on your comments, Below is what I've done.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input int(id year) float(lat lng) int price byte treatment
                100 2000 45.49153 -73.583595  57 1
                100 2001 45.49153 -73.583595  53 1
                100 2002 45.49153 -73.583595  22 1
                101 2000 45.50452 -73.563095 193 0
                101 2001 45.50452 -73.563095 111 0
                101 2002 45.50452 -73.563095 133 0
                102 2001 45.47295 -73.617516 237 1
                102 2002 45.47295 -73.617516 256 1
                103 2002   45.493  -73.55708 122 0
                103 2003   45.493  -73.55708 333 0
                end
                Code:
                save "dataex.dta", replace
                
                bysort id lat lng: keep if _n == 1
                by id: assert _n == 1
                keep id lat lng treatment
                
                save "dataex1.dta", replace
                
                rename * =0
                cross using "dataex1.dta"
                geodist lat lng lat0 lng0, gen(d)
                sort id0 id
                and then I'll get the result:

                Code:
                 list
                
                     +---------------------------------------------------------------------------------------+
                     | id0      lat0       lng0   treatm~0    id       lat        lng   treatm~t           d |
                     |---------------------------------------------------------------------------------------|
                  1. | 100   45.4915   -73.5836          1   100   45.4915   -73.5836          1           0 |
                  2. | 100   45.4915   -73.5836          1   101   45.5045   -73.5631          0   2.1570044 |
                  3. | 100   45.4915   -73.5836          1   102   45.4729   -73.6175          1   3.3614897 |
                  4. | 100   45.4915   -73.5836          1   103    45.493   -73.5571          0   2.0788067 |
                  5. | 101   45.5045   -73.5631          0   100   45.4915   -73.5836          1   2.1570044 |
                     |---------------------------------------------------------------------------------------|
                  6. | 101   45.5045   -73.5631          0   101   45.5045   -73.5631          0           0 |
                  7. | 101   45.5045   -73.5631          0   102   45.4729   -73.6175          1   5.5150794 |
                  8. | 101   45.5045   -73.5631          0   103    45.493   -73.5571          0   1.3646901 |
                  9. | 102   45.4729   -73.6175          1   100   45.4915   -73.5836          1   3.3614897 |
                 10. | 102   45.4729   -73.6175          1   101   45.5045   -73.5631          0   5.5150794 |
                     |---------------------------------------------------------------------------------------|
                 11. | 102   45.4729   -73.6175          1   102   45.4729   -73.6175          1           0 |
                 12. | 102   45.4729   -73.6175          1   103    45.493   -73.5571          0   5.2238443 |
                 13. | 103    45.493   -73.5571          0   100   45.4915   -73.5836          1   2.0788067 |
                 14. | 103    45.493   -73.5571          0   101   45.5045   -73.5631          0   1.3646901 |
                 15. | 103    45.493   -73.5571          0   102   45.4729   -73.6175          1   5.2238443 |
                     |---------------------------------------------------------------------------------------|
                 16. | 103    45.493   -73.5571          0   103    45.493   -73.5571          0           0 |
                     +---------------------------------------------------------------------------------------+
                save this file in a different name and then load the original data to merge each other.

                Code:
                save "dataex1_crossed.dta"
                use "dataex.dta", clear
                merge m:m id using "dataex1_crossed.dta"
                
                sort id year id0
                The result is:

                Code:
                list
                
                     +--------------------------------------------------------------------------------------------------------------------+
                     |  id   year       lat        lng   price   treatm~t   id0      lat0       lng0   treatm~0           d        _merge |
                     |--------------------------------------------------------------------------------------------------------------------|
                  1. | 100   2000   45.4915   -73.5836      57          1   103    45.493   -73.5571          0   2.0788067   matched (3) |
                  2. | 100   2001   45.4915   -73.5836      53          1   100   45.4915   -73.5836          1           0   matched (3) |
                  3. | 100   2002   45.4915   -73.5836      22          1   101   45.5045   -73.5631          0   2.1570044   matched (3) |
                  4. | 100   2002   45.4915   -73.5836      22          1   102   45.4729   -73.6175          1   3.3614897   matched (3) |
                  5. | 101   2000   45.5045   -73.5631     193          0   102   45.4729   -73.6175          1   5.5150794   matched (3) |
                     |--------------------------------------------------------------------------------------------------------------------|
                  6. | 101   2001   45.5045   -73.5631     111          0   100   45.4915   -73.5836          1   2.1570044   matched (3) |
                  7. | 101   2002   45.5045   -73.5631     133          0   101   45.5045   -73.5631          0           0   matched (3) |
                  8. | 101   2002   45.5045   -73.5631     133          0   103    45.493   -73.5571          0   1.3646901   matched (3) |
                  9. | 102   2001   45.4729   -73.6175     237          1   102   45.4729   -73.6175          1           0   matched (3) |
                 10. | 102   2002   45.4729   -73.6175     256          1   100   45.4915   -73.5836          1   3.3614897   matched (3) |
                     |--------------------------------------------------------------------------------------------------------------------|
                 11. | 102   2002   45.4729   -73.6175     256          1   101   45.5045   -73.5631          0   5.5150794   matched (3) |
                 12. | 102   2002   45.4729   -73.6175     256          1   103    45.493   -73.5571          0   5.2238443   matched (3) |
                 13. | 103   2002    45.493   -73.5571     122          0   102   45.4729   -73.6175          1   5.2238443   matched (3) |
                 14. | 103   2003    45.493   -73.5571     333          0   100   45.4915   -73.5836          1   2.0788067   matched (3) |
                 15. | 103   2003    45.493   -73.5571     333          0   101   45.5045   -73.5631          0   1.3646901   matched (3) |
                     |--------------------------------------------------------------------------------------------------------------------|
                 16. | 103   2003    45.493   -73.5571     333          0   103    45.493   -73.5571          0           0   matched (3) |
                     +--------------------------------------------------------------------------------------------------------------------+
                After all the steps, I see that there are all the possible ID pairs and I didn't lose any original ID*Year observations.
                However, I noticed that I couldn't declare this data set as panel in Stata by using
                Code:
                xtset id year
                since there exist duplicates.
                If I drop the duplicates to set the data as panel, I will lose some pairwise combinations (some observations).
                Is there any way to overcome this issue?

                Best,
                Ryan
                Last edited by Ryan Kim; 12 Sep 2018, 18:55.

                Comment

                Working...
                X