Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Finding minimum distance between counties

    Hello,

    My dataset has three columns. In column 1 I have county1 which is a set of a fixed number of counties ( 100) out of all the census counties. Column 3 is county2 which has all the counties from the census, unlike the county1 column. And, in column 2 I have the distance between the fixed number of counties in a column of county1 and all the counties from the census in county2.

    My goal is to find the minimum distance of the unique county present in column 3 from any of the fixed counties in column 1.

    Like in the 3rd and 4th line of data in the county2 column the county 10001 shows up twice and the minimum distance for that county is 30.357154190846018 from the county 10005 in the county1 column. County 12121 and 12125 also show up twice in the county2 column.

    I want my desired variable to show up the minimum distance for each unique county from the county2 column except for those counties which are present in the county1 column ( 100 fixed counties in county1 column). Additionally, in that desired variable I want the value of that variable 10 for the counties present in the county1 column.

    May I have some guidance how I can do it? Honestly, given my experience, this task is out of my league. Would definitely appreciate some suggestions.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long county1 double mi_to_county long county2
    10001 30.357154190846018 10005
    10001  33.93271490946873 10003
    10003  33.93271490946873 10001
    10005 30.357154190846018 10001
    12001  21.94139189527779 12007
    12001 26.160991422024676 12125
    12001 26.527847890310913 12041
    12001  36.84643076250594 12019
    12001  37.26131475453462 12075
    12001  37.32388476194004 12107
    12001  37.74940321571971 12083
    12001  40.91842391588499 12023
    12001 44.941311511693854 12003
    12001  50.78094710678782 12029
    12001  52.02776446747155 12121
    12001  53.80338887869782 12067
    12001  58.37400366291111 12017
    12003  19.06133167684259 12125
    12003 20.428353446347472 12023
    12003  27.31084808096766 12007
    12003  32.92368786810345 13049
    12003  35.31646691638998 12019
    12003  37.44802004706887 12089
    12003 39.019664192183974 12031
    12003  40.33782725115972 12047
    12003  41.39883902283289 13101
    12003  42.25346062002965 12121
    12003 44.941311511693854 12001
    12003  47.39732770502037 13065
    12003  50.68954033379425 13299
    12003  50.94956183265025 12041
    12003  56.91760482788743 13039
    end
    Last edited by Tariq Abdullah; 12 Jul 2022, 18:36.

  • #2
    I am not entirely sure I have understood what you want. But I think it is this:
    Code:
    by county2 (mi_to_county), sort: gen wanted = mi_to_county[1]
    
    rangestat (count) in_county1 = mi_to_county, interval(county1 county2 county2)
    
    replace in_county1 = min(1, max(in_county1, 0))
    
    replace wanted = 10 if in_county1
    Note: -rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer. It is available from SSC.

    Comment


    • #3
      Sincerest form of gratitude for articulating such a complicated problem in a way so that a naive coder like me understands it. Much obliged again for such a wonderful and meaningful contribution!

      Comment


      • #4

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input long county1 double mi_to_county long county2
        47065 59.098535970592735  1049
        47065 51.707425409947035  1071
        22017  58.21215376963558  5027
        47157 23.789068473129394  5035
        47157     50.12913940926  5037
        22017  48.45715761171868  5073
        47157  57.29801020252582  5077
        22017  50.33350640401567  5091
        47157 41.246427024792865  5093
        47157  51.62385708227437  5111
        47157 49.278375962974096  5123
        36081  46.40304309204033  9001
        36071  50.22042149924045  9001
        34013 55.046735839161805  9001
        44003  51.81071876082283  9007
        44003  30.52177924192761  9011
        44003  41.32820154198425  9013
        44003 23.659400946894422  9015
        24003  57.19865840354873 10001
        42101  40.40149007443556 10003
        24003 25.296911561215097 11001
        end
        As much as I appreciate the help, the issue arises after implementing the code that the county 10001 in county2 column is giving the minimum distance as 57.19865840354873 miles which doesn't reflect the data if you look at the 3rd or 4th line of my #1 post where both the distance from county 10001 are respective 33 and 30 miles. So both of them are less than 57.19865840354873 miles - we've got here.

        Would you be kind enough to reflect on that? Because, the way I was expecting the answer is my desired variable should have given me 30.357154190846018 in case of county 10001 in column2.
        Last edited by Tariq Abdullah; 12 Jul 2022, 19:42.

        Comment


        • #5
          Well, if you run the code on just the example data shown in #4, you indeed get 57.198... But that is correct! Because if you just run that data, Stata has no way to know that 10001 also appears with observations in the data example in #1 that have shorter distances. But even the 30.357... is not the right answer if you put all the data together, because 10001 also appears as a value of the variable county1 when the data from #1 is taken into account, so by your rule "Additionally, in that desired variable I want the value of that variable 10 for the counties present in the county1 column" the correct answer is 10. If you run the data from both examples together, that is what you get:

          Code:
          . * Example generated by -dataex-. For more info, type help dataex
          . clear
          
          . input long county1 double mi_to_county long county2
          
                    county1  mi_to_co~y       county2
            1. 10001 30.357154190846018 10005
            2. 10001  33.93271490946873 10003
            3. 10003  33.93271490946873 10001
            4. 10005 30.357154190846018 10001
            5. 12001  21.94139189527779 12007
            6. 12001 26.160991422024676 12125
            7. 12001 26.527847890310913 12041
            8. 12001  36.84643076250594 12019
            9. 12001  37.26131475453462 12075
           10. 12001  37.32388476194004 12107
           11. 12001  37.74940321571971 12083
           12. 12001  40.91842391588499 12023
           13. 12001 44.941311511693854 12003
           14. 12001  50.78094710678782 12029
           15. 12001  52.02776446747155 12121
           16. 12001  53.80338887869782 12067
           17. 12001  58.37400366291111 12017
           18. 12003  19.06133167684259 12125
           19. 12003 20.428353446347472 12023
           20. 12003  27.31084808096766 12007
           21. 12003  32.92368786810345 13049
           22. 12003  35.31646691638998 12019
           23. 12003  37.44802004706887 12089
           24. 12003 39.019664192183974 12031
           25. 12003  40.33782725115972 12047
           26. 12003  41.39883902283289 13101
           27. 12003  42.25346062002965 12121
           28. 12003 44.941311511693854 12001
           29. 12003  47.39732770502037 13065
           30. 12003  50.68954033379425 13299
           31. 12003  50.94956183265025 12041
           32. 12003  56.91760482788743 13039
           33. 47065 59.098535970592735  1049
           34. 47065 51.707425409947035  1071
           35. 22017  58.21215376963558  5027
           36. 47157 23.789068473129394  5035
           37. 47157     50.12913940926  5037
           38. 22017  48.45715761171868  5073
           39. 47157  57.29801020252582  5077
           40. 22017  50.33350640401567  5091
           41. 47157 41.246427024792865  5093
           42. 47157  51.62385708227437  5111
           43. 47157 49.278375962974096  5123
           44. 36081  46.40304309204033  9001
           45. 36071  50.22042149924045  9001
           46. 34013 55.046735839161805  9001
           47. 44003  51.81071876082283  9007
           48. 44003  30.52177924192761  9011
           49. 44003  41.32820154198425  9013
           50. 44003 23.659400946894422  9015
           51. 24003  57.19865840354873 10001
           52. 42101  40.40149007443556 10003
           53. 24003 25.296911561215097 11001
           54. end
          
          .
          .
          . by county2 (mi_to_county), sort: gen wanted = mi_to_county[1]
          
          .
          . rangestat (count) in_county1 = mi_to_county, interval(county1 county2 county2)
          
          .
          . replace in_county1 = min(1, max(in_county1, 0))
          (50 real changes made)
          
          .
          . replace wanted = 10 if in_county1
          (8 real changes made)
          
          .
          . list if county2 == 10001
          
               +---------------------------------------------------+
               | county1   mi_to_c~y   county2   wanted   in_cou~1 |
               |---------------------------------------------------|
           19. |   10005   30.357154     10001       10          1 |
           20. |   10003   33.932715     10001       10          1 |
           21. |   24003   57.198658     10001       10          1 |
               +---------------------------------------------------+
          The problem you have stated requires Stata to consider the relationship of each value of county2 with every value in county1. So the code will only work correctly if run on the complete data set all at once.
          Last edited by Clyde Schechter; 12 Jul 2022, 20:29.

          Comment


          • #6
            my apologies! You are absolutely right. I just made a mistake. Again, sincerest apology for my silly mistake. Appreciate your patience !

            Comment


            • #7
              Since I need to match it with my master data I need one column of the county to be unique. Therefore, I need each county in column 3 to be unique ( to appear once ). So, in the following data after following your suggestion of coding I'm aiming to get the lowest distance for each unique county present in column3 ( county2 column).

              For 1003 county in column 2 I need to find the lowest distance 26.82 and let the other observation drop where it shows the distance to be 53.81 ish.

              Looking forward to hearing the suggestion at your convenience.

              ----------------------- copy starting from the next line -----------------------
              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input float county1 double mi_to_county float county2
               1097  26.82936348115264 1003
              28059 53.813340270973455 1003
               1097  90.31873495653468 1023
               1097  72.67731880449458 1025
              28059  97.14158277226858 1025
               1097  87.88316958111788 1035
               1097  68.02629517720167 1053
              28059   97.5847270464126 1053
              47037    94.539406886108 1083
              47037  97.95536530704817 1089
              28059  29.63931810659627 1097
               1097  78.38392189275584 1099
               1097  49.91413109832283 1129
              28059  70.05792107648765 1129
              end

              Comment


              • #8
                This following command did the trick. However, Still couldn't figure out how I can keep the county1 column. If there is anyway I can keep the county1 column , I'd definitely appreciate that advice! However, the result I got works for me too.

                Code:
                collapse (min) mi_to_county, by(county2)

                Comment


                • #9
                  Add county1 to by()

                  Comment


                  • #10
                    I tried this command - but since there appears the problem of showing up same county in column3 of county2. Only if I use the command - collapse (min) mi_to_county, by(county2)- then the problem of repeated observation of county2 doesn't show up more than one time. But, then I lose the county1 column - which I need

                    Can I use drop duplicates based on county2 column where out of the multiple times single county2 shows up , only the lowest distance observation remains ( from mi_to_county). As a result, like in the second line of my sample data with county2 of 1003 and county1 of 28059 drops.
                    Code:
                     collapse (min) mi_to_county, by(county2 county1)
                    Code:
                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input float county1 double mi_to_county float county2
                    1097 26.82936348115264 1003
                    28059 53.813340270973455 1003
                    1097 90.31873495653468 1023
                    1097 72.67731880449458 1025
                    28059 97.14158277226858 1025
                    1097 87.88316958111788 1035
                    1097 68.02629517720167 1053
                    28059 97.5847270464126 1053
                    47037 94.539406886108 1083
                    28151 42.09687361728982 5041
                    28149 98.86279621352594 5043
                    28151 49.853162401364926 5043
                    47157 74.81813629231492 5055
                    end
                    Last edited by Tariq Abdullah; 13 Jul 2022, 20:26.

                    Comment


                    • #11
                      -collapse- cannot do what you want here. This will:
                      Code:
                      by county2 (mi_to_county), sort: keep if _n == 1
                      Added: Given the extravagant number of decimal places in the variable mi_to_county, this is unlikely to actually happen in your data, but in general you should bear in mind that whenever you say the smallest (or largest, or most frequent, or most whatever), there may be ties, and your question is not fully posed until you specify how such ties are to be broken. And it may matter, as here, because different values of county1 would result if there are two county1's that have the same value of mi_to_county with county2, and that value happens to be the lowest of all for county1. (As I say, that probably won't happen in your data, but it could.) The code above breaks such ties at random, and irreproducibly.
                      Last edited by Clyde Schechter; 14 Jul 2022, 16:01.

                      Comment


                      • #12
                        Wow so informative and helpful! Highly obliged to you Mr. Schecchter and the rest of the benevolent statalist community!

                        Comment

                        Working...
                        X