Finding minimum distance between counties

Tariq Abdullah

Join Date: Apr 2021

Posts: 366
#1

Finding minimum distance between counties

12 Jul 2022, 18:12

Hello,

My dataset has three columns. In column 1 I have county1 which is a set of a fixed number of counties ( 100) out of all the census counties. Column 3 is county2 which has all the counties from the census, unlike the county1 column. And, in column 2 I have the distance between the fixed number of counties in a column of county1 and all the counties from the census in county2.

My goal is to find the minimum distance of the unique county present in column 3 from any of the fixed counties in column 1.

Like in the 3rd and 4th line of data in the county2 column the county 10001 shows up twice and the minimum distance for that county is 30.357154190846018 from the county 10005 in the county1 column. County 12121 and 12125 also show up twice in the county2 column.

I want my desired variable to show up the minimum distance for each unique county from the county2 column except for those counties which are present in the county1 column ( 100 fixed counties in county1 column). Additionally, in that desired variable I want the value of that variable 10 for the counties present in the county1 column.

May I have some guidance how I can do it? Honestly, given my experience, this task is out of my league. Would definitely appreciate some suggestions.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long county1 double mi_to_county long county2 10001 30.357154190846018 10005 10001 33.93271490946873 10003 10003 33.93271490946873 10001 10005 30.357154190846018 10001 12001 21.94139189527779 12007 12001 26.160991422024676 12125 12001 26.527847890310913 12041 12001 36.84643076250594 12019 12001 37.26131475453462 12075 12001 37.32388476194004 12107 12001 37.74940321571971 12083 12001 40.91842391588499 12023 12001 44.941311511693854 12003 12001 50.78094710678782 12029 12001 52.02776446747155 12121 12001 53.80338887869782 12067 12001 58.37400366291111 12017 12003 19.06133167684259 12125 12003 20.428353446347472 12023 12003 27.31084808096766 12007 12003 32.92368786810345 13049 12003 35.31646691638998 12019 12003 37.44802004706887 12089 12003 39.019664192183974 12031 12003 40.33782725115972 12047 12003 41.39883902283289 13101 12003 42.25346062002965 12121 12003 44.941311511693854 12001 12003 47.39732770502037 13065 12003 50.68954033379425 13299 12003 50.94956183265025 12041 12003 56.91760482788743 13039 end

Last edited by Tariq Abdullah; 12 Jul 2022, 18:36.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#2

12 Jul 2022, 19:15

I am not entirely sure I have understood what you want. But I think it is this:

Code:

by county2 (mi_to_county), sort: gen wanted = mi_to_county[1] rangestat (count) in_county1 = mi_to_county, interval(county1 county2 county2) replace in_county1 = min(1, max(in_county1, 0)) replace wanted = 10 if in_county1

Note: -rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer. It is available from SSC.
1 like
Comment
Tariq Abdullah

Join Date: Apr 2021

Posts: 366
#3

12 Jul 2022, 19:32

Sincerest form of gratitude for articulating such a complicated problem in a way so that a naive coder like me understands it. Much obliged again for such a wonderful and meaningful contribution!
Comment

Tariq Abdullah

Join Date: Apr 2021
Posts: 366

12 Jul 2022, 19:40

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long county1 double mi_to_county long county2
47065 59.098535970592735  1049
47065 51.707425409947035  1071
22017  58.21215376963558  5027
47157 23.789068473129394  5035
47157     50.12913940926  5037
22017  48.45715761171868  5073
47157  57.29801020252582  5077
22017  50.33350640401567  5091
47157 41.246427024792865  5093
47157  51.62385708227437  5111
47157 49.278375962974096  5123
36081  46.40304309204033  9001
36071  50.22042149924045  9001
34013 55.046735839161805  9001
44003  51.81071876082283  9007
44003  30.52177924192761  9011
44003  41.32820154198425  9013
44003 23.659400946894422  9015
24003  57.19865840354873 10001
42101  40.40149007443556 10003
24003 25.296911561215097 11001
end

As much as I appreciate the help, the issue arises after implementing the code that the county 10001 in county2 column is giving the minimum distance as 57.19865840354873 miles which doesn't reflect the data if you look at the 3rd or 4th line of my #1 post where both the distance from county 10001 are respective 33 and 30 miles. So both of them are less than 57.19865840354873 miles - we've got here.

Would you be kind enough to reflect on that? Because, the way I was expecting the answer is my desired variable should have given me 30.357154190846018 in case of county 10001 in column2.

Last edited by Tariq Abdullah; 12 Jul 2022, 19:42.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30065

12 Jul 2022, 20:26

Well, if you run the code on just the example data shown in #4, you indeed get 57.198... But that is correct! Because if you just run that data, Stata has no way to know that 10001 also appears with observations in the data example in #1 that have shorter distances. But even the 30.357... is not the right answer if you put all the data together, because 10001 also appears as a value of the variable county1 when the data from #1 is taken into account, so by your rule "Additionally, in that desired variable I want the value of that variable 10 for the counties present in the county1 column" the correct answer is 10. If you run the data from both examples together, that is what you get:

Code:

. * Example generated by -dataex-. For more info, type help dataex
. clear

. input long county1 double mi_to_county long county2

          county1  mi_to_co~y       county2
  1. 10001 30.357154190846018 10005
  2. 10001  33.93271490946873 10003
  3. 10003  33.93271490946873 10001
  4. 10005 30.357154190846018 10001
  5. 12001  21.94139189527779 12007
  6. 12001 26.160991422024676 12125
  7. 12001 26.527847890310913 12041
  8. 12001  36.84643076250594 12019
  9. 12001  37.26131475453462 12075
 10. 12001  37.32388476194004 12107
 11. 12001  37.74940321571971 12083
 12. 12001  40.91842391588499 12023
 13. 12001 44.941311511693854 12003
 14. 12001  50.78094710678782 12029
 15. 12001  52.02776446747155 12121
 16. 12001  53.80338887869782 12067
 17. 12001  58.37400366291111 12017
 18. 12003  19.06133167684259 12125
 19. 12003 20.428353446347472 12023
 20. 12003  27.31084808096766 12007
 21. 12003  32.92368786810345 13049
 22. 12003  35.31646691638998 12019
 23. 12003  37.44802004706887 12089
 24. 12003 39.019664192183974 12031
 25. 12003  40.33782725115972 12047
 26. 12003  41.39883902283289 13101
 27. 12003  42.25346062002965 12121
 28. 12003 44.941311511693854 12001
 29. 12003  47.39732770502037 13065
 30. 12003  50.68954033379425 13299
 31. 12003  50.94956183265025 12041
 32. 12003  56.91760482788743 13039
 33. 47065 59.098535970592735  1049
 34. 47065 51.707425409947035  1071
 35. 22017  58.21215376963558  5027
 36. 47157 23.789068473129394  5035
 37. 47157     50.12913940926  5037
 38. 22017  48.45715761171868  5073
 39. 47157  57.29801020252582  5077
 40. 22017  50.33350640401567  5091
 41. 47157 41.246427024792865  5093
 42. 47157  51.62385708227437  5111
 43. 47157 49.278375962974096  5123
 44. 36081  46.40304309204033  9001
 45. 36071  50.22042149924045  9001
 46. 34013 55.046735839161805  9001
 47. 44003  51.81071876082283  9007
 48. 44003  30.52177924192761  9011
 49. 44003  41.32820154198425  9013
 50. 44003 23.659400946894422  9015
 51. 24003  57.19865840354873 10001
 52. 42101  40.40149007443556 10003
 53. 24003 25.296911561215097 11001
 54. end

.
.
. by county2 (mi_to_county), sort: gen wanted = mi_to_county[1]

.
. rangestat (count) in_county1 = mi_to_county, interval(county1 county2 county2)

.
. replace in_county1 = min(1, max(in_county1, 0))
(50 real changes made)

.
. replace wanted = 10 if in_county1
(8 real changes made)

.
. list if county2 == 10001

     +---------------------------------------------------+
     | county1   mi_to_c~y   county2   wanted   in_cou~1 |
     |---------------------------------------------------|
 19. |   10005   30.357154     10001       10          1 |
 20. |   10003   33.932715     10001       10          1 |
 21. |   24003   57.198658     10001       10          1 |
     +---------------------------------------------------+

The problem you have stated requires Stata to consider the relationship of each value of county2 with every value in county1. So the code will only work correctly if run on the complete data set all at once.

Last edited by Clyde Schechter; 12 Jul 2022, 20:29.

Comment

Tariq Abdullah

Join Date: Apr 2021

Posts: 366
#6

12 Jul 2022, 21:46

my apologies! You are absolutely right. I just made a mistake. Again, sincerest apology for my silly mistake. Appreciate your patience !
Comment
Tariq Abdullah

Join Date: Apr 2021

Posts: 366
#7

13 Jul 2022, 15:44

Since I need to match it with my master data I need one column of the county to be unique. Therefore, I need each county in column 3 to be unique ( to appear once ). So, in the following data after following your suggestion of coding I'm aiming to get the lowest distance for each unique county present in column3 ( county2 column).

For 1003 county in column 2 I need to find the lowest distance 26.82 and let the other observation drop where it shows the distance to be 53.81 ish.

Looking forward to hearing the suggestion at your convenience.

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float county1 double mi_to_county float county2 1097 26.82936348115264 1003 28059 53.813340270973455 1003 1097 90.31873495653468 1023 1097 72.67731880449458 1025 28059 97.14158277226858 1025 1097 87.88316958111788 1035 1097 68.02629517720167 1053 28059 97.5847270464126 1053 47037 94.539406886108 1083 47037 97.95536530704817 1089 28059 29.63931810659627 1097 1097 78.38392189275584 1099 1097 49.91413109832283 1129 28059 70.05792107648765 1129 end
Comment
Tariq Abdullah

Join Date: Apr 2021

Posts: 366
#8

13 Jul 2022, 16:15

This following command did the trick. However, Still couldn't figure out how I can keep the county1 column. If there is anyway I can keep the county1 column , I'd definitely appreciate that advice! However, the result I got works for me too.

Code:

collapse (min) mi_to_county, by(county2)
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#9

13 Jul 2022, 16:32

Add county1 to by()
1 like
Comment
Tariq Abdullah

Join Date: Apr 2021

Posts: 366
#10

13 Jul 2022, 20:21

I tried this command - but since there appears the problem of showing up same county in column3 of county2. Only if I use the command - collapse (min) mi_to_county, by(county2)- then the problem of repeated observation of county2 doesn't show up more than one time. But, then I lose the county1 column - which I need

Can I use drop duplicates based on county2 column where out of the multiple times single county2 shows up , only the lowest distance observation remains ( from mi_to_county). As a result, like in the second line of my sample data with county2 of 1003 and county1 of 28059 drops.

Code:

collapse (min) mi_to_county, by(county2 county1)

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float county1 double mi_to_county float county2 1097 26.82936348115264 1003 28059 53.813340270973455 1003 1097 90.31873495653468 1023 1097 72.67731880449458 1025 28059 97.14158277226858 1025 1097 87.88316958111788 1035 1097 68.02629517720167 1053 28059 97.5847270464126 1053 47037 94.539406886108 1083 28151 42.09687361728982 5041 28149 98.86279621352594 5043 28151 49.853162401364926 5043 47157 74.81813629231492 5055 end

Last edited by Tariq Abdullah; 13 Jul 2022, 20:26.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#11

14 Jul 2022, 15:58

-collapse- cannot do what you want here. This will:

Code:

by county2 (mi_to_county), sort: keep if _n == 1

Added: Given the extravagant number of decimal places in the variable mi_to_county, this is unlikely to actually happen in your data, but in general you should bear in mind that whenever you say the smallest (or largest, or most frequent, or most whatever), there may be ties, and your question is not fully posed until you specify how such ties are to be broken. And it may matter, as here, because different values of county1 would result if there are two county1's that have the same value of mi_to_county with county2, and that value happens to be the lowest of all for county1. (As I say, that probably won't happen in your data, but it could.) The code above breaks such ties at random, and irreproducibly.

Last edited by Clyde Schechter; 14 Jul 2022, 16:01.
1 like
Comment
Tariq Abdullah

Join Date: Apr 2021

Posts: 366
#12

14 Jul 2022, 16:40

Wow so informative and helpful! Highly obliged to you Mr. Schecchter and the rest of the benevolent statalist community!
Comment

Announcement