Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to extract county name only from address?

    Hi, I am struggling with extracting county name from the address using split command.
    I am not sure if the split would work for achieving my goal.
    Has anyone tried to solve the same issue?

    My data looks like this.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str58 address
    "4376 BRYAN STREET  GREENWOOD, FL 32443"              
    "8126 HIGHWAY 90  SNEADS, FL 32460"                  
    "2110 E INTERNATIONAL SPEEDWAY BLVD  DELAND, FL 32724"
    "800 WEST BROADWAY  OVIEDO, FL 32765"                
    "4835 S   STATE ROAD 7  FORT LAUDERDALE, FL 33314"    
    "1555 CHENEY HWY  TITUSVILLE, FL 32780"              
    "7297 RIDGE ROAD  PORT RICHEY, FL 34668"              
    "525 E UNIVERSITY BLVD  MELBOURNE, FL 32901"          
    "400 EAST STATE ROAD 434  LONGWOOD, FL 32750"        
    "3850 N HIGHWAY 19 A  MOUNT DORA, FL 32757"          
    "2640 EAST ORANGE  AVENUE  EUSTIS, FL 32726"          
    "1377 BRICKYARD ROAD SUITE 5  CHIPLEY, FL 32428"      
    "5480 DOGWOOD ST  MILTON, FL 32570"                  
    "951 PRIM AVENUE  GRACEVILLE, FL 32440"              
    "8303 W ATLANTIC BLVD  CORAL SPRINGS, FL 33071"      
    "3515 ELKCAM BOULEVARD  CHIPLEY, FL 32428"            
    "100 EAST 10TH STREET  GREENSBORO, FL 32330"          
    "35922 HIGHWAY 27 NORTH  HAINES CITY, FL 33844"      
    "7050 N CHURCH AVENUE  MULBERRY, FL 33860"            
    "7241 STATE ROAD 54  NEW PORT RICHEY, FL 34653"      
    end
    Ideally, I would like to extract county name as "GREENWOOD", "FORT LAUDERDALE", and "NEW PORT RICHEY"
    Last edited by Priver JM; 08 Jun 2021, 22:02.

  • #2
    The easiest way I see to extract these without too much manual editing is to exploit the fact that the county names (at least in the data example) are always preceded by two spaces and succeeded by a comma. This regular expression extracts strings which match that description:

    Code:
    gen county = ustrregexs(4) if ustrregexm(address,"([aA-zZ0-9].+((  )+))([aA-zZ]*( [aA-zZ]*)*)(,)")
    In the data example, this approach works 100% of the time:

    Code:
    . list
    
         +------------------------------------------------------------------------+
         |                                              address            county |
         |------------------------------------------------------------------------|
      1. |               4376 BRYAN STREET  GREENWOOD, FL 32443         GREENWOOD |
      2. |                    8126 HIGHWAY 90  SNEADS, FL 32460            SNEADS |
      3. | 2110 E INTERNATIONAL SPEEDWAY BLVD  DELAND, FL 32724            DELAND |
      4. |                  800 WEST BROADWAY  OVIEDO, FL 32765            OVIEDO |
      5. |     4835 S   STATE ROAD 7  FORT LAUDERDALE, FL 33314   FORT LAUDERDALE |
         |------------------------------------------------------------------------|
      6. |                1555 CHENEY HWY  TITUSVILLE, FL 32780        TITUSVILLE |
      7. |               7297 RIDGE ROAD  PORT RICHEY, FL 34668       PORT RICHEY |
      8. |           525 E UNIVERSITY BLVD  MELBOURNE, FL 32901         MELBOURNE |
      9. |          400 EAST STATE ROAD 434  LONGWOOD, FL 32750          LONGWOOD |
     10. |            3850 N HIGHWAY 19 A  MOUNT DORA, FL 32757        MOUNT DORA |
         |------------------------------------------------------------------------|
     11. |           2640 EAST ORANGE  AVENUE  EUSTIS, FL 32726            EUSTIS |
     12. |       1377 BRICKYARD ROAD SUITE 5  CHIPLEY, FL 32428           CHIPLEY |
     13. |                    5480 DOGWOOD ST  MILTON, FL 32570            MILTON |
     14. |                951 PRIM AVENUE  GRACEVILLE, FL 32440        GRACEVILLE |
     15. |        8303 W ATLANTIC BLVD  CORAL SPRINGS, FL 33071     CORAL SPRINGS |
         |------------------------------------------------------------------------|
     16. |             3515 ELKCAM BOULEVARD  CHIPLEY, FL 32428           CHIPLEY |
     17. |           100 EAST 10TH STREET  GREENSBORO, FL 32330        GREENSBORO |
     18. |        35922 HIGHWAY 27 NORTH  HAINES CITY, FL 33844       HAINES CITY |
     19. |             7050 N CHURCH AVENUE  MULBERRY, FL 33860          MULBERRY |
     20. |        7241 STATE ROAD 54  NEW PORT RICHEY, FL 34653   NEW PORT RICHEY |
         +------------------------------------------------------------------------+
    However, if this produces too many errors in the real data or if the pattern of two spaces is not universal, an alternative approach is to download a dataset of county names, merge it with your dataset, use strpos() to identify which position a given county name appears in in the address, and then extract it.
    Last edited by Ali Atia; 08 Jun 2021, 22:38.

    Comment


    • #3
      I appreciate your help. That's so much helpful. This code works fairly enough for this example.
      However, there were 16 errors that were not matched to some counties containing "ST.PETERSBURG" or "OPA-LOCKA". I did not realize that was included in the sample.


      Comment


      • #4
        Code:
        gen county = ustrregexs(4) if ustrregexm(address,"([aA-zZ0-9].+((  )+))([aA-zZ].*)(,)")
        Last edited by Ali Atia; 09 Jun 2021, 08:49.

        Comment

        Working...
        X