Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Extracting zip code from address , and then converting the zip code to tips county code

    Hello everyone,

    I've a dataset where I need to extract the zip code from the following three columns ( for each rf_id the zip code could be in any of the following three columns of address ). Is there a way for each observation in my dataset. The record_dt variable indicates date variable.

    how can I extract the zip code for each observation from the following three columns of address? Would greatly appreciate any kind direction!


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long rf_id int record_dt strL(caddress_2 caddress_3 caddress_4)
    12800340 12782 "60 STATE STREET"                       "SUITE 510"                          "BOSTON, MA 02109"     
    36250888  7003 "P.O. BOX 3511 STATION C"               "OTTAWA ONTARIO CANADA K1Y 4H7"      ""                     
    36340874  7000 "345 CALIFORNIA STREET"                 "SAN FRANCISCO, CA 94104-2675"       ""                     
    36340875  7000 "100 GALLERIA PARKWAY, STE. 1500"       "ATLANTA, GA 30339"                  ""                     
    36920537  7208 "COOKEVILLE, TN 38501"                  ""                                   ""                     
    36960452  7223 "235 EAST 42ND STREET"                  "NEW YORK, NEW YORK 10017-5755"      ""                     
    37290735  7332 "SUITE 3400"                            "SAN FRANCISCO, CA 94111-4187"       ""                     
    37370606  7348 "NEW YORK, NY 10022"                    ""                                   ""                     
    37410900 11040 "WASHINGTON, DC 20004"                  ""                                   ""                     
    37650757  7439 "1400 K STREET , N.W."                  "WASHINGTON, D.C. 20005"             ""                     
    37880668  7532 "PASADENA, CA 91101"                    ""                                   ""                     
    37880670  7532 "6633 CANOGA AVE."                      "CANOGA PARK, CA 91304"              ""                     
    37880673  7522 "WASHINGTON, D.C. 20036"                ""                                   ""                     
    37880675  7522 "530 FIFTH AVE."                        "NEW YORK, N. Y. 10036"              ""                     
    37880676  7522 "1800 PRUDENTIAL PLAZA"                 "CHICAGO, IL 60601"                  ""                     
    37880677  7522 "1350 AVENUE OF THE AMERICAS"           "NEW YORK, N. Y. 10019"              ""                     
    37880680  7522 "1776 K ST. N.W."                       "WASHINGTON, D. C. 20006"            ""                     
    37880681  7522 "CHICAGO, IL 60603"                     ""                                   ""                     
    37880682  7522 ""                                      ""                                   ""                     
    37880683  7522 "READINESS COMMAND"                     "5001 EISENHOWER AVE."               "ALEXANDRIA, VA. 22333"
    37880684  7522 "COMM."                                 "5001 EISENHOWER AVE."               "ALEXANDRIA, VA. 22333"    
    end
    format %td record_dt

  • #2
    The following extracts three types of zip codes - 5-digit US, 9-digit US, and 6-character Canadian that appear at the end of the final non-blank caddress_* variable.
    Code:
    generate zipline = caddress_4
    replace  zipline = caddress_3 if zipline==""
    replace  zipline = caddress_2 if zipline==""
    
    // The next four commands could all be combined into a single command
    // that I would never be able to read and understand
    
    local zip5 "\p{Decimal_Number}{5}"
    local zip9 "\p{Decimal_Number}{5}-\p{Decimal_Number}{4}"
    local zipc "\p{Alphabetic}\p{Decimal_Number}\p{Alphabetic} \p{Decimal_Number}\p{Alphabetic}\p{Decimal_Number}"
    generate zip = ustrregexrf(zipline, "^.*(`zip5'|`zip9'|`zipc')|$", "$1")
    
    list rf_id zipline zip, clean noobs
    Code:
    . list rf_id zipline zip, clean noobs
    
           rf_id                         zipline          zip  
        12800340                BOSTON, MA 02109        02109  
        36250888   OTTAWA ONTARIO CANADA K1Y 4H7      K1Y 4H7  
        36340874    SAN FRANCISCO, CA 94104-2675   94104-2675  
        36340875               ATLANTA, GA 30339        30339  
        36920537            COOKEVILLE, TN 38501        38501  
        36960452   NEW YORK, NEW YORK 10017-5755   10017-5755  
        37290735    SAN FRANCISCO, CA 94111-4187   94111-4187  
        37370606              NEW YORK, NY 10022        10022  
        37410900            WASHINGTON, DC 20004        20004  
        37650757          WASHINGTON, D.C. 20005        20005  
        37880668              PASADENA, CA 91101        91101  
        37880670           CANOGA PARK, CA 91304        91304  
        37880673          WASHINGTON, D.C. 20036        20036  
        37880675           NEW YORK, N. Y. 10036        10036  
        37880676               CHICAGO, IL 60601        60601  
        37880677           NEW YORK, N. Y. 10019        10019  
        37880680         WASHINGTON, D. C. 20006        20006  
        37880681               CHICAGO, IL 60603        60603  
        37880682                                               
        37880683           ALEXANDRIA, VA. 22333        22333  
        37880684           ALEXANDRIA, VA. 22333        22333

    Comment


    • #3
      Damn, that's next level stuff Mr. Lisowski! Wish I had the words to properly thank you for such generous and thoughtful help!

      Just so you know I'm truly humbled and obliged for your invaluable time and knowledge!

      Comment


      • #4
        Thank you for the kind words.

        Which is why I'm embarrassed to admit that I did not sufficiently test my code. The poor coding choice I made, and an editing error, did not affect the correct results in the example data but does if the assumption that the ZIP appears at the end of the string is violated. .

        Here is corrected code - only the generate zip command was changed, and it now uses the appropriate syntax for extracting substrings from a string.
        Code:
        generate zipline = caddress_4
        replace  zipline = caddress_3 if zipline==""
        replace  zipline = caddress_2 if zipline==""
        
        // The next four commands could all be combined into a single command
        // that I would never be able to read and understand
        
        local zip5 "\p{Decimal_Number}{5}"
        local zip9 "\p{Decimal_Number}{5}-\p{Decimal_Number}{4}"
        local zipc "\p{Alphabetic}\p{Decimal_Number}\p{Alphabetic} \p{Decimal_Number}\p{Alphabetic}\p{Decimal_Number}"
        generate zip = ustrregexs(1) if ustrregexm(zipline, "(`zip5'|`zip9'|`zipc')$")
        
        list rf_id zipline zip, clean noobs
        Here's an example of how it now performs.
        Code:
            37880681               CHICAGO, IL 60603        60603  
            37880681           CHICAGO, IL 60603 USA

        Comment


        • #5
          I know all the regular contributors on statalist aim for 100% accuracy but honestly the initial piece of coding helped me already to the point from where I could rectify the rest of the stuff.

          However, I again appreciate that you went back to the code for helping me out with the edited information and coding! To say the least, my research progress would have been six months behind had I not guided by you and other thoughtful mentors on statalist! My sincerest thanks to you and other mentors as well!

          Comment

          Working...
          X