Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Removing a list of string from mailing address

    Hi all,

    I have a database of mailing address that I need to do some cleaning on it. Some of the addresses come with some information (e.g. STE 240 , RM A145 , BLDG # 500, FL 876, # 2B, etc.) that I need to put them in a separate column. I am using Stata 14.2 and would greatly appreciate any help with this.

    Thanks so much in advance,
    Z

  • #2
    Please provide a data example using the -dataex- command. Without more details, which will be provided by such an example, it is very difficult to offer any useful advice.

    Comment


    • #3
      Sure. Here is a sample data. Just to add that, I want to put those data into the "additional_info" column. Thanks again!
      Code:
       
      clear
      input str29 address str29 zipcode str29 additional_info
      "413 Fulton St STE 240" "61354" ""    
      "18527 Jersey Ave" "93245" ""    
      "24816 Lagos Rd" "56438" ""
      "265 Township Line Rd  # 2B" "19027" ""    
      "2609 W 19th St FL 876" "79072" ""    
      "8604 47th St" "60534" ""    
      "270 Mount Zion Rd" "29477" ""    
      "600 Northpointe Ln BLDG 500" "24540" ""
      end

      Comment


      • #4
        Thanks for the data example.

        Code:
        replace additional_info = ustrregexs(3) if ustrregexm(address,"(.*)( )(([A-Z]{2,}|\W).*)")
        replace address = strtrim(subinstr(address,additional_info,"",1))
        . list, div sep(0)
        
             +-------------------------------------------+
             |              address | zipcode | additi~o |
             |----------------------+---------+----------|
          1. |        413 Fulton St |   61354 |  STE 240 |
          2. |     18527 Jersey Ave |   93245 |          |
          3. |       24816 Lagos Rd |   56438 |          |
          4. | 265 Township Line Rd |   19027 |     # 2B |
          5. |       2609 W 19th St |   79072 |   FL 876 |
          6. |         8604 47th St |   60534 |          |
          7. |    270 Mount Zion Rd |   29477 |          |
          8. |   600 Northpointe Ln |   24540 | BLDG 500 |
             +-------------------------------------------+
        This operates on the basis that the unwanted strings have either ≥2 capital letters or a non-alphanumeric character at the start, i.e. BLDG 500 or # 2B, and that no characters in the actual address match that pattern.
        Last edited by Ali Atia; 18 Jun 2021, 21:11.

        Comment


        • #5
          Wow! This worked perfectly! Thanks so much, Ali!

          Comment


          • #6
            variants:
            Code:
            gen additional_info2 = ustrtrim(ustrregexs(0)) if ustrregexm(address,"\s(\p{Lu}{2,}|#).*?$")
                
            gen additional_info3 = ustrtrim(ustrregexs(0)) if ustrregexm(address,"\s(STE|FL|BLDG|#).*?$")

            Comment

            Working...
            X