Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Match and replace on multiple strings

    I have a list of street addresses, cities, and a number that was assigned to each based (roughly) on neighborhood/area. However, some observations have missings for numbers, and I want to try to fill them in. The best way I can think of doing this is by matching on street name within a city. Of course, this is a little difficult given that a lot of the address data is fairly messy, and addresses don't follow perfect patterns in the first place.

    The first part is fairly straightforward, I just parse and then create a new variable with address2 and address3, which should (usually) be most of the street name, or at least enough to match.
    Code:
    split address
    gen new_address = address2 + " " + address3

    The second part is where I'm getting tripped up. I think I'd want something like (and I'm writing this in plain text on purpose because I'm just writing what I think I'd need in words):

    gen matched_string = 1 if [multiple observations have the same city and new_address]


    Then I could just replace number wherever I get a match:
    Code:
    gen number_2 = .
    replace number_2 = number if matched_string == 1
    If someone could help me with that middle part, I'd appreciate it. Here is some example data:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str22 address str12 city byte number
    "12 Grassy Knoll Rd"     "Springfield"  13
    "62 Prince Street"       "Woodside"     42
    "8732 Rockledge Road"    "Woodside"     23
    "306 Howard Street"      "Woodside"     34
    "9453 NW Hill Field Dr." "Woodside"     22
    "123 Winter Oaks Lan"    "Springfield"   .
    "100 Bo Mountain Dr"     "Woodside"     51
    "711 Church St."         "Woodside"     13
    "9274 San Juan Court"    "Collegeville" 44
    "40 Pilgrim Drive"       "Collegeville" 78
    "93 Redwood Street"      "Collegeville" 87
    "Piper St. PO Box 10"    "Collegeville" 56
    "150 Winter Oaks Lane"   "Springfield"  76
    "1000 Prince St"         "Springfield"   .
    "5 NW Hill Field"        "Woodside"      .
    end

  • #2
    To clarify -- would you want, for instance, "123 Winter Oaks Lan" to have 76 as its number? Since "Winter Oaks" and "Springfield" match with "150 Winter Oaks Lane"? Similarly, would "5 NW Hill Field" be 22, and "1000 Prince St" be 42?

    Comment


    • #3
      Ali Atia yes, that's correct!

      Comment


      • #4
        Your parsing approach doesn't capture all of the variations in the data example -- this is a little bit of a crude approach, but it works for the example:

        Code:
        gen new_address = ustrregexra(address,"^\d*? ","")
        replace new_address = ustrregexra(upper(new_address),"( (RD|DR|ST)\.?$)|( (ROAD|DRIVE|STREET|COURT|LAN(E)?)$)","")
        bys new_address: egen min_num = min(number)
        replace number = min_num if number==.
        The amount of variations on things like "rd/road" you have in the actual dataset will determine whether it is feasible to use this approach. If you prefer to stick with your parsing method, then just skip the first two lines and use the last two to replace the missing numbers.

        Comment


        • #5
          Regarding the first part, capturing variation: yes, because I have so many observations in the real data (a couple million), I didn't even really think it's worth attempting to capture all possible variation. Plus, for my purposes, just getting the main word of the street name is probably enough--there aren't likely to be multiple streets named "Winter XYZ" in one city.

          For the second part, this is helpful, thank you--one issue (sorry this wasn't in the example data) is when the same new_address occurs in multiple cities (aka common street names like Main Street). I think I could solve this by adding 'city' before new_address in the bysort:

          Code:
          bys city new_address: egen min_num = min(number)
          Another issue I didn't actually realize before is if, for example, two observations have the same street name in the same city but different numbers, this strategy would be arbitrarily assigning the lower number. But it would also be impossible to do some "exact match" when that is the case. So I think I just have to accept some variability regardless of strategy. Thanks again!
          Last edited by Garrett Todd; 27 Jul 2022, 10:19.

          Comment

          Working...
          X