  • Dropping specific addresses beginning with '#' or '$'

    Hello,
    I am trying to clean up a large dataset of addresses, where there are several different versions of the same address (e.g. '101 Main St' can also be listed as '101 Main Street') but they are associated with the same individual. I want to create a consolidated dataset with consistent addresses.

    There are some addresses that I need to drop because they are not actual street addresses - e.g. '*103', '*Dept 164'. Many of these erroneous addresses begin with an asterisk, others with alternative symbols.
    Is there a way to selectively drop the addresses that begin with a symbol, or do I need to do something like 'split address, generate (new)' and proceed from there?

    Thank you!

  • #2
    Here's one way to detect those addresses:
    Code:
    replace address = strltrim(address) // leading blanks
    gen CouldBeBad = inlist(substr(address,1,1) , "*", *&", *%*, */") // bad character list
    Rather than drop those observations, I'm suggesting the safer approach is to tag (0/1) them so that you can inspect at least some of them.
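
    A minimal sketch of that inspect-then-drop workflow, assuming the tag variable is named CouldBeBad as in the command above and has already been created:
    Code:
    tabulate CouldBeBad // count tagged (1) versus untagged (0) observations
    list address if CouldBeBad, clean // look at the tagged addresses
    drop if CouldBeBad // run only once you are satisfied these are not street addresses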

    For work like this, you'll want to learn about the large collection of useful string functions that Stata offers. See -help string functions-.

    Many experienced users would resort to the relatively sophisticated string functions known as "regular expressions," but if you're new to this kind of thing, I wouldn't go there.
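
    For reference only, a sketch of what the regular-expression version might look like. It assumes that every genuine address in this dataset starts with a letter or a digit, which is an assumption and not something established in this thread. CouldBeBadRx is a hypothetical variable name.
    Code:
    gen CouldBeBadRx = !regexm(address, "^[A-Za-z0-9]") // 1 if the first character is not a letter or digit
    Note that an empty address would also be tagged by this rule, so it is worth inspecting the tagged observations before dropping anything.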


    • #3
      I think the generate command in post 2 had some quotation marks mangled. Odd, the first time I corrected them in this post, the typos returned when I previewed my work.
      Code:
      gen CouldBeBad = inlist(substr(address,1,1) , "*", "&", "%", "/") // bad character list
