Dropping specific addresses beginning with '#' or '$'

Devin Parker

Join Date: Oct 2022

Posts: 13
#1

04 Dec 2022, 15:05

Hello,
I am trying to clean up a large dataset of addresses, where there are several different versions of the same address (e.g. '101 Main St' can also be listed as '101 Main Street') but they are associated with the same individual. I want to create a consolidated dataset with consistent addresses.

There are some addresses that I need to drop because they are not actual street addresses - e.g. '*103', '*Dept 164'. Many of these erroneous addresses begin with an asterisk, others with other symbols.
Is there a way to selectively drop the addresses that begin with such symbols, or do I need to do something like 'split address, generate(new)' and proceed from there?

Thank you!
Mike Lacy

Join Date: Apr 2014

Posts: 2411
#2

04 Dec 2022, 15:49

Here's one way to detect those addresses:

Code:

replace address = strltrim(address) // leading blanks
gen CouldBeBad = inlist(substr(address,1,1) , "*", *&", *%*, */") // bad character list

Rather than drop those observations, I'm suggesting the safer approach is to tag (0/1) them so that you can inspect at least some of them.
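
As a rough sketch of that tag-then-inspect workflow: assuming CouldBeBad was created successfully as a 0/1 variable, address is a string variable, and the dataset has at least 50 observations (the cutoff is arbitrary), something like the following could be used to look over the flagged addresses before dropping anything.

```stata
* Sketch only: assumes CouldBeBad (0/1) and the string variable address exist
tabulate CouldBeBad                        // how many observations were flagged
list address if CouldBeBad == 1 in 1/50    // flagged addresses among the first 50 observations
* Only once you are satisfied that the flagged addresses really are junk:
drop if CouldBeBad == 1
```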

For work like this, you'll want to learn about the large collection of useful string functions that Stata offers. See -help string functions-.
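
To give a flavor of what those functions can do for the consolidation goal in post #1, here is a minimal sketch using only basic string functions. The variable name address_clean and the rule that a trailing " STREET" becomes " ST" are assumptions made for illustration, not something taken from the actual data.

```stata
* Sketch only: address_clean is a hypothetical new variable
* Trim both ends, collapse repeated internal blanks, and convert to upper case
gen address_clean = strupper(stritrim(strtrim(address)))
* If the last 7 characters are " STREET", replace them with " ST"
replace address_clean = substr(address_clean, 1, strlen(address_clean) - 7) + " ST" ///
    if substr(address_clean, -7, 7) == " STREET"
```

With a rule like this, '101 Main St' and '101 Main Street' would both end up as '101 MAIN ST'. Other abbreviations would each need their own rule.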

Many experienced users would resort to the relatively sophisticated string functions known as "regular expressions," but if you're new to this kind of thing, I wouldn't go there.
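
For readers who do want to see what the regular-expression route might look like, here is a minimal sketch. It flags any address whose first non-blank character is neither a letter nor a digit, which is broader than the fixed character list above and may catch some legitimate addresses. The variable name StartsOdd is hypothetical.

```stata
* Sketch only: StartsOdd is a hypothetical variable name
* 1 if the first non-blank character is not a letter or digit, 0 otherwise
gen byte StartsOdd = regexm(strltrim(address), "^[^A-Za-z0-9]")
tabulate StartsOdd
```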
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

04 Dec 2022, 16:15

I think the generate command in post 2 had some quotation marks mangled. Odd, the first time I corrected them in this post, the typos returned when I previewed my work.

Code:

gen CouldBeBad = inlist(substr(address,1,1) , "*", "&", "%", "/") // bad character list
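
Since the thread title mentions '#' and '$', the same idea can presumably be extended to those characters. The sketch below assumes they belong on the bad-character list and uses the hypothetical name CouldBeBad2 to avoid clashing with the variable above. The dollar sign is supplied through char(36) rather than typed literally, to sidestep any question about how Stata treats that character, which also marks global macros.

```stata
* Sketch only: extends the bad-character list with "#" and the dollar sign
* char(36) returns the dollar sign
gen CouldBeBad2 = inlist(substr(address,1,1), "*", "&", "%", "/", "#", char(36))
```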