Replacing values by number of characters - string or numeric format

Erin Burkett

Join Date: Aug 2015

Posts: 16
#1

Replacing values by number of characters - string or numeric format

27 Aug 2015, 06:50

I am working with zip codes (variable "ZIP"), but I have many incomplete zip code values that I need to replace with 99999. This is how I will track the unknown zip codes in my data set.

Most of the incomplete zip codes are simply fewer characters (e.g., 1, 222, 4564), so I am trying to find a way to use the replace function while specifying that any values with fewer than 5 characters (the full length of zip codes) are replaced with 99999.

My original data was saved as string, but I eventually am converting to numeric, which is a step I know how to do. It doesn't matter to me if this data cleaning step happens before or after the conversion to numeric, but it has to be based on character length.

I was previously using the replace command as follows:

replace ZIP="99999" if ZIP=="" | ZIP=="00000" | ZIP=="XXXXX" | ZIP=="0" | ZIP=="00" | ZIP=="000" | ZIP=="0000"

but obviously this is a slower way to do this. I don't just need the "0"'s changed to 99999, I need the "1"'s, "2"s, etc... changed as well. I have read all about the length() command and thought that might help me, but haven't been able to figure this out yet.

Thank you for the help.
Tags: None
Joe Canner

Join Date: Mar 2014

Posts: 580
#2

27 Aug 2015, 07:06

Erin,

I think the length() function is indeed what you want:

Code:

replace ZIP="99999" if length(ZIP)<5

You definitely want to do this before converting to numeric because there are zip codes with leading zeros that will become 3- or 4-digit numbers after conversion.

Regards,
Joe
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2411
#3

27 Aug 2015, 07:12

Code:

replace ZIP = "99999" if (strlen(ZIP) < 5)

There's some possibility you might have leading or trailing blanks on ZIP, so it would be a little safer to do:

Code:

replace ZIP = "99999" if (strlen(strtrim(ZIP)) < 5)

I'd suggest you check out -help string functions- if you have not seen it already.
1 like
Comment
Svend Juul

Join Date: Apr 2014

Posts: 515
#4

27 Aug 2015, 07:14

Before tampering with zip, create a new variable to avoid burning your bridges:

Code:

gen len=length(zip) gen newzip=zip replace newzip = "99999" if len<5
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

27 Aug 2015, 07:45

Following up on Joe's post, I wonder if any of the ZIP codes that are shorter than 5 characters weren't the result of a prior conversion of ZIP codes from string to numeric, causing the loss of leading zeroes. If the data lacks any ZIP codes beginning with zeroes, I'd be particularly concerned, since most ZIP codes in the Northeastern part of the country begin with zero.
1 like
Comment
Erin Burkett

Join Date: Aug 2015

Posts: 16
#6

28 Aug 2015, 09:09

Hi all - Thanks very much for the help.

William - Thanks for asking about that! FYI - I imported the ZIP data from a text file that had fixed columns widths using the infix command. The ZIP's were imported as string. We decided to keep these in string format for that very reason (although we are using Wisconsin data, many NE residents purchased licenses so we do have some 00***, etc... zips). The incomplete ZIP's are simply part of the data set I am working with (there are 19million+ observations and fewer than 10,000 missing ZIP's so we are okay with this from research standpoint). The incompleteness is a product of how they were entered or how they were first written. These data come from fishing license sales and some people didn't necessarily write their complete ZIP or entered them completely in the government system. Or they are from another country, which we aren't studying. Anyways, just wanted to explain and thanks for the always helpful reminder about string --> numeric data lose with leading zeroes!
Comment

Announcement

Replacing values by number of characters - string or numeric format

Comment

Comment

Comment

Comment

Comment