Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Replacing values by number of characters - string or numeric format

    I am working with zip codes (variable "ZIP"), but I have many incomplete zip code values that I need to replace with 99999. This is how I will track the unknown zip codes in my data set.

    Most of the incomplete zip codes are simply fewer characters (e.g., 1, 222, 4564), so I am trying to find a way to use the replace function while specifying that any values with fewer than 5 characters (the full length of zip codes) are replaced with 99999.

    My original data was saved as string, but I eventually am converting to numeric, which is a step I know how to do. It doesn't matter to me if this data cleaning step happens before or after the conversion to numeric, but it has to be based on character length.

    I was previously using the replace command as follows:

    replace ZIP="99999" if ZIP=="" | ZIP=="00000" | ZIP=="XXXXX" | ZIP=="0" | ZIP=="00" | ZIP=="000" | ZIP=="0000"
    but obviously this is a slower way to do this. I don't just need the "0"'s changed to 99999, I need the "1"'s, "2"s, etc... changed as well. I have read all about the length() command and thought that might help me, but haven't been able to figure this out yet.

    Thank you for the help.




  • #2
    Erin,

    I think the length() function is indeed what you want:

    Code:
    replace ZIP="99999" if length(ZIP)<5
    You definitely want to do this before converting to numeric because there are zip codes with leading zeros that will become 3- or 4-digit numbers after conversion.

    Regards,
    Joe

    Comment


    • #3
      Code:
      replace ZIP = "99999" if (strlen(ZIP) < 5)
      There's some possibility you might have leading or trailing blanks on ZIP, so it would be a little safer to do:
      Code:
      replace ZIP = "99999" if (strlen(strtrim(ZIP)) < 5)
      I'd suggest you check out -help string functions- if you have not seen it already.



      Comment


      • #4
        Before tampering with zip, create a new variable to avoid burning your bridges:
        Code:
        gen len=length(zip)
        gen newzip=zip
        replace newzip = "99999" if len<5

        Comment


        • #5
          Following up on Joe's post, I wonder if any of the ZIP codes that are shorter than 5 characters weren't the result of a prior conversion of ZIP codes from string to numeric, causing the loss of leading zeroes. If the data lacks any ZIP codes beginning with zeroes, I'd be particularly concerned, since most ZIP codes in the Northeastern part of the country begin with zero.

          Comment


          • #6
            Hi all - Thanks very much for the help.

            William - Thanks for asking about that! FYI - I imported the ZIP data from a text file that had fixed columns widths using the infix command. The ZIP's were imported as string. We decided to keep these in string format for that very reason (although we are using Wisconsin data, many NE residents purchased licenses so we do have some 00***, etc... zips). The incomplete ZIP's are simply part of the data set I am working with (there are 19million+ observations and fewer than 10,000 missing ZIP's so we are okay with this from research standpoint). The incompleteness is a product of how they were entered or how they were first written. These data come from fishing license sales and some people didn't necessarily write their complete ZIP or entered them completely in the government system. Or they are from another country, which we aren't studying. Anyways, just wanted to explain and thanks for the always helpful reminder about string --> numeric data lose with leading zeroes!

            Comment

            Working...
            X