Dropping observations based on the contents of a string variable

Maroun Mezher

Join Date: Jul 2023

Posts: 13
#1

20 Jul 2023, 13:48

Hello,

I am working with data that contain both US ZIP codes and Canadian postal codes, which start with a letter (all values are truncated to their first five characters). I want to keep only the US ZIP codes and drop the foreign ones. Here is an example:

Observation   zip_code
1             "12345"
2             "58797"
3             "L6J 9"
4             "K9R 6"

I tried the code below, but it dropped all observations. Is there an efficient way to do this without dropping each observation manually? Note that all the zip codes are stored as strings so that I don't lose zip codes that have leading 0s. Thanks!

Code:

drop if zip_code[0] != "0" & zip_code[0] != 1 & [...] & zip_code[0] != 9

Andrew Musau

Join Date: Oct 2014
Posts: 10281

20 Jul 2023, 14:42

Standard US ZIP codes range from "00001" to "99950", so you can look for strings that have length 5 and lie within this range. This check will not necessarily eliminate every foreign code, however: a five-digit numeric postal code from another country would also pass.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte observation str7 zip_code
1 `""12345""'
2 `""58797""'
3 `""L6J 9""'
4 `""K9R 6""'
end

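* Strip everything except letters and digits (quotes, spaces, punctuation)
replace zip= trim(ustrregexra(zip, "[^0-9a-zA-Z]", ""))
* Drop a value if it is not 5 characters long or is not a number in range
drop if length(zip)!=5 | !inrange(real(zip), 1, 99950)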

Result:

Code:

. l

     +---------------------+
     | observ~n   zip_code |
     |---------------------|
  1. |        1      12345 |
  2. |        2      58797 |
     +---------------------+

.
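
A note on why the attempt in #1 dropped everything: in Stata, zip_code[0] is an observation subscript, not a character index. It asks for the value of zip_code in observation 0, which does not exist, so it returns an empty string and every != comparison is true. To look at characters, use a string function such as substr(zip_code, 1, 1) or a regular expression. The following is only a minimal sketch of the regular-expression route, assuming Stata 14 or later (for ustrregexm()) and assuming that "exactly five digits" is an acceptable definition of a US ZIP code in your data. The fifth observation is a made-up value added to illustrate a leading zero.

```stata
* Hypothetical example data; observation 5 is invented to show a leading zero
clear
input byte observation str5 zip_code
1 "12345"
2 "58797"
3 "L6J 9"
4 "K9R 6"
5 "01234"
end

* Keep only values that consist of exactly five digits.
* ^ and $ anchor the match to the whole string, so "L6J 9" fails.
drop if !ustrregexm(zip_code, "^[0-9]{5}$")

list
```

Because the variable stays a string and is never converted, leading zeros are preserved. Unlike the range check, this does not reject "00000" or values above "99950", so combine the two conditions if that matters for your data.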


Maroun Mezher

Join Date: Jul 2023

Posts: 13
#3

20 Jul 2023, 15:33

Thank you