Generating variables by selective search of variable’s entries

Pantelis Kazakis

Join Date: Aug 2014

Posts: 123
#1

Generating variables by selective search of variable’s entries

02 Oct 2014, 09:35

Dear Stata users,

I have a database about patents. In this database the individual’s address is written in the following format:
address1, address2, address3 and address4 (e.g., Euclid’s Road 15258; P. O. Box 15269; Boston, MA; MA). Information on address3 and address4 may appear with missing values.

What I want to create is a variable called state. This variable will take the name of the state given information from address2, address3 and address4.

What I have in mind is that I need to tell Stata to generate the variable given specific words found in entries of address2, address3 and address4. For the sake of an example: fill state variable with the name Massachusetts if the words Boston or MA (or both) are found at least once in address2, address3 and address4. Since the entries are not the same (i.e., Boston, MA != Boston != MA), a simple if statement does not seem to be the right way to attack this problem.

Can we somehow fix this in Stata?

Thank you for your time!

Last edited by Pantelis Kazakis; 02 Oct 2014, 09:43.
Tags: None
Rich Goldstein

Join Date: Mar 2014

Posts: 4458
#2

02 Oct 2014, 09:40

be careful, there are states other than MA that have a town/city named "Boston"

second, you have, in address4, the state; this seems to be all you need but clearly that is not what you think; please clarify
Comment
Pantelis Kazakis

Join Date: Aug 2014

Posts: 123
#3

02 Oct 2014, 09:41

Originally posted by Rich Goldstein View Post

be careful, there are states other than MA that have a town/city named "Boston"

second, you have, in address4, the state; this seems to be all you need but clearly that is not what you think; please clarify

I understand that well . Here in Columbus, OH we have Pennsylvania Road for example. I forgot to say that address4 is many times null, so one needs to fill the gap using the other information.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35651
#4

02 Oct 2014, 10:06

Be warned: these problems are usually much messier than you want to believe. If this were my problem I would look for the two-letter abbreviations to start with.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

02 Oct 2014, 10:31

I second Nick's caution about how messy this can be. I would use regular expressions to search for two-letter (uppercase) matches at the end of an address variable. I would allow for a number of trailing spaces and digits for zip codes. Since you have a variable number of address lines (variables), you would have to repeat the search. Something like:

Code:

clear
input str50 address1 str50 address2 str50 address3 str50 address4
"Euclid's Road 15258" "P. O. Box 15269" "Boston, MA"
"1 Infinite Loop" "Cupertino, CA 95014"
"Washington, DC 20001"
"Kendall Square/MIT" "209 Broadway" "Cambridge" "MA 02139"
end

* the following is from SSC; to install, type ssc install leftalign
leftalign
list

gen state = regexs(1) if regexm(address4, "([A-Z][A-Z])[0-9 ]*$")
replace state = regexs(1) if regexm(address3, "([A-Z][A-Z])[0-9 ]*$") & mi(state)
replace state = regexs(1) if regexm(address2, "([A-Z][A-Z])[0-9 ]*$") & mi(state)
replace state = regexs(1) if regexm(address1, "([A-Z][A-Z])[0-9 ]*$") & mi(state)

* the following is from SSC; to install, type ssc install listsome
listsome, random noobs

I highly recommend using listsome to check this type of data cleaning work, especially if you are working with lots of observations.

Last edited by Robert Picard; 02 Oct 2014, 10:33.

Comment

Pantelis Kazakis

Join Date: Aug 2014

Posts: 123
#6

02 Oct 2014, 10:35

OK, thank you for your suggestions.

I have tell you that specific search must be made in address4 as well. For it is not only with 2 letters (now seeing the data better I can check that), for example it can be MA.; M. A.; Boston, Massachusetts as well. So, I will try to focus on the very last address using selective search and see what happens.
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#7

02 Oct 2014, 10:53

Yes, this is why it's messy. Again, use listsome to get a sense of what did not work in the first pass

Code:

listsome if mi(state), random

This gives you a random sample of observations where the initial strategy did not work. You then pick the easiest target and make another pass. Repeat until you reach your threshold of diminishing returns.
Comment
Pantelis Kazakis

Join Date: Aug 2014

Posts: 123
#8

02 Oct 2014, 15:29

Robert,

the code that you provided works pretty well, but as you said is not without errors. See for example this:

address1 address2 address3 address4 state

3500 DEER CREEK ROAD M/S 26U-25 PALO ALTO, CA 94304-1317 null AD

As you can see, this should have returned CA.

The problem is that the code continues to address1 and takes the last two letters (AD), while it should have stopped in address3. How can we make it to not continue after finding something starting from the very last address? Some kind of loop?

Another thing, after passing some time with the data is that some state indicators differ. Take for example Virginia, which should be VA. I found that you may also have V.A. or V. A. or even VA. all of them differ with each other.

Plus the case for Michigan is written many time as MICHI or MICHI.

Generally I would say that the more you advance in address, the more likely to get a mistake. Although the sample increases, I may need to stay at the very last level (address4).
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

02 Oct 2014, 16:19

I would switch to a strategy that looks for the State in the last address field only. There's no perfect recipe for how to proceed, you just need to carefully check the matches as you proceed. You should also get a master list of State abbreviations that you can merge with your matches to validate your results. Here are some additional techniques

Code:

clear
input str50 address1 str50 address2 str50 address3 str50 address4
"Euclid's Road 15258" "P. O. Box 15269" "Boston, MA"
"1 Infinite Loop" "Cupertino, CA 95014"
"Washington, DC 20001"
"Kendall Square/MIT" "209 Broadway" "Cambridge" "MA 02139"
"3500 DEER CREEK ROAD" "M/S 26U-25" "PALO ALTO, CA 94304-1317"
"Somewhere, V. A."
"Somewhere else, V.A"
"Here" "Ann Arbor, Michigan"
"There" "Cambridge, Massachusetts"
end

* the following is from SSC; to install, type ssc install leftalign
leftalign
list

* switch to using only the last address line
gen lastaddr = address4
replace lastaddr = address3 if mi(lastaddr)
replace lastaddr = address2 if mi(lastaddr)
replace lastaddr = address1 if mi(lastaddr)

* first attempt, 2 uppercase letters with optional space/zip code
* tweaked to allow also a "-" for zip+4
gen state = regexs(1) if regexm(lastaddr, "([A-Z][A-Z])[0-9 -]*$")

* the following is from SSC; to install, type ssc install listsome
listsome, random noobs
listsome if mi(state), random noobs

* redo but allow for a space and . ( V.A. or V. A. or VA.)
clonevar state0 = state
replace state = regexs(1) + regexs(2) if regexm(lastaddr, "([A-Z])\.? ?([A-Z])\.? ?[0-9 -]*$")
listsome if state0 != state, random noobs

* Let's look for states, fully spelled out
listsome if mi(state), random noobs
replace state0 = state
replace state = "MI" if strpos(upper(lastaddr), "MICHIGAN")
replace state = "MA" if strpos(upper(lastaddr), "MASSACHUSETTS")
listsome if state0 != state, random noobs

address1	address2	address3	address4	state
3500 DEER CREEK ROAD	M/S 26U-25	PALO ALTO, CA 94304-1317	null	AD

Announcement