Strings: how can I deterministically impute incomplete string data using neighbouring complete records in a variable

Jamie Fagg

Join Date: Apr 2014

Posts: 6
#1

Strings: how can I deterministically impute incomplete string data using neighbouring complete records in a variable

01 May 2014, 16:03

Hello,

Please see table below. As you can see, this is a subset of address data (N~7,000 in the full dataset).

I would like to use ID2 to impute ID1 - i.e. add in 'Street' to the end of ID1.

I'm using Stata/MP 13.1 for Windows (64-bit x86-64)

UPDATE: I posted and then realised that the title might imply that all the data like this are next to each other in the dataset, they are not.

P.S. Sorry for the additional bit of table below - the empty cells aren't relevant. I can't work out how to manually save post drafts or delete tables when posting to the Forum. When I've gone to other areas of Statalist.org to try to work these things out I keep losing draft content.

Best wishes,

Jamie

Data
ID Address

1 337 W Pender

2 337 W Pender Street
Tags: None
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#2

01 May 2014, 16:43

I don't see a satisfactory way of answering the question unless you give a better explanation of the problem. Just saying that you want to "impute" one observation based on another observation doesn't seem enough. Read the advice on posting given in the FAQ.

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Comment

Jamie Fagg

Join Date: Apr 2014
Posts: 6

01 May 2014, 17:59

Thanks for the reply Roberto. Sorry if I wasn't clear. I've tried to expand it more below.

The variable is a street address variable for Canadian addresses in British Columbia. The variable is often only partially complete (i.e. unit, building, street name, street type, locality, and province are variously included or not).

A large proportion of the observations are very similar to each other. I know from other sources that 337 W Pender (ID1) is the same as 337 W Pender Street (ID2). I can change this by hand for this case, but need a general solution which does the same thing. I've included a larger subset below to try to illustrate it a bit more. ID is self-explanatory, Address is the text as I currently have it, Desired address is the text as I'd like to have it.

I think the code would do the following two operations:

1. make a variable which has a value of 1 for observations where the following statement is true. Observation 1 - "Street" = Observation 2. This variable would equal 1 for observation IDs 1 and 5 in my data. My problem is that I can't see how to do logical comparisons between observations within the same variable.
2. insert the text "Street" after the street_name in question - so after "W Pender" in ID1 and after "Powell" in ID5. I'm assuming that this will involve the subinstr function but I can't work out how to tell it to insert it at the right position in the text.

I hope this helps, and thanks for your time.

Data

ID	Address	Desired address
1	337 W Pender	337 W Pender Street
2	337 W Pender Street	337 W Pender Street
3	Powell Street, Vancouver, BC	Powell Street, Vancouver, BC
4	1220 Granville Street	1220 Granville Street
5	Powell, Vancouver, BC	Powell Street, Vancouver, BC
6	1220 Granville Street	1220 Granville Street

Comment

Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#4

01 May 2014, 18:15

Jamie,

Thanks for stating the problem anew. You are more likely to receive help in this way.

One strategy could be to assume that the longer address is the correct one and keep that one address for all other observations, but it may not be true. Suppose there's a typo in the form of an excess character in one observation, but all other relevant observations are fine. Then using this strategy would lead you to not only keep the erroneous one, but to actually spread the error all over the database. Another limitation could involve having different parts of the complete address split up in different observations. Say one observation has the building but not the street, and vice versa for another observation. Then keeping the longest one is no good.

Maybe some user with more experience can show the way.

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Comment
Jamie Fagg

Join Date: Apr 2014

Posts: 6
#5

02 May 2014, 12:49

Hi Roberto and Statalist,

Thanks for the reply and the suggestion regarding the longer address. I agree that it is initially appealing but also agree that it is perhaps too subject to introducing error - especially because the address data are of really unstructured and variable quality overall.

I think that avoiding risks similar to this may have been why I am thinking along the lines of the slightly more specific algorithm I outlined above. I think that method would reduce these risks because it would only target observations which are missing a general property like street type (i.e. substrings like 'Street' or 'Avenue) as opposed to more specific properties like street names. Once identified that this street exists elsewhere in the database, I would then insert the substring into the cell lacking it. I am hoping that once I have cleaned up the structure of the data a bit more, it will allow me to drill down into the street name errors more easily.

So, yes, any further thoughts and advice would be really appreciated

Best wishes

Jamie
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#6

02 May 2014, 18:15

Your data being addresses, it is likely that there are more problems than just missing classifiers like "street" and "avenue." I think I would save a file with just the addresses, drop duplicates, then sort it alphabetically, and then by hand edit it by deleting incomplete/incorrect addresses for which a correct version exists. At the end you would have a file consisting entirely of distinct, correct addresses. I would then fuzzy-merge that with your working data set using -reclink-, available from SSC (I believe Michael Blasnik wrote this--it's an excellent tool for this kind of situation.) That should get you most of the incorrect addresses matched to the corresponding correct one. You can than hand correct what few errors are likely to remain.

Hope this helps.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

03 May 2014, 12:44

The first step in cleaning up addresses like this is usually to split each part into separate variables. This can be tricky and often requires some advanced string manipulation techniques that rely on regular expressions. The following code would probably not work on the full sample but shows the gist of what needs to be done:

Code:

clear
input id str50 addr
1    "337 W Pender"
2    "337 W Pender Street"
3    "Powell Street, Vancouver, BC"
4    "1220 Granville Street"
5    "Powell, Vancouver, BC"
6    "1220 Granville Street"
6    "1220 Granville St."
end

* if all addresses are in BC, remove the redundant province code
gen addrBC = subinstr(addr,", BC","",1)

* assume that a comma separates the address from the city
split addrBC, parse(,)
rename addrBC2 city

* split the address into the civic number and street name
gen civicno = real(word(addrBC1,1))
gen street = subinstr(addrBC1,word(addrBC1,1),"",1)
replace street = addrBC1 if mi(civicno)

* split and standardize the street type
gen street2 = " " + upper(street) + " "
gen streettype = ""
replace streettype = "ST" if strpos(street2, " STREET ")
replace street2 = subinstr(street2, " STREET ", " ", 1)
replace streettype = "ST" if strpos(street2, " ST. ")
replace street2 = subinstr(street2, " ST. ", " ", 1)
replace street2 = trim(street2)

Once each original address is split into its logical parts, you can group addresses in various ways. To ignore the street type, you can:

Code:

sort city civicno street2 id
by city civicno street2: gen obs1 = _n == 1
gen addrid = sum(obs1)

list id addr addrid civicno street2 streettype city, sepby(city civicno street2) noobs

Note that it's probably easier to run the dataset through an address validation/geocoding application. The better ones return the matched address and the quality of the match (including if the best match ties with others, e.g. if "123 Powell Ave" and "123 Powell St" both exist in Vancouver). Take a look at geocode3 from SSC to give you an idea of how that works.

Comment

Kieran McCaul

Join Date: Apr 2014
Posts: 60

03 May 2014, 17:28

It's really hard to split up addresses into their component parts, particularly if they don't follow some standard format and so you end up with addresses ending in "ST", "ST.", or "STREET".

Regular expressions are the way to go here.
The following are just some examples. With a bigger list of addresses, I'd approach this differently depending on what they looked like.

Code:

clear*


input id str50 addr
1    "337 W Pender"
2    "337 W Pender Street"
3    "Powell Street, Vancouver, BC"
4    "1220 Granville Street"
5    "Powell, Vancouver, BC"
6    "1220 Granville Street"
6    "1220 Granville St."
end



gen address = upper(addr)

gen end = regexs(1) if regexm(address,"[ ]([A-Z]+)$")
tab end

* ----------------------------------------------------------------------------------------
list if end=="BC"

replace address = regexs(1) if regexm(address,"^(.+)[,?] (BC)$")

list if end=="BC"


capture drop end
gen end = regexs(1) if regexm(address,"[ ]([A-Z]+)$")
tab end

* ----------------------------------------------------------------------------------------
list if end=="VANCOUVER"

gen city = regexs(2) if regexm(address,"^(.+)[,?] (VANCOUVER)$")

replace address = regexs(1) if regexm(address,"^(.+)[,?] (VANCOUVER)$")

list if end=="VANCOUVER"

capture drop end
gen end = regexs(1) if regexm(address,"[ ]([A-Z.]+)$")
tab end


* ----------------------------------------------------------------------------------------
* Split Addresses of the form 999 XXXXXXX XX

gen numb = regexs(1) if regexm(address,"^([0-9]+)[ ]([A-Z ]+)[ ](ST.)$")
gen name = regexs(2) if regexm(address,"^([0-9]+)[ ]([A-Z ]+)[ ](ST.)$")
gen type = "STREET" if regexm(address,"^([0-9]+)[ ]([A-Z ]+)[ ](ST.)$")

replace numb = regexs(1) if regexm(address,"^([0-9]+)[ ]([A-Z ]+)[ ](STREET)$")
replace name = regexs(2) if regexm(address,"^([0-9]+)[ ]([A-Z ]+)[ ](STREET)$")
replace type = regexs(3) if regexm(address,"^([0-9]+)[ ]([A-Z ]+)[ ](STREET)$")

list

Comment

Robert Picard

Join Date: Mar 2014

Posts: 1536
#9

04 May 2014, 09:16

In case the OP is starting to be overwhelmed by the complexity of the suggestions so far, here's a simple solution to his initial request:

Code:

clear input id str50 addr 1 "337 W Pender" 2 "337 W Pender Street" 3 "Powell Street, Vancouver, BC" 4 "1220 Granville Street" 5 "Powell, Vancouver, BC" 6 "1220 Granville Street" end clonevar desired = addr gen asimpler = subinstr(addr," Street","",1) gen better = addr != asimpler bys asimpler (better): replace desired = addr[_N] list id addr desired, sepby(desired)

In other words, before constructing the desired version of the address, you have to first remove " Street" when present and then group the simpler but ambiguous addresses. Within the resulting groups of ambiguous addresses, you order observations so that the preferred version is last.
Comment
Svend Juul

Join Date: Apr 2014

Posts: 515
#10

04 May 2014, 14:23

Following up upon the suggestion from Clyde Schecter:

Before sorting the entries you may benefit from removing initial house numbers, so it is W Pender rather than 337 W Pender, keeping all addresses on the street together..
This can be accomplished with this code, adr being the original address variable, adr2 a modified variable without house numbers:

Code:

generate len = length(adr) // Length of address generate w1 = word(adr,1) // First word in address generate len1 = length(w1) // Length of first word generate len2 = len-len1 // Length of remaining address generate adr2=adr // Revised address variable // with first word removed if it is a number: replace adr2 = substr(adr,len1+2,len2) if !missing(real(w1))

Good luck
Comment
Jamie Fagg

Join Date: Apr 2014

Posts: 6
#11

05 May 2014, 13:10

Thank you to everyone who has replied to this since I last checked.

As there are lots of useful ideas here I am going to have a look to see if I can use them to do what I need. I'll come back to the list at a later point to feedback on progress with it all.
Comment
Konrad Zdeb

Join Date: Apr 2014

Posts: 496
#12

07 May 2014, 05:50

I would like to add my two humble comments. In my experience of working with administrative data sets doing anything by hand is extremely difficult. If your data set is derived from some administrative system that processes a vast number of claimants you will spend days trying to tidy it up by hand. My suggestions would be:
Do not delete anything.

You could estimate degree of similarity between phrases, there is famous FuzzyFind in Excel (apologies for linking the VBA code here) but I presume that developing analogous command in Stata would be easy.

Alternative suggestion would be use of external software, batch geocode your addresses, Google is usually good in deciphering location names. You can subsequently add obtained coordinates to your dataset and undertake data linkage using those coordinates. Having coordinates will let you easily estimate whether two addresses correspond to the same place or are located in vicinity.

Kind regards,
Konrad
Version: Stata/IC 13.1
Comment
Jamie Fagg

Join Date: Apr 2014

Posts: 6
#13

13 May 2014, 15:25

Dear all,

Many thanks for all your replies. So, my original question has been answered in the sense that I found an external geocoder which identified all the ambiguous street types and had far superior algorithms for assigning the correct ones than the approach I was trying to develop. Particular thanks to Robert for suggesting using the geocoder.

For reference: I used this geocoder developed by the Government of British Columbia which seems very comprehensive to me - with many fault reporting and accuracy variables returned as output in addition to the lat/long coordinates and other area-level units (i.e. province, locality, streeet, block, building, unit). I would recommend it for anyone in BC - http://www.data.gov.bc.ca/dbc/geogra...geocoding.page

Originally posted by Robert Picard View Post

Note that it's probably easier to run the dataset through an address validation/geocoding application. The better ones return the matched address and the quality of the match (including if the best match ties with others, e.g. if "123 Powell Ave" and "123 Powell St" both exist in Vancouver). Take a look at geocode3 from SSC to give you an idea of how that works.

I'm still working on that data which was too ambiguous for the geocoder and many thanks to all for the code and suggestions for how to proceed with this - its been helping a lot.

Many thanks again,

Jamie
Comment

Announcement