
  • Problem in String Matching using lists

    I have sample data on individual addresses, which is quite unclean. There are more than 400,000 observations.
    MAIN MARKET OPP.POLICE STATION, ABC STREET, DELHI-400001
    ABC PRINTERS, RLY NEW COLONY VISAKHAPATNAM -530016
    ABC building, XYZ Road, Ville Parle (East), Mumbai 400 057
    The addresses do not share a common delimiter, so I have created a list of the district names of India. My problem is to match each district name against every address; when a name matches, I want to record it in another variable named 'city'. How should I match the list of city names against the string variable containing the addresses?

  • #2
    Some questions/observations.

    1) Do your individual addresses always include the 6-digit India Pincode postal codes? Probably not, so the real question is: can you ignore observations with a missing pincode? Remember that nothing is going to be perfect for 400,000 observations, so you will have to settle for good enough no matter what you do.

    2) Is it the case that in the above data, the three district (city) names that you want to match are Delhi, Visakhapatnam, and Mumbai?

    3) Looking at a map of India suggests there are many cities with more than one word in their name, such as Mumbai and Navi Mumbai. Are these treated as different districts/cities?

    My own belief is that matching on city names will be difficult at best, since you have no reliable way of separating the city from the rest of the address. With my limited familiarity with your data, I think it might be easier to use the 6-digit pincode and match it against one of the downloadable pincode directories at https://www.indiapost.gov.in/vas/pages/findpincode.aspx . You will have to reduce the pincode directory to one observation per pincode, which should be possible, since I believe the pincodes represent multiple delivery areas within a single district.

    Here is some code that extracts the pincode from your sample data (augmented by one observation lacking a pincode, for my testing).
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str80 address
    "MAIN MARKET OPP.POLICE STATION, ABC STREET, DELHI-400001"  
    "ABC PRINTERS, RLY NEW COLONY VISAKHAPATNAM -530016"        
    "ABC building, XYZ Road, Ville Parle (East), Mumbai 400 057"
    "XYZ bakery, XYZ Road, Ville Parle (West), Mumbai"
    end
    
    generate strpin =  ustrregexs(0) if ustrregexm(address,"[0-9 ]*$")
    replace strpin = usubinstr(strpin," ","",.)
    list, clean noobs
    Code:
    . list, clean noobs
    
                                                           address   strpin  
          MAIN MARKET OPP.POLICE STATION, ABC STREET, DELHI-400001   400001  
                ABC PRINTERS, RLY NEW COLONY VISAKHAPATNAM -530016   530016  
        ABC building, XYZ Road, Ville Parle (East), Mumbai 400 057   400057  
                  XYZ bakery, XYZ Road, Ville Parle (West), Mumbai
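
    If you go the pincode route, here is a minimal sketch of the next step: matching the extracted pincode against a directory reduced to one observation per pincode. The file name pincode_directory.dta and its variable pincode are assumptions, not part of the data above.

    Code:
    * Sketch only: assumes pincode_directory.dta holds one observation per
    * pincode, with a string variable pincode and the district name for each.
    generate pincode = strpin
    merge m:1 pincode using pincode_directory.dta, keep(master match) nogenerate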



    • #3
      You are right: not all observations have pin codes. Only around 50% of the addresses have them, so those are sorted out for now. But I cannot afford to drop 50% of the data, so I am extracting the district/city names from the addresses. One city can have a couple of pin codes, but that can be handled later. You are also right about Navi Mumbai being different from Mumbai. So I plan to make it 75% programmatic and 15% eyeballing and manual cleaning.

      Is there a way I can make a list of all districts and match it against the string of addresses? I want something along the following lines (a rough sketch of the idea):

      Code:
      generate district = ""
      local district_list "Delhi Mumbai Visakhapatnam"
      foreach d of local district_list {
            replace district = "`d'" if strpos(address, "`d'") > 0
      }



      • #4
        Code:
        gen state = ""
        gen dummy = strpos(address, "orissa") > 0
        replace state = "orissa" if dummy == 1
        drop dummy
        This piece of code lets me search for the word "orissa" (an Indian state); if it is found, I set state to "orissa".
        My next problem is to do the same for districts. But instead of naming each district one by one, I want to make a list of all district names and loop over them.



        • #5
          Hi Kedar,

          1) Could you provide 20-25 more addresses? (Change some of the street names or something if you need to.) People will be able to give you *much* better help that way (it also makes it easier to see the various permutations in the data). Please use Stata's dataex command when doing so. If you need help, I created a YouTube video on it here. (Watch at 2x speed :-)

          2) Since you've been able to extract the PIN codes (for the 50%) that have them, your next step might be to try and isolate the city in your data (some ideas on that below). Once you've isolated the cities, you can merge in based on city (using a directory like the one William linked to). Or you can use some of the fuzzy matching commands listed below.

          Code:
          * If these names are not likely to otherwise show up elsewhere (i.e. no Bangalore Street in Delhi) you could run this immediately and see what you get
          local city_list "Mumbai Delhi Bangalore Hyderabad Ahmedabad Chennai Kolkata Surat Pune Jaipur Lucknow Kanpur Nagpur Visakhapatnam Indore"
          gen city = ""
          
          foreach c in `city_list' {
            replace city = "`c'" if strpos(address, "`c'") > 0
          }
          
          * If not, use split command to split address by comma as next pass. In obs #1 and #3, once you do that you could remove numbers and be left with the city
          split address, parse(,) gen(part)
          
          foreach v of varlist part* {
          replace `v' = itrim(trim(`v'))
          }
          
          egen p_count = rownonmiss(part*), strok  // count how many parts the address had
          list part* p_count, noobs
          
            +-----------------------------------------------------------------------------------------------------------------------+
            |                          part1                                  part2                part3            part4   p_count |
            |-----------------------------------------------------------------------------------------------------------------------|
            | MAIN MARKET OPP.POLICE STATION                             ABC STREET         DELHI-400001                          3 |
            |                   ABC PRINTERS   RLY NEW COLONY VISAKHAPATNAM -530016                                               2 |
            |                   ABC building                               XYZ Road   Ville Parle (East)   Mumbai 400 057         4 |
            +-----------------------------------------------------------------------------------------------------------------------+
          
          * I assume city will end up in the last part this way
          * If you have lots of parts, you can do this as a loop
          capture drop city   // drop city from the first approach, if it exists
          gen city = part4 if p_count==4
          replace city = part3 if p_count==3
          replace city = part2 if p_count==2
          
          . list city
          
               +--------------------------------------+
               |                                 city |
               |--------------------------------------|
            1. |                         DELHI-400001 |
            2. | RLY NEW COLONY VISAKHAPATNAM -530016 |
            3. |                       Mumbai 400 057 |
               +--------------------------------------+
          
          replace city = ustrregexra(city, "[-0-9]","")  // this removes all numbers and gets rid of the "-" as well
          * Could also do as: strkeep city, replace alpha keep(" ")   (SSC install strkeep)
          replace city = itrim(trim(city))
          
          . list city
          
               +------------------------------+
               |                         city |
               |------------------------------|
            1. |                        DELHI |
            2. | RLY NEW COLONY VISAKHAPATNAM |
            3. |                       Mumbai |
               +------------------------------+
          
          * Now run the loop over cities (or merge with a Stata file of cities & states)
          * So this would seem to solve obs #1 and #3, but #2 would need further (probably manual) processing
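
          The "do this as a loop" idea above can be sketched as follows. This is only a sketch, assuming the part1-part4 variables and p_count created by the split; raise the upper bound if your addresses split into more parts.

          Code:
          capture drop city
          gen city = ""
          forvalues i = 1/4 {
                replace city = part`i' if p_count == `i'
          }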

          Some other useful commands / resources that you might explore:
          1) This Stata Journal article: Record linkage using Stata: Preprocessing, linking, and reviewing utilities.

          Abstract
          In this article, we describe Stata utilities that facilitate probabilistic record linkage—the technique typically used for merging two datasets with no common record identifier. While the preprocessing tools are developed specifically for linking two company databases, the other tools can be used for many different types of linkage. Specifically, the stnd_compname and stnd_address commands parse and standardize company names and addresses to improve the match quality when linking.

          Keywords: dm0082, reclink2, clrevmatch, reclink, stnd_compname, stnd_address, record linkage, fuzzy matching, string standardization

          2) This blog post on text matching in Stata, https://orgtheory.wordpress.com/2012...text-matching/

          3) Stata's matchit and strgroup commands (SSC)
          Last edited by David Benson; 09 Jan 2019, 01:13.



          • #6
            Also might want to take a look at the user-written dtalink command (SSC install dtalink). For more info, see https://www.stata.com/meeting/columb...18_Kranker.pdf



            • #7
              Thanks a ton, David. That was a great help. Creating a list and passing each element through the addresses works. For the states, I did the following:

              Code:
              gen state=""
              local state_list `" "andaman" "nicobar"    "andhra" "arunachal"    "assam"    "bihar"    "chandigarh"    "chattisgarh"    "dadra & nagar haveli"    "daman" "diu"    "goa"    "gujarat"    "haryana"    "himachal"    "jammu"  "kashmir"    "jharkhand"    "karnataka"    "kerala" "lakshadweep"    "madhya pradesh"    "maharashtra"    "manipur"    "meghalaya"    "mizoram"    "nagaland"    "delhi"    "odisha"    "puducherry"    "punjab"    "rajasthan"    "sikkim"    "tamil nadu"    "telangana"    "tripura"    "uttar pradesh"    "uttarakhand"    "west bengal" "'
              
              foreach x in `state_list' {
                  replace state = "`x'" if strpos(address, "`x'") > 0
              }

              Yet there is a problem, and I think it needs some eyeballing to solve. There is the popular tourist state called "Goa", whose name can get matched inside a place name like "Goalpara". Probably this calls for some manual work.

              I took up the idea of splitting the address, but the addresses are so long that they create 15 to 20 split variables, which becomes very messy and difficult to handle.



              • #8
                Kedar Kelkar - your reply in post #3 reassures me that you have a realistic expectation of what can be accomplished.

                Others on Statalist with similar problems have had good results using the matchit command written by Julio Raffo and available from SSC (see the output of search matchit for details). It is designed for the sort of problem you are having, and can produce multiple potential matches for a given observation with a "score" indicating similarity. This can perhaps help you reduce the amount of manual effort involved, as you mention in post #7.

                The dtalink command mentioned in post #6 is newer than matchit and is likely also suitable for your task, but searching Statalist turns up little reported experience with it so far.

                If I were doing this, I would start with 100 observations from your dataset to train myself in how best to choose the available options and what to expect from the output. I'd then try, say, 1,000 observations to see if the assumptions scale up, and work my way up to the full dataset.

                I encourage you strongly to expend your effort in learning to use one of these two commands (or both, and post a comparative review here afterwards!) rather than in trying to implement your own ad-hoc parsing techniques that these commands improve on.

                I think if you do try matchit or dtalink and encounter problems, you would be well advised to start a new topic including the command name in the title.

                Good luck, and let us know what results you have. This is a recurring problem on Statalist, and every additional piece of information helps.



                • #9
                  My two cents:

                  Yes, you can use -matchit- to help with this. An easy way forward is matching your original file against a file containing the city names. Below is a very simple example showing how to implement it. Keep in mind that this will also find city names that appear as part of street names; for instance, if there is a "Mumbai Street" in Delhi.

                  The same approach can be applied to a list of postal codes. Hope it helps.


                  Code:
                  clear
                  input str80 address
                  "MAIN MARKET OPP.POLICE STATION, ABC STREET, DELHI-400001"  
                  "ABC PRINTERS, RLY NEW COLONY VISAKHAPATNAM -530016"        
                  "ABC building, XYZ Road, Ville Parle (East), Mumbai 400 057"
                  "XYZ bakery, XYZ Road, Ville Parle (West), Mumbai"
                  end
                  replace address=upper(address) // note that matchit is case sensitive
                  replace address=trim(itrim(ustrregexra(address, "[^a-z0-9 ]", " ",1))) // this is needed if you use the sim(token) option to avoid missing the "DELHI-400001" kind of cases
                  gen addr_id=_n
                  save addr_file,replace
                  clear
                  input str80 city
                  "DELHI"  
                  "Bengaluru"        
                  "Mumbai"
                  "Chennai"
                  "Visakhapatnam"
                  end
                  replace city=upper(city)
                  gen city_id=_n
                  save city_file, replace
                  
                  matchit city_id city using addr_file.dta, idu(addr_id) txtu(address) sim(token) s(minsimple)
                  list
                  /*
                       +-------------------------------------------------------------------------------------------------------+
                       | city_id            city   addr_id                                                  address   simils~e |
                       |-------------------------------------------------------------------------------------------------------|
                    1. |       1           DELHI         1   MAIN MARKET OPP POLICE STATION ABC STREET DELHI 400001          1 |
                    2. |       3          MUMBAI         3    ABC BUILDING XYZ ROAD VILLE PARLE EAST MUMBAI 400 057          1 |
                    3. |       3          MUMBAI         4              XYZ BAKERY XYZ ROAD VILLE PARLE WEST MUMBAI          1 |
                    4. |       5   VISAKHAPATNAM         2         ABC PRINTERS RLY NEW COLONY VISAKHAPATNAM 530016          1 |
                       +-------------------------------------------------------------------------------------------------------+
                  */
                  use city_file, clear
                  matchit city_id city using addr_file.dta, idu(addr_id) txtu(address) s(minsimple) // if you think there might be misspellings, use bigram (default) or any other n-gram  option
                  list
                  /*
                       +-------------------------------------------------------------------------------------------------------+
                       | city_id            city   addr_id                                                  address   simils~e |
                       |-------------------------------------------------------------------------------------------------------|
                    1. |       1           DELHI         1   MAIN MARKET OPP POLICE STATION ABC STREET DELHI 400001          1 |
                    2. |       3          MUMBAI         3    ABC BUILDING XYZ ROAD VILLE PARLE EAST MUMBAI 400 057          1 |
                    3. |       3          MUMBAI         4              XYZ BAKERY XYZ ROAD VILLE PARLE WEST MUMBAI          1 |
                    4. |       5   VISAKHAPATNAM         2         ABC PRINTERS RLY NEW COLONY VISAKHAPATNAM 530016          1 |
                       +-------------------------------------------------------------------------------------------------------+
                  */



                  • #10
                    So I used -matchit-. It is undoubtedly useful, but for a sample of 400,000+ it made multiple matches and gave a score for each; the resulting output had 700,000+ observations. Further, the problem of a Mumbai highway in Pune being recorded as Mumbai (with match score = 1) persists. I personally found -strpos- more useful than -matchit-: my observation is that -matchit- needs a rigorous manual check, while with -strpos- I have a comparatively smaller number of observations to go through.



                    • #11
                      Removing multiple matches after using -matchit- can be relatively straightforward in Stata. William Lisowski has posted here an example to do it, which, adapted to this thread's example, is:
                      Code:
                      ...
                      // reduce crosswalk to highest score for each observation from the first dataset
                      by addr_id (similscore): keep if _n==_N
                      ...
                      I do agree that knowing the position of the matching city name would let you pick the rightmost one, which could help solve your "Mumbai highway" kind of issue. You could do this by creating a variable for each city name in a very long do-file, but you can also do it after running -matchit- and have the best of both worlds (;-D). Here is an example based on the sample data above:

                      Code:
                      // add "Mumbai highway 999, Pune 012 345" to the address file
                      // add "pune" to the city file
                      ...
                      matchit city_id city using addr_file.dta, idu(addr_id) txtu(address) sim(token) s(minsimple)
                      gen matchpos=strpos(address,city)
                      list if addr_id==5, noobs
                      /*
                        +------------------------------------------------------------------------------------+
                        | city_id     city   addr_id                           address   simils~e   matchpos |
                        |------------------------------------------------------------------------------------|
                        |       3   MUMBAI         5   MUMBAI HIGHWAY 999 PUNE 012 345          1          1 |
                        |       6     PUNE         5   MUMBAI HIGHWAY 999 PUNE 012 345          1         20 |
                        +------------------------------------------------------------------------------------+
                      */
                      bysort addr_id (similscore matchpos): keep if _n==_N
                      list if addr_id==5, noobs
                      /*
                        +----------------------------------------------------------------------------------+
                        | city_id   city   addr_id                           address   simils~e   matchpos |
                        |----------------------------------------------------------------------------------|
                        |       6   PUNE         5   MUMBAI HIGHWAY 999 PUNE 012 345          1         20 |
                        +----------------------------------------------------------------------------------+
                      */



                      • #12
                        I would generalize Julio Raffo's advice as follows. With 400,000 poorly formatted addresses, why would you want to ignore the guidance matchit can give you in narrowing down the possibilities? As Julio points out, the output produced by matchit is available for processing, just as the input is. Why not start from the matchit output and limit your efforts to correcting systematic problems, after matchit has first correctly matched VISAKHAPATNAM to the typographical error VISKHAPATNAM?

                        Let me propose an alternative for the "highway" problem. As part of your initial preparation of the data, where you will be converting everything to uppercase, I would also collapse runs of internal spaces to a single space, and then replace " HIGHWAY" with "HIGHWAY" so that you end up with "MUMBAIHIGHWAY", which I believe would reduce the similarity score for that match.
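
                        A minimal sketch of that preprocessing, using the address variable from the earlier examples:

                        Code:
                        replace address = upper(address)
                        replace address = itrim(trim(address))   // no more than one internal space in a row
                        replace address = subinstr(address, " HIGHWAY", "HIGHWAY", .)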



                        • #13
                          William Lisowski, I agree with you, but -matchit- increases the pain manifold. I'll give you an example. -matchit- offers an option for changing the benchmark score; the default is 0.3 (I kept it the same). In the final stage I am left with only 55,000 observations (addresses). After running matchit, the resulting number of observations is 810,000. I then looked at observations with score == 1, and many of them do not actually match. So now I have to check each of the 810,000 observations manually! Isn't it then better to take care of the 55,000 observations manually?



                          • #14
                            I would not suggest "checking observations manually" if by that you mean "read each observation and decide what to do with it." I would not do that with 810,000 observations and I would not do that with 55,000 observations.

                            I have said as much as I can on this problem. You have to do whatever you are comfortable with.



                            • #15
                              Thanks, William Lisowski. Of course, by manual I did not mean going through each and every observation; I mean more checks and balances. For instance, India has some 650 districts and 35 states and UTs. I do some merging and matching and get data on the pin codes. This needs a significant amount of eyeballing, not "read each observation and decide what to do with it".

                              Thanks for the help on this thread.

