Merging based on AreaNames : Standardizing Unique Identifier

Arvind Sharma

Join Date: Jan 2016
Posts: 89

Merging based on AreaNames : Standardizing Unique Identifier

30 Jul 2018, 13:01

Hi,
I am using Stata 13 on Windows 10.

I have two datsets that I had to merge on basis of area_name as there was no unique identifier.
Some observations matched, but a lot did not. I have attached some of the non-matched observations (about 400 in total).

I see two types of issues that can improve the merge result.
First, change the name of state at the end of the observation (after comma) in the variable area_name. For example, "Allentown-Bethlehem-Easton, PA-NJ" in master dataset and "Allentown-Bethlehem-Easton, PA" in using dataset. Please note that there are some observations that after splitting/parsing at comma, are not unique. For example, there are may be multiple "Allentown-Bethlehem-Easton" in the dataset, in which case I cannot merge or have to use duplicates drop before merging.

Second, change the name of the area at beginning of the observation (before comma) in variable area_name. For example, "Atlanta-Sandy Springs-Roswell, GA" in master dataset and "Atlanta, GA" in using dataset. For these cases, I was thinking about splitting the strings further (parse on spaces, "-" and ","), and generate a measure of how many splitted strings match.

One approach is to use brute force and eyeball the entire dataset, change variable area_name observation and re-merge.
However, I am wondering if there are better ways to proceed forward with this problem? Thanks in advance.

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str46 area_name byte m
"Allentown-Bethlehem-Easton, PA-NJ"      1
"Ames, IA"                               1
"Anniston-Oxford-Jacksonville, AL"       1
"Appleton, WI"                           1
"Athens-Clarke County, GA"               1
"Atlanta-Sandy Springs-Roswell, GA"      1
"Atlantic City-Hammonton, NJ"            1
"Augusta-Richmond County, GA-SC"         1
"Austin-Round Rock, TX"                  1
"Baltimore-Columbia-Towson, MD"          1
"Barnstable Town, MA"                    1
"Battle Creek, MI"                       1
"Bay City, MI"                           1
"Beckley, WV"                            1
"Bend-Redmond, OR"                       1
"Birmingham-Hoover, AL"                  1
"Blacksburg-Christiansburg-Radford, VA"  1
"Bloomington, IL"                        1
"Bloomsburg-Berwick, PA"                 1
"Boston-Cambridge-Newton, MA-NH"         1
"Boulder, CO"                            1
"Bowling Green, KY"                      1
"Bremerton-Silverdale, WA"               1
"Bridgeport-Stamford-Norwalk, CT"        1
"Brownsville-Harlingen, TX"              1
"Brunswick, GA"                          1
"Buffalo-Cheektowaga-Niagara Falls, NY"  1
"Burlington, NC"                         1
"Burlington-South Burlington, VT"        1
"California-Lexington Park, MD"          1
"Cape Coral-Fort Myers, FL"              1
"Cape Girardeau, MO-IL"                  1
"Carbondale-Marion, IL"                  1
"Carson City, NV"                        1
"Chambersburg-Waynesboro, PA"            1
"Charlotte-Concord-Gastonia, NC-SC"      1
"Chicago-Naperville-Elgin, IL-IN-WI"     1
"Chico, CA"                              1
"Clarksville, TN-KY"                     1
"Cleveland, TN"                          1
"Cleveland-Elyria, OH"                   1
"Coeur d'Alene, ID"                      1
"College Station-Bryan, TX"              1
"Crestview-Fort Walton Beach-Destin, FL" 1
"Dallas-Fort Worth-Arlington, TX"        1
"Dalton, GA"                             1
"Danville, IL"                           1
"Daphne-Fairhope-Foley, AL"              1
"Dayton, OH"                             1
"Aguadilla, PR"                          2
"Allentown-Bethlehem-Easton, PA"         2
"Anniston, AL"                           2
"Appleton-Oshkosh-Neenah, WI"            2
"Arecibo, PR"                            2
"Athens, GA"                             2
"Atlanta, GA"                            2
"Atlantic-Cape May, NJ"                  2
"Augusta-Aiken, GA-SC"                   2
"Austin-San Marcos, TX"                  2
"Baltimore, MD"                          2
"Barnstable-Yarmouth, MA"                2
"Benton Harbor, MI"                      2
"Bergen-Passaic, NJ"                     2
"Biloxi-Gulfport-Pascagoula, MS"         2
"Birmingham, AL"                         2
"Bloomington, IN"                        2
"Bloomington-Normal, IL"                 2
"Boston, MA-NH"                          2
"Boulder-Longmont, CO"                   2
"Brazoria, TX"                           2
"Bremerton, WA"                          2
"Bridgeport, CT"                         2
"Brockton, MA"                           2
"Brownsville-Harlingen-San Benito, TX"   2
"Bryan-College Station, TX"              2
"Buffalo-Niagara Falls, NY"              2
"Burlington, VT"                         2
"Caguas, PR"                             2
"Charlotte-Gastonia-Rock Hill, NC-SC"    2
"Chicago, IL"                            2
"Chico-Paradise, CA"                     2
"Clarksville-Hopkinsville, TN-KY"        2
"Cleveland-Lorain-Elyria, OH"            2
"Columbia, SC"                           2
"Columbus, OH"                           2
"Dallas, TX"                             2
"Danbury, CT"                            2
"Danville, VA"                           2
"Dayton-Springfield, OH"                 2
"Daytona Beach, FL"                      2
"Decatur, IL"                            2
"Denver, CO"                             2
"Des Moines, IA"                         2
"Detroit, MI"                            2
"Duluth-Superior, MN-WI"                 2
"Dutchess County, NY"                    2
"Eugene-Springfield, OR"                 2
"Evansville-Henderson, IN-KY"            2
"Fargo-Moorhead, ND-MN"                  2
"Fayetteville-Springdale-Rogers, AR"     2
end
label values m _merge
label def _merge 1 "master only (1)", modify
label def _merge 2 "using only (2)", modify

Tags: None

Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

31 Jul 2018, 10:49

There is some material on line about efficient ways to do this. Personally, with only 400 problem observations, I'd be tempted to do it manually.
Comment
Arvind Sharma

Join Date: Jan 2016

Posts: 89
#3

31 Jul 2018, 11:33

Hi Phil,
Can you please share more details of the arterial you are referencing to? Thanks.
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2411
#4

01 Aug 2018, 08:15

-ssc describe reclink- gives information onn one relevant community-contributed command.
Comment
Arvind Sharma

Join Date: Jan 2016

Posts: 89
#5

02 Aug 2018, 19:19

Thank you Mike.
Comment

Announcement

Merging based on AreaNames : Standardizing Unique Identifier

Comment

Comment

Comment

Comment