Hi,
I am using Stata 13 on Windows 10.
I have two datsets that I had to merge on basis of area_name as there was no unique identifier.
Some observations matched, but a lot did not. I have attached some of the non-matched observations (about 400 in total).
I see two types of issues that can improve the merge result.
First, change the name of state at the end of the observation (after comma) in the variable area_name. For example, "Allentown-Bethlehem-Easton, PA-NJ" in master dataset and "Allentown-Bethlehem-Easton, PA" in using dataset. Please note that there are some observations that after splitting/parsing at comma, are not unique. For example, there are may be multiple "Allentown-Bethlehem-Easton" in the dataset, in which case I cannot merge or have to use duplicates drop before merging.
Second, change the name of the area at beginning of the observation (before comma) in variable area_name. For example, "Atlanta-Sandy Springs-Roswell, GA" in master dataset and "Atlanta, GA" in using dataset. For these cases, I was thinking about splitting the strings further (parse on spaces, "-" and ","), and generate a measure of how many splitted strings match.
One approach is to use brute force and eyeball the entire dataset, change variable area_name observation and re-merge.
However, I am wondering if there are better ways to proceed forward with this problem? Thanks in advance.
----------------------- copy starting from the next line -----------------------
I am using Stata 13 on Windows 10.
I have two datsets that I had to merge on basis of area_name as there was no unique identifier.
Some observations matched, but a lot did not. I have attached some of the non-matched observations (about 400 in total).
I see two types of issues that can improve the merge result.
First, change the name of state at the end of the observation (after comma) in the variable area_name. For example, "Allentown-Bethlehem-Easton, PA-NJ" in master dataset and "Allentown-Bethlehem-Easton, PA" in using dataset. Please note that there are some observations that after splitting/parsing at comma, are not unique. For example, there are may be multiple "Allentown-Bethlehem-Easton" in the dataset, in which case I cannot merge or have to use duplicates drop before merging.
Second, change the name of the area at beginning of the observation (before comma) in variable area_name. For example, "Atlanta-Sandy Springs-Roswell, GA" in master dataset and "Atlanta, GA" in using dataset. For these cases, I was thinking about splitting the strings further (parse on spaces, "-" and ","), and generate a measure of how many splitted strings match.
One approach is to use brute force and eyeball the entire dataset, change variable area_name observation and re-merge.
However, I am wondering if there are better ways to proceed forward with this problem? Thanks in advance.
----------------------- copy starting from the next line -----------------------
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input str46 area_name byte m "Allentown-Bethlehem-Easton, PA-NJ" 1 "Ames, IA" 1 "Anniston-Oxford-Jacksonville, AL" 1 "Appleton, WI" 1 "Athens-Clarke County, GA" 1 "Atlanta-Sandy Springs-Roswell, GA" 1 "Atlantic City-Hammonton, NJ" 1 "Augusta-Richmond County, GA-SC" 1 "Austin-Round Rock, TX" 1 "Baltimore-Columbia-Towson, MD" 1 "Barnstable Town, MA" 1 "Battle Creek, MI" 1 "Bay City, MI" 1 "Beckley, WV" 1 "Bend-Redmond, OR" 1 "Birmingham-Hoover, AL" 1 "Blacksburg-Christiansburg-Radford, VA" 1 "Bloomington, IL" 1 "Bloomsburg-Berwick, PA" 1 "Boston-Cambridge-Newton, MA-NH" 1 "Boulder, CO" 1 "Bowling Green, KY" 1 "Bremerton-Silverdale, WA" 1 "Bridgeport-Stamford-Norwalk, CT" 1 "Brownsville-Harlingen, TX" 1 "Brunswick, GA" 1 "Buffalo-Cheektowaga-Niagara Falls, NY" 1 "Burlington, NC" 1 "Burlington-South Burlington, VT" 1 "California-Lexington Park, MD" 1 "Cape Coral-Fort Myers, FL" 1 "Cape Girardeau, MO-IL" 1 "Carbondale-Marion, IL" 1 "Carson City, NV" 1 "Chambersburg-Waynesboro, PA" 1 "Charlotte-Concord-Gastonia, NC-SC" 1 "Chicago-Naperville-Elgin, IL-IN-WI" 1 "Chico, CA" 1 "Clarksville, TN-KY" 1 "Cleveland, TN" 1 "Cleveland-Elyria, OH" 1 "Coeur d'Alene, ID" 1 "College Station-Bryan, TX" 1 "Crestview-Fort Walton Beach-Destin, FL" 1 "Dallas-Fort Worth-Arlington, TX" 1 "Dalton, GA" 1 "Danville, IL" 1 "Daphne-Fairhope-Foley, AL" 1 "Dayton, OH" 1 "Aguadilla, PR" 2 "Allentown-Bethlehem-Easton, PA" 2 "Anniston, AL" 2 "Appleton-Oshkosh-Neenah, WI" 2 "Arecibo, PR" 2 "Athens, GA" 2 "Atlanta, GA" 2 "Atlantic-Cape May, NJ" 2 "Augusta-Aiken, GA-SC" 2 "Austin-San Marcos, TX" 2 "Baltimore, MD" 2 "Barnstable-Yarmouth, MA" 2 "Benton Harbor, MI" 2 "Bergen-Passaic, NJ" 2 "Biloxi-Gulfport-Pascagoula, MS" 2 "Birmingham, AL" 2 "Bloomington, IN" 2 "Bloomington-Normal, IL" 2 "Boston, MA-NH" 2 "Boulder-Longmont, CO" 2 "Brazoria, TX" 2 "Bremerton, WA" 2 "Bridgeport, CT" 2 "Brockton, MA" 2 "Brownsville-Harlingen-San Benito, TX" 2 "Bryan-College Station, TX" 2 "Buffalo-Niagara Falls, NY" 2 "Burlington, VT" 2 "Caguas, PR" 2 "Charlotte-Gastonia-Rock Hill, NC-SC" 2 "Chicago, IL" 2 "Chico-Paradise, CA" 2 "Clarksville-Hopkinsville, TN-KY" 2 "Cleveland-Lorain-Elyria, OH" 2 "Columbia, SC" 2 "Columbus, OH" 2 "Dallas, TX" 2 "Danbury, CT" 2 "Danville, VA" 2 "Dayton-Springfield, OH" 2 "Daytona Beach, FL" 2 "Decatur, IL" 2 "Denver, CO" 2 "Des Moines, IA" 2 "Detroit, MI" 2 "Duluth-Superior, MN-WI" 2 "Dutchess County, NY" 2 "Eugene-Springfield, OR" 2 "Evansville-Henderson, IN-KY" 2 "Fargo-Moorhead, ND-MN" 2 "Fayetteville-Springdale-Rogers, AR" 2 end label values m _merge label def _merge 1 "master only (1)", modify label def _merge 2 "using only (2)", modify
Comment