Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging based on AreaNames : Standardizing Unique Identifier

    Hi,
    I am using Stata 13 on Windows 10.

    I have two datsets that I had to merge on basis of area_name as there was no unique identifier.
    Some observations matched, but a lot did not. I have attached some of the non-matched observations (about 400 in total).

    I see two types of issues that can improve the merge result.
    First, change the name of state at the end of the observation (after comma) in the variable area_name. For example, "Allentown-Bethlehem-Easton, PA-NJ" in master dataset and "Allentown-Bethlehem-Easton, PA" in using dataset. Please note that there are some observations that after splitting/parsing at comma, are not unique. For example, there are may be multiple "Allentown-Bethlehem-Easton" in the dataset, in which case I cannot merge or have to use duplicates drop before merging.

    Second, change the name of the area at beginning of the observation (before comma) in variable area_name. For example, "Atlanta-Sandy Springs-Roswell, GA" in master dataset and "Atlanta, GA" in using dataset. For these cases, I was thinking about splitting the strings further (parse on spaces, "-" and ","), and generate a measure of how many splitted strings match.

    One approach is to use brute force and eyeball the entire dataset, change variable area_name observation and re-merge.
    However, I am wondering if there are better ways to proceed forward with this problem? Thanks in advance.

    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str46 area_name byte m
    "Allentown-Bethlehem-Easton, PA-NJ"      1
    "Ames, IA"                               1
    "Anniston-Oxford-Jacksonville, AL"       1
    "Appleton, WI"                           1
    "Athens-Clarke County, GA"               1
    "Atlanta-Sandy Springs-Roswell, GA"      1
    "Atlantic City-Hammonton, NJ"            1
    "Augusta-Richmond County, GA-SC"         1
    "Austin-Round Rock, TX"                  1
    "Baltimore-Columbia-Towson, MD"          1
    "Barnstable Town, MA"                    1
    "Battle Creek, MI"                       1
    "Bay City, MI"                           1
    "Beckley, WV"                            1
    "Bend-Redmond, OR"                       1
    "Birmingham-Hoover, AL"                  1
    "Blacksburg-Christiansburg-Radford, VA"  1
    "Bloomington, IL"                        1
    "Bloomsburg-Berwick, PA"                 1
    "Boston-Cambridge-Newton, MA-NH"         1
    "Boulder, CO"                            1
    "Bowling Green, KY"                      1
    "Bremerton-Silverdale, WA"               1
    "Bridgeport-Stamford-Norwalk, CT"        1
    "Brownsville-Harlingen, TX"              1
    "Brunswick, GA"                          1
    "Buffalo-Cheektowaga-Niagara Falls, NY"  1
    "Burlington, NC"                         1
    "Burlington-South Burlington, VT"        1
    "California-Lexington Park, MD"          1
    "Cape Coral-Fort Myers, FL"              1
    "Cape Girardeau, MO-IL"                  1
    "Carbondale-Marion, IL"                  1
    "Carson City, NV"                        1
    "Chambersburg-Waynesboro, PA"            1
    "Charlotte-Concord-Gastonia, NC-SC"      1
    "Chicago-Naperville-Elgin, IL-IN-WI"     1
    "Chico, CA"                              1
    "Clarksville, TN-KY"                     1
    "Cleveland, TN"                          1
    "Cleveland-Elyria, OH"                   1
    "Coeur d'Alene, ID"                      1
    "College Station-Bryan, TX"              1
    "Crestview-Fort Walton Beach-Destin, FL" 1
    "Dallas-Fort Worth-Arlington, TX"        1
    "Dalton, GA"                             1
    "Danville, IL"                           1
    "Daphne-Fairhope-Foley, AL"              1
    "Dayton, OH"                             1
    "Aguadilla, PR"                          2
    "Allentown-Bethlehem-Easton, PA"         2
    "Anniston, AL"                           2
    "Appleton-Oshkosh-Neenah, WI"            2
    "Arecibo, PR"                            2
    "Athens, GA"                             2
    "Atlanta, GA"                            2
    "Atlantic-Cape May, NJ"                  2
    "Augusta-Aiken, GA-SC"                   2
    "Austin-San Marcos, TX"                  2
    "Baltimore, MD"                          2
    "Barnstable-Yarmouth, MA"                2
    "Benton Harbor, MI"                      2
    "Bergen-Passaic, NJ"                     2
    "Biloxi-Gulfport-Pascagoula, MS"         2
    "Birmingham, AL"                         2
    "Bloomington, IN"                        2
    "Bloomington-Normal, IL"                 2
    "Boston, MA-NH"                          2
    "Boulder-Longmont, CO"                   2
    "Brazoria, TX"                           2
    "Bremerton, WA"                          2
    "Bridgeport, CT"                         2
    "Brockton, MA"                           2
    "Brownsville-Harlingen-San Benito, TX"   2
    "Bryan-College Station, TX"              2
    "Buffalo-Niagara Falls, NY"              2
    "Burlington, VT"                         2
    "Caguas, PR"                             2
    "Charlotte-Gastonia-Rock Hill, NC-SC"    2
    "Chicago, IL"                            2
    "Chico-Paradise, CA"                     2
    "Clarksville-Hopkinsville, TN-KY"        2
    "Cleveland-Lorain-Elyria, OH"            2
    "Columbia, SC"                           2
    "Columbus, OH"                           2
    "Dallas, TX"                             2
    "Danbury, CT"                            2
    "Danville, VA"                           2
    "Dayton-Springfield, OH"                 2
    "Daytona Beach, FL"                      2
    "Decatur, IL"                            2
    "Denver, CO"                             2
    "Des Moines, IA"                         2
    "Detroit, MI"                            2
    "Duluth-Superior, MN-WI"                 2
    "Dutchess County, NY"                    2
    "Eugene-Springfield, OR"                 2
    "Evansville-Henderson, IN-KY"            2
    "Fargo-Moorhead, ND-MN"                  2
    "Fayetteville-Springdale-Rogers, AR"     2
    end
    label values m _merge
    label def _merge 1 "master only (1)", modify
    label def _merge 2 "using only (2)", modify

  • #2
    There is some material on line about efficient ways to do this. Personally, with only 400 problem observations, I'd be tempted to do it manually.

    Comment


    • #3
      Hi Phil,
      Can you please share more details of the arterial you are referencing to? Thanks.

      Comment


      • #4
        -ssc describe reclink- gives information onn one relevant community-contributed command.

        Comment


        • #5
          Thank you Mike.

          Comment

          Working...
          X