  • Generate Unique Group ID in a Panel Data with Spelling Variations

    Dear Statalist users,
    I have created a panel dataset based on election results. It is a fairly large dataset covering 8 elections; I have included only a small portion, from 3 election cycles, here:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int year str9 state str7 city str10(village winner) str9 votes
    2000 "karnataka" "mysore"  "thirumpete" "rajesha"    "1000"    
    2000 "karnataka" "mysore"  "narsipura"  "vanaja"     "850"      
    2000 "karnataka" "mysore"  "patna"      "kumara"     "900"      
    2000 "karnataka" "mysore"  "hd kote"    "hitesh"     "1989"    
    2005 "karnatak"  "mysore"  "tirumpete"  "rajesha"    "157"      
    2005 "karnatak"  "mysore"  "narsipur"   "vikram"     "1244"    
    2005 "karnatak"  "mysore"  "patna"      "umayal"     "234"      
    2005 "karnatak"  "mysore"  "hdkote"     "amina bano" "999"      
    2010 "karnataka" "mysor e" "thirumpete" "rajesha"    "134"      
    2010 "karnataka" "mysor e" "narsipura"  "vanaja"     "593"      
    2010 "karnataka" "mysor e" "patnaa"     "amina bano" "unopposed"
    2010 "karnataka" "mysor e" "hd kote"    "muddassir"  "1241"    
    end
    This is election data for villages in the city of Mysore, in the state of Karnataka, with the name of the winner and the number of votes received in each village.
    I need a panel with a unique id for each village, so that I can follow election winners over the years. However, because the spellings of the state, city and village vary across years, I cannot think of a tractable way to do this.

    Thanks!
    Last edited by Patrick Que; 15 Dec 2022, 22:59. Reason: paneldata, fuzzymatch, groupid,datawrangling

  • #2
    Progress Update: My attempts to solve this have led me to think about creating a long list of all the village names, and then matching the village names in the data against this long list.
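
    One way to sketch the "long list" idea with built-in commands only is as a crosswalk from each variant spelling to a standard spelling, merged onto names that have first been lowercased and stripped of blanks. The data below are a trimmed copy of the example in #1; the crosswalk rows and the names village_clean, village_std and village_id are made up for illustration and are an assumption, not a complete or verified list.

    ```stata
    * Sketch only: the crosswalk is hand-typed and hypothetical.
    * 1. Crosswalk from each cleaned variant to one standard spelling.
    clear
    input str10(village_clean village_std)
    "thirumpete" "thirumpete"
    "tirumpete"  "thirumpete"
    "narsipura"  "narsipura"
    "narsipur"   "narsipura"
    "patna"      "patna"
    "patnaa"     "patna"
    "hdkote"     "hdkote"
    end
    tempfile xwalk
    save `xwalk'

    * 2. Trimmed copy of the example data (winner and votes left out).
    clear
    input int year str9 state str7 city str10 village
    2000 "karnataka" "mysore"  "thirumpete"
    2000 "karnataka" "mysore"  "narsipura"
    2000 "karnataka" "mysore"  "patna"
    2000 "karnataka" "mysore"  "hd kote"
    2005 "karnatak"  "mysore"  "tirumpete"
    2005 "karnatak"  "mysore"  "narsipur"
    2005 "karnatak"  "mysore"  "patna"
    2005 "karnatak"  "mysore"  "hdkote"
    2010 "karnataka" "mysor e" "thirumpete"
    2010 "karnataka" "mysor e" "narsipura"
    2010 "karnataka" "mysor e" "patnaa"
    2010 "karnataka" "mysor e" "hd kote"
    end

    * 3. Normalize: lowercase, trim, then drop all blanks ("mysor e" -> "mysore").
    foreach v of varlist state city village {
        gen `v'_clean = subinstr(lower(strtrim(`v')), " ", "", .)
    }

    * 4. Attach the standard spelling; _merge == 1 flags variants the list misses.
    merge m:1 village_clean using `xwalk', keep(master match)
    list year village if _merge == 1

    * 5. One id per standardized village.
    *    State still varies ("karnatak"), so it would need its own crosswalk
    *    before being added to the group() list.
    egen village_id = group(village_std)
    sort village_id year
    list year village village_std village_id, sepby(village_id)
    ```

    Removing blanks already reconciles "hd kote" with "hdkote" and "mysor e" with "mysore"; the crosswalk only has to cover what is left. Any spelling that is not in the list shows up with _merge == 1 rather than being silently grouped.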


    • #3
      Cross-posted on Stack Overflow. Please note our policy on cross-posting, which is that you tell us about it.


      • #4
        Re #2: yes, but how will you "match" the villages with the long list? Unless the long list is actually a crosswalk between all possible variant spellings and the standard spelling, -merge- would just leave you with all spelling errors unmatched. The best tool for this fuzzy matching task in Stata is, as far as I know, Julio Raffo's -matchit-, available from SSC. The use of -matchit- is a bit complicated. You will need to invest some time reading the help file and gaining an understanding of how it works, and then it will take some trial and error to find option settings that give the best results for your data. In the end, it will pair up each observation in your data set with other observations that are plausible matches on state, city, and village. But in all likelihood, you are going to have to weed through those results by eye and separate "the chaff from the wheat." It is possible that despite your best efforts, there will be some ambiguous situations that cannot be resolved, though probably only a handful.
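
        A minimal sketch of how -matchit- can be pointed at a list of names compared with itself, assuming the package has been installed from SSC. The file name village_copy and the variables id, id2 and village2 are hypothetical, the option names should be checked against the help file, and the threshold of 0.5 is only a starting point for the trial and error described above.

        ```stata
        * Assumption: one-time install; -matchit- may also ask for -freqindex-.
        * ssc install matchit
        * ssc install freqindex

        * Distinct village spellings from the example in #1.
        clear
        input str10 village
        "thirumpete"
        "tirumpete"
        "narsipura"
        "narsipur"
        "patna"
        "patnaa"
        "hd kote"
        "hdkote"
        end
        gen long id = _n

        * Save a renamed copy so the list can be compared with itself.
        preserve
        rename (id village) (id2 village2)
        save village_copy, replace
        restore

        * Pair each spelling with similar spellings in the copy.
        * Pairs scoring below the threshold are not kept.
        matchit id village using village_copy.dta, idusing(id2) txtusing(village2) threshold(0.5)

        * Keep each unordered pair once, then review by eye, best scores first.
        drop if id >= id2
        gsort -similscore
        list village village2 similscore, noobs
        ```

        The pairs that survive the review can then be turned into a crosswalk from variant to standard spelling. Very short names give the similarity score little to work with, so the weeding by eye matters most for them.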


        • #5
          Hi Nick, I apologize for this. I will be more thorough from now on.
