Generate Unique Group ID in a Panel Data with Spelling Variations

Patrick Que

Join Date: Jul 2020
Posts: 13

15 Dec 2022, 22:57

Dear Statalist users,
I have created a panel dataset based on election results. The full dataset is fairly large, covering 8 elections; I have included only a small portion, from 3 election cycles, here:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int year str9 state str7 city str10(village winner) str9 votes
2000 "karnataka" "mysore"  "thirumpete" "rajesha"    "1000"    
2000 "karnataka" "mysore"  "narsipura"  "vanaja"     "850"      
2000 "karnataka" "mysore"  "patna"      "kumara"     "900"      
2000 "karnataka" "mysore"  "hd kote"    "hitesh"     "1989"    
2005 "karnatak"  "mysore"  "tirumpete"  "rajesha"    "157"      
2005 "karnatak"  "mysore"  "narsipur"   "vikram"     "1244"    
2005 "karnatak"  "mysore"  "patna"      "umayal"     "234"      
2005 "karnatak"  "mysore"  "hdkote"     "amina bano" "999"      
2010 "karnataka" "mysor e" "thirumpete" "rajesha"    "134"      
2010 "karnataka" "mysor e" "narsipura"  "vanaja"     "593"      
2010 "karnataka" "mysor e" "patnaa"     "amina bano" "unopposed"
2010 "karnataka" "mysor e" "hd kote"    "muddassir"  "1241"    
end

This is election data for different villages in the city of Mysore, in the state of Karnataka, with the name of the winner and the number of votes received.
I need a panel that has a unique ID for each village, along with the election winners over the years. However, because the spellings of the state, city and village vary from one election to the next, I cannot think of a tractable way to do this.

Thanks!

Last edited by Patrick Que; 15 Dec 2022, 22:59. Reason: paneldata, fuzzymatch, groupid,datawrangling


Patrick Que

Join Date: Jul 2020

Posts: 13
#2

16 Dec 2022, 09:47

Progress update: My attempts to solve this have led me to think about creating a long list of all the village names, and then matching the village names in the data against this list.
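
A minimal sketch of how such a list could be built, assuming the -dataex- example from #1 is in memory. The variable name n_obs is made up for illustration. It only enumerates the distinct spellings; it does not match them to each other.

Code:

```stata
* Sketch: one row per distinct state/city/village spelling.
* Assumes the -dataex- example from #1 is in memory.
preserve
contract state city village, freq(n_obs)   // n_obs = number of rows with that exact spelling
sort state city village
list, noobs
restore
```

In the example data every state/city/village combination appears exactly once, because the state is misspelled in 2005 and the city in 2010, so an exact match on these three variables would link nothing across years.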
Nick Cox

Join Date: Mar 2014

Posts: 35658
#3

16 Dec 2022, 13:52

Cross-posted on Stack Overflow. Please note our policy on cross-posting, which is that you tell us about it.
Clyde Schechter

Join Date: Apr 2014

Posts: 30075
#4

16 Dec 2022, 14:06

Re #2: yes, but how will you "match" the villages with the long list? Unless the long list is actually a crosswalk between all possible variant spellings and the standard spelling, -merge- would just leave you with all spelling errors unmatched. The best tool for this fuzzy matching task in Stata is, as far as I know, Julio Raffo's -matchit-, available from SSC. The use of -matchit- is a bit complicated. You will need to invest some time reading the help file and gaining an understanding of how it works, and then it will take some trial and error to find option settings that give the best results for your data. In the end, it will pair up each observation in your data set with other observations that are plausible matches on state, city, and village. But in all likelihood, you are going to have to weed through those results by eye and separate "the chaff from the wheat." It is possible that despite your best efforts, there will be some ambiguous situations that cannot be resolved, though probably only a handful.
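
As a rough, untested sketch of what a first pass with -matchit- might look like on the example in #1: the file names keys_master.dta and keys_using.dta, the variable names key, id, key2 and id2, and the threshold value are all assumptions chosen for illustration, not recommended settings. The sketch replaces the data in memory, so save your data first.

Code:

```stata
* Sketch only. Assumes:
*   - the -dataex- example from #1 is in memory (and saved elsewhere)
*   - -matchit- and -freqindex- are installed:  ssc install matchit
*                                               ssc install freqindex
*   - keys_master.dta and keys_using.dta are hypothetical file names

* Combine the three name variables into one key and strip blanks,
* so that "hd kote" and "hdkote" agree before any fuzzy matching.
gen key = subinstr(state + "|" + city + "|" + village, " ", "", .)

* Keep one row per distinct key and give each a numeric id.
keep key
duplicates drop
gen long id = _n
save keys_master.dta, replace

* -matchit- needs a second file with differently named variables.
rename (id key) (id2 key2)
save keys_using.dta, replace

* Compare every key with every other key; pairs scoring below
* the threshold are dropped.
use keys_master.dta, clear
matchit id key using keys_using.dta, idusing(id2) txtusing(key2) threshold(0.5)

* Drop self-matches and mirror-image pairs, then review by eye.
drop if id >= id2
gsort -similscore
list key key2 similscore, noobs
```

Because state and city make up most of each key, pairs from the same city will tend to score high regardless of the village, so the threshold and the choice of what goes into the key are exactly the kind of settings that need the trial and error described above.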
Patrick Que

Join Date: Jul 2020

Posts: 13
#5

16 Dec 2022, 16:22

Hi Nick, I apologize for this. I will be more thorough from now on.