codes to spell out full name based on initial

ji zhou

Join Date: Jul 2014

Posts: 46
#1

01 Aug 2014, 15:14

Hello Stata experts, please help me with a recoding challenge. The dataset contains multiple observations per name. For the same person, some observations have the last and first names fully spelled out, while others have the last name spelled out and only the initial of the first name. It looks like the following:

lastname firstname
smith michael
smith m
smith m
smith michael
johnson l
johnson l
johnson linda
johnson l

How do I recode firstname so that each observation has the first name spelled out? The dataset has about 200 names and 2200 observations.

Thank you so much!
ben earnhart

Join Date: May 2014

Posts: 1027
#2

01 Aug 2014, 15:35

well, if we can be *certain* that "smith m" is Smith Michael, it could be as simple as:

Code:

gen fnlength=length(firstname)
gsort lastname -fnlength
replace firstname=firstname[_n-1] if lastname==lastname[_n-1]

but this makes some heroic assumptions, including that last names are unique.
ben earnhart

Join Date: May 2014

Posts: 1027
#3

01 Aug 2014, 16:11

ps. I hope you have an ID variable apart from the names. In which case,

Code:

gen fnlength=length(firstname)
gsort realID lastname -fnlength
replace firstname=firstname[_n-1] if realID==realID[_n-1]

is better.

Last edited by ben earnhart; 01 Aug 2014, 16:13.
Richard Williams

Join Date: Apr 2014

Posts: 5008
#4

01 Aug 2014, 16:33

Here is another heroic approach:

Code:

clear all
input str20 lastname str20 firstname
smith michael
smith m
smith m
smith michael
johnson l
johnson l
johnson linda
johnson l
end
bysort lastname (firstname): gen newfirst = firstname[_N]
list

Only the bysort command is actually needed once you have the data.

Other spelling inconsistencies could screw up any of these approaches, e.g. Mike, Michael, Mick.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
ji zhou

Join Date: Jul 2014

Posts: 46
#5

04 Aug 2014, 08:52

Ben and Richard, thank you both so much! Yes, all three approaches worked for people with unique last names. There are a few whose last names are identical, and a few whose last names and first-name initials are identical. Is there any way to handle such a situation? There is no ID variable. Given that this dataset is not large, I can do it manually, but I'd love to learn if there are cleverer ways to handle similar situations with a large dataset. Many thanks! Any advice is appreciated!
Sarah Edgington

Join Date: Apr 2014

Posts: 284
#6

04 Aug 2014, 09:10

For the cases where the last name and first initial are identical, how do you know which full first name to use? If you can carefully describe the rules that you use to make the decision when doing the correction manually we can probably help you program it. Without more information, though, it seems like whatever choice you make to assign full names in those instances would be arbitrary.

Richard Williams

Join Date: Apr 2014
Posts: 5008
#7

04 Aug 2014, 09:33

I would be nervous about using an automated solution under the conditions you describe. Even if, as Sarah says, you could figure out the rules, it might take you far longer to program them than it would take to just fix the data manually.

Tweaking my earlier code,

Code:

clear all
input str20 lastname str20 firstname
smith michael
smith m
smith m
smith michael
johnson l
johnson l
johnson linda
johnson l
davis r
davis rich
davis robert
end
gen initial = substr(firstname,1,1)
bysort lastname initial (firstname): gen newfirst = firstname[_N]
list, sepby(lastname)

you see that it works ok when lastname & first initial are unique, but it breaks down when they aren't. I guess you could identify the breakdowns by looking at the listing; but I can't guarantee that there aren't other problems I am overlooking.

Code:

. list, sepby(lastname)

     +------------------------------------------+
     | lastname   firstn~e   initial   newfirst |
     |------------------------------------------|
  1. |    davis          r         r     robert |
  2. |    davis       rich         r     robert |
  3. |    davis     robert         r     robert |
     |------------------------------------------|
  4. |  johnson          l         l      linda |
  5. |  johnson          l         l      linda |
  6. |  johnson          l         l      linda |
  7. |  johnson      linda         l      linda |
     |------------------------------------------|
  8. |    smith          m         m    michael |
  9. |    smith          m         m    michael |
 10. |    smith    michael         m    michael |
 11. |    smith    michael         m    michael |
     +------------------------------------------+


ji zhou

Join Date: Jul 2014

Posts: 46
#8

04 Aug 2014, 09:50

Hi Sarah, I do that by looking at the courses they teach as listed on their CVs (the dataset is about faculty course evaluations). So, for example, for two faculty members with the same last name and same initial, I check the courseid and semester listed on their CVs.

Last First Courseid Semester
James Patrick EDUC100 Fall2010
James Patrick EDUC300 Spring2011
James Peter EDUC100 Spring2011
James Peter EDUC200 Fall2010
James P EDUC600 Spring2011
James P EDUC100 Fall2011

By checking their CVs, I was able to know the first James P was Patrick and the second was Peter. I almost feel doing this manually is the only viable option. But I hope I am wrong.
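
A rough sketch of how that CV lookup could be applied in bulk, assuming the CV information is first typed into a small crosswalk keyed on last name, course, and semester. The crosswalk and the variable name cvfirst are hypothetical; the two James P assignments are the ones described above.

```stata
clear all
* Hypothetical hand-built crosswalk from the CVs: one row per ambiguous record
input str20 lastname str10 courseid str12 semester str20 cvfirst
James EDUC600 Spring2011 Patrick
James EDUC100 Fall2011 Peter
end
tempfile crosswalk
save `crosswalk'

clear
input str20 lastname str20 firstname str10 courseid str12 semester
James Patrick EDUC100 Fall2010
James Patrick EDUC300 Spring2011
James Peter EDUC100 Spring2011
James Peter EDUC200 Fall2010
James P EDUC600 Spring2011
James P EDUC100 Fall2011
end

* Attach the CV-based first name wherever the crosswalk has a matching row
merge m:1 lastname courseid semester using `crosswalk', keep(master match) nogenerate
* Only overwrite records that carry a bare initial
replace firstname = cvfirst if length(firstname) == 1 & !missing(cvfirst)
drop cvfirst
list
```

Someone still has to read the CVs; the merge only saves retyping. The key also cannot distinguish two people with the same last name who taught the same course in the same semester.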
Richard Williams

Join Date: Apr 2014

Posts: 5008
#9

04 Aug 2014, 10:02

With my code P, Patrick and Peter would all get coded as Peter. You could visually identify such cases. Robert and Bob would sneak by you though.

The last line of my code would be better as

Code:

list, sepby(lastname initial)

I suppose you could add additional error checking code if you have to do this a lot or have thousands of records.
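
One possible form of such error checking, offered as a sketch rather than a tested solution: flag any record whose fully spelled first name was replaced by a different one, since that suggests two people share a last name and initial. The variable names overwritten and suspect are made up for illustration.

```stata
clear all
input str20 lastname str20 firstname
smith michael
smith m
davis r
davis rich
davis robert
end
gen initial = substr(firstname, 1, 1)
bysort lastname initial (firstname): gen newfirst = firstname[_N]
* A full first name replaced by a different one signals two people in a group
gen byte overwritten = (length(firstname) > 1) & (firstname != newfirst)
* Spread the flag so the whole group is listed for review
bysort lastname initial: egen byte suspect = max(overwritten)
list if suspect, sepby(lastname initial)
```

This would still miss Robert versus Bob, because those names fall into different initial groups.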

ji zhou

Join Date: Jul 2014

Posts: 46
#10

04 Aug 2014, 10:05

Thank you Richard. What I plan to do is generate the full first name for those with a unique last name and initial, then check manually for errors among those with the same last name and same initial.
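
That plan might look roughly like the following, assuming a first name longer than one character counts as fully spelled out. The helper variables isfull, firstfull, nfull, and needsreview are hypothetical names, and the sample rows are taken from earlier in the thread.

```stata
clear all
input str20 lastname str20 firstname
smith michael
smith m
johnson l
johnson linda
james patrick
james peter
james p
end
gen initial = substr(firstname, 1, 1)
gen byte isfull = length(firstname) > 1
* Mark one record per distinct full first name within lastname + initial
bysort lastname initial firstname: gen byte firstfull = isfull & (_n == 1)
* Count the distinct full first names available in each group
bysort lastname initial: egen nfull = total(firstfull)
* Candidate expansion: the last first name in sort order within the group
bysort lastname initial (firstname): gen newfirst = firstname[_N]
* Keep the original unless it is a bare initial with exactly one candidate
replace newfirst = firstname if isfull | (nfull != 1)
* Bare initials with zero or several candidates are left for manual checking
gen byte needsreview = !isfull & (nfull != 1)
list, sepby(lastname initial)
```

Here smith m and johnson l are expanded, while james p is left untouched and flagged in needsreview.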
Richard Williams

Join Date: Apr 2014

Posts: 5008
#11

04 Aug 2014, 10:19

It occurs to me that the above code (or at least mine) will still cause Patrick and Peter to get recoded as Peter. It could probably be fixed though, e.g. only change the name when the first name is only one letter long.

ji zhou

Join Date: Jul 2014

Posts: 46
#12

04 Aug 2014, 10:43

Thank you Richard! You are very helpful!

Richard Williams

Join Date: Apr 2014
Posts: 5008

#13

04 Aug 2014, 11:02

This may be slightly better. It only changes the first name if the original first name was only a single letter (but that will create problems if the first name was actually the first two initials). It also creates a variable called namechange that lets you easily see what records were changed.

Code:

clear all
input str20 lastname str20 firstname
smith michael
smith m
smith m
smith michael
johnson l
johnson l
johnson linda
johnson l
davis r
davis rich
davis robert
james patrick
james peter
james p
end
* Preserve original ordering of cases in case it is needed
gen nrec = _n
gen initial = substr(firstname,1,1)
bysort lastname initial (firstname): gen newfirst = firstname[_N]
* Only change the first name if an initial only was used
* This will create different problems if two initials were used!
replace newfirst = firstname if length(firstname) > 1
gen namechange = firstname != newfirst
list, sepby(lastname initial)
list if namechange

Code:

. list, sepby(lastname initial)

     +------------------------------------------------------------+
     | lastname   firstn~e   nrec   initial   newfirst   namech~e |
     |------------------------------------------------------------|
  1. |    davis          r      9         r     robert          1 |
  2. |    davis       rich     10         r       rich          0 |
  3. |    davis     robert     11         r     robert          0 |
     |------------------------------------------------------------|
  4. |    james          p     14         p      peter          1 |
  5. |    james    patrick     12         p    patrick          0 |
  6. |    james      peter     13         p      peter          0 |
     |------------------------------------------------------------|
  7. |  johnson          l      5         l      linda          1 |
  8. |  johnson          l      8         l      linda          1 |
  9. |  johnson          l      6         l      linda          1 |
 10. |  johnson      linda      7         l      linda          0 |
     |------------------------------------------------------------|
 11. |    smith          m      3         m    michael          1 |
 12. |    smith          m      2         m    michael          1 |
 13. |    smith    michael      4         m    michael          0 |
 14. |    smith    michael      1         m    michael          0 |
     +------------------------------------------------------------+

. list if namechange

     +------------------------------------------------------------+
     | lastname   firstn~e   nrec   initial   newfirst   namech~e |
     |------------------------------------------------------------|
  1. |    davis          r      9         r     robert          1 |
  4. |    james          p     14         p      peter          1 |
  7. |  johnson          l      5         l      linda          1 |
  8. |  johnson          l      8         l      linda          1 |
  9. |  johnson          l      6         l      linda          1 |
     |------------------------------------------------------------|
 11. |    smith          m      3         m    michael          1 |
 12. |    smith          m      2         m    michael          1 |
     +------------------------------------------------------------+


ji zhou

Join Date: Jul 2014

Posts: 46
#14

06 Aug 2014, 16:36

Thanks Richard! I didn't know the bysort command until I read your code. Very helpful. Many thanks.