Making best guesses on race/ethnicity/gender

Richard Williams

Join Date: Apr 2014

Posts: 5008
#1

25 Nov 2017, 09:24

Does anybody know of any programs in Stata (or other software) that will make best guesses as to race/ethnicity/gender based on a person's first and last names? Limited geographic info is also available, e.g. zipcode, county. I'm primarily interested in identifying Hispanics.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

25 Nov 2017, 10:18

Perhaps making use of the Census Bureau list of Spanish Surnames would be a starting point. Here's a 10-year-old paper that apparently makes use of the then-current version of the list. I couldn't bear to try to find a link to the list itself on the Census Bureau web site.

http://www.thecre.com/insurance/wp-c...CFPB-Paper.pdf

Last edited by William Lisowski; 25 Nov 2017, 10:19. Reason: Forgot to include the link.
Richard Williams

Join Date: Apr 2014

Posts: 5008
#3

25 Nov 2017, 10:27

Thanks! This looks promising:

https://www.census.gov/data/develope.../surnames.html

Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#4

25 Nov 2017, 10:39

Richard, this is a very interesting question! I have a feeling that it may be hard to find free software or Stata code for this task.

The U.S. Census Bureau appears to have made lists of Spanish surnames, probably to help conduct tasks like these. Some links are below - the last link showed up on a Google search, but I can't verify that it is genuinely from the Census Bureau. The links may help you with resources to identify individuals of Hispanic descent, but additional programming on your end would be required.

https://www.census.gov/population/ww.../twps0004.html
https://www.census.gov/population/do...on/twpno13.pdf
https://fcds.med.miami.edu/downloads...20Surnames.pdf

Of note, the second source above lists the 639 most common Hispanic surnames in the US, which the authors think to be acceptable for some uses by itself (even if it won't capture every person of Hispanic ancestry in a list).
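Since the original question allowed for other software, here is a minimal Python sketch of the "additional programming" a surname-list lookup would need. The three surnames are an illustrative placeholder, not the Census list, and the function names are made up for this example.

```python
import unicodedata

# Illustrative placeholder: three common Spanish surnames, NOT the Census list.
SPANISH_SURNAMES = {"GARCIA", "RODRIGUEZ", "MARTINEZ"}


def normalize(surname: str) -> str:
    """Strip accents, drop anything that is not a letter, and upper-case."""
    decomposed = unicodedata.normalize("NFKD", surname)
    return "".join(ch for ch in decomposed if ch.isalpha()).upper()


def on_surname_list(surname: str) -> bool:
    """True if the normalized surname appears on the list."""
    return normalize(surname) in SPANISH_SURNAMES


people = ["Garcia", "GARCÍA ", "Tan", "Martinez"]
flags = {p: on_surname_list(p) for p in people}
print(flags)
```

Normalizing before the lookup matters because real name fields tend to vary in case, accents, and stray spaces.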

In Medicare data, the race/ethnicity information we have in the beneficiary enrollment files is not always correct, due to past limitations in how the Social Security Administration asked individuals to report their race/ethnicity. Research Triangle Institute does supply an 'enhanced' race/ethnicity classification based on surname lists, and their approach is described here. A review paper that describes the use of both geocoding and surname lists is here. Given the amount of work that's involved apart from getting the surname list, I don't

Last, you no doubt are aware that using surnames and geographic location to infer individual race/ethnicity may be useful for aggregate analyses, but it gets dicier when it applies to individuals. I'll relate an anecdote. One of my former co-workers was a woman with the surname Tan. She was White, which surprised me when she came to work. Turns out her husband was Malaysian Chinese - Tan is the Hokkien romanization of the Chinese surname Chen, and it's probably the most common southeast Asian Chinese surname. One of my wife's cousins married a Korean American guy and took on the surname Kim, which is the most common Korean surname. My wife, by mutual agreement, didn't take on my surname (which I usually just spell out, but after doing so, I frequently get asked "could you spell that?").

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Chris Larkin

Join Date: Apr 2016

Posts: 296
#5

25 Nov 2017, 11:12

See here Richard: https://genderize.io/

It's a pretty neat product that you can access through their API. It's very possible to write a Stata program that calls it. A cool feature is that it assigns a probability reflecting how accurate it thinks its 'guess' is. So for Peter you get male with probability 1, and for Jamie you get male with probability 0.53.
Richard Williams

Join Date: Apr 2014

Posts: 5008
#6

25 Nov 2017, 18:23

Chris Larkin, this looks cool. But at the risk of sounding really stupid, how would I actually give a command? Do I have to be running JSON (whatever that is)? Or do I download some program? Or what? Thanks.

ADDED: OK, I tried typing this in a browser window

https://api.genderize.io/?name=pat

and it returned

{"name":"pat","gender":"female","probability":0.6, "count":1242}

Is there some easier way, or some program that makes it easier?
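For what it's worth, JSON is only a text format for the reply, so nothing has to be "running". As a minimal sketch outside Stata, Python's standard `json` module turns the text returned above into named fields. The variable names are just for illustration.

```python
import json

# Response text copied from the browser test above.
raw = '{"name":"pat","gender":"female","probability":0.6, "count":1242}'

# json.loads converts the JSON object into a Python dict.
record = json.loads(raw)

print(record["name"], record["gender"], record["probability"], record["count"])
```

Each key in the reply becomes a dictionary key, so no string trimming is needed.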


Chris Larkin

Join Date: Apr 2016
Posts: 296
#7

25 Nov 2017, 19:37

The below works. Sometimes it returns errors I'm not familiar with, and my guess is that they have to do with the limits on the API. If you get an error, wait ten seconds and then try again (that sorted it out for me).

If you have a long list of names, it could be worth building in some kind of rest (e.g. sleep 1000) on each pass of the loop.

Code:

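clear all
* Start with an empty results file that each pass of the loop appends to
tempfile genderized
save `genderized', emptyok

local names Jeff Bob Alan Georgina Mary Hannah
foreach name of local names {
    * The API returns one line of JSON; save it to disk (the .csv extension is only a label)
    copy "https://api.genderize.io/?name=`name'" "`name'.csv", text replace
    * Splitting that line on commas puts one "key":value pair in each of v1-v4
    import delimited "`name'.csv", varnames(nonames) encoding(ISO-8859-1) clear
    rename (v1 v2 v3 v4) (name gender probability count)
    append using `genderized'
    save `genderized', replace
}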

And this cleans up the strings

Code:

replace name = trim(substr(name, strpos(name, ":")+2, strlen(name)-(strpos(name, ":")+2)))
replace gender = trim(subinstr(gender, `"gender":""', "",.))
replace probability = trim(substr(probability, strpos(probability, ":")+1,.))
replace count = trim(substr(count, strpos(count, ":")+1, strlen(count)-(strpos(count, ":")+1)))
compress _all
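The "rest on each pass" and "wait and try again" ideas can be sketched in Python as a small wrapper. This is only an illustration: `lookup_all`, its parameters, and the stand-in `flaky_fetch` are hypothetical names, and the real HTTP call is replaced by a stub so the sketch runs offline.

```python
import time
from typing import Callable, Dict, Iterable


def lookup_all(names: Iterable[str],
               fetch: Callable[[str], dict],
               pause: float = 1.0,
               retry_wait: float = 10.0,
               max_tries: int = 3) -> Dict[str, dict]:
    """Call fetch(name) for each name, pausing between names and retrying on errors."""
    results = {}
    for name in names:
        for attempt in range(1, max_tries + 1):
            try:
                results[name] = fetch(name)
                break
            except Exception:
                # Broad catch for brevity; give up after max_tries attempts.
                if attempt == max_tries:
                    raise
                time.sleep(retry_wait)
        time.sleep(pause)  # rest before the next name
    return results


# Stand-in for a real API call: fails on the first call, then succeeds.
calls = {"n": 0}


def flaky_fetch(name: str) -> dict:
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated API error")
    return {"name": name}


out = lookup_all(["Jeff", "Mary"], flaky_fetch, pause=0, retry_wait=0)
print(out)
```

The waits are set to zero here only so the demonstration finishes immediately; against a live API the defaults (or longer) would be the safer choice.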

Last edited by Chris Larkin; 25 Nov 2017, 20:14. Reason: Adding string cleaning

Richard Williams

Join Date: Apr 2014

Posts: 5008
#8

26 Nov 2017, 08:27

Fantastic. My data set has 78000 cases, but I assume there are far fewer first names than that.

It would be nice if I could just extract a data set of first names for the US, and then just do a regular merge with first name as the matching variable. But it seems a bit more difficult than that.
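One way to get close to that, sketched here in Python with made-up records: reduce the data to its unique first names, look each one up once, and merge the answers back by name. The `lookup` table below is a placeholder for whatever the API (or a downloaded name list) would return.

```python
# Made-up records standing in for the real data set.
people = [
    {"id": 1, "first": "Mary"},
    {"id": 2, "first": "mary "},
    {"id": 3, "first": "Bob"},
    {"id": 4, "first": "Cthulhu"},
]


def key(name: str) -> str:
    """Matching key: trimmed and lower-cased first name."""
    return name.strip().lower()


# Only the unique names need to be sent to the API.
unique_names = sorted({key(p["first"]) for p in people})

# Placeholder results, one entry per unique name that was found.
lookup = {"mary": "female", "bob": "male"}

# Merge back: every record gets the result for its key, or "unknown".
merged = [{**p, "gender": lookup.get(key(p["first"]), "unknown")} for p in people]

print(unique_names)
print(merged)
```

With 78000 cases, the number of API calls then depends on the number of distinct first names, not the number of records.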

Thanks to everyone for their advice. I haven't tried these ideas yet, but will hopefully do so this week. I will get back with results or questions. The results don't have to be super-accurate, but I am hoping to see if there appear to be differences in treatment by ethnicity and/or gender. If so, we can dig deeper.

Kevin Morris

Join Date: Jun 2018

Posts: 3
#9

20 Jun 2018, 13:55

Hi Richard,
I see this is from last year, but I figured I'd chime in in case anyone else comes across this.
The CFPB puts out Stata code that can use surnames AND zipcode-level demographics from the Census Bureau to assign racial probabilities. In my experience of running it against self-reported data, it works very well (assuming that the racial characteristics of the population you're looking at are similar to those of the overall population).
You can find the code here:
https://github.com/cfpb/proxy-methodology
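As I understand the idea, the surname-based probabilities are updated with the geographic information through Bayes' rule. The Python sketch below shows only that arithmetic, with invented numbers and two made-up categories; the function name is hypothetical and the repository above remains the reference for the actual method and data.

```python
def combine(p_race_given_surname: dict, p_geo_given_race: dict) -> dict:
    """Posterior P(race | surname, geography), proportional to
    P(race | surname) * P(geography | race)."""
    joint = {r: p_race_given_surname[r] * p_geo_given_race[r]
             for r in p_race_given_surname}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("no category has positive probability")
    return {r: v / total for r, v in joint.items()}


# Invented inputs: share of people with this surname in each group ...
surname = {"hispanic": 0.80, "other": 0.20}
# ... and share of each group's population living in this zipcode.
geo = {"hispanic": 0.001, "other": 0.004}

posterior = combine(surname, geo)
print(posterior)
```

In this made-up case the zipcode is one where the "other" group is relatively concentrated, so the surname-only estimate of 0.80 is pulled down to about 0.50.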

Dimitriy V. Masterov

Join Date: Mar 2014
Posts: 609

#10

13 Aug 2018, 10:45

I tweaked Chris Larkin's code to handle cases where the API has no data:

Code:

clear all
tempfile genderized
save `genderized', emptyok

local names "Alice Bob Cthulhu"

foreach name of local names{
    copy "https://api.genderize.io/?name=`name'" "`name'.csv", text replace
    import delimited "`name'.csv", varnames(nonames) encoding(ISO-8859-1) clear
    
    /* Deal with unknowns */
    if v2==`"gender":null}"' {
        replace v2="unknown"
        gen v3=""
    }
        
    capture drop v4
        
    rename (v1 v2 v3) (name gender probability)
    append using `genderized'
    save `genderized', replace
}

replace name = trim(substr(name, strpos(name, ":")+2, strlen(name)-(strpos(name, ":")+2)))
replace gender = trim(subinstr(gender, `"gender":""', "",.))
replace probability = trim(substr(probability, strpos(probability, ":")+1,.))
compress _all

* strrec is a user-written command; if it is missing, install it with: ssc install strrec
strrec gender ("male" = 0 "Male") ("female" = 1 "Female") ("unknown" = 2 "Unknown"), replace

destring probability, replace

tab gender [iw = probability]
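For readers outside Stata, a weighted tabulation like the last line amounts to summing the probabilities within each gender instead of counting names. Here is a minimal Python sketch using the values quoted earlier in the thread (Peter, Jamie, pat) plus one name with no data. Leaving out records that have no probability is a choice made for this sketch.

```python
from collections import defaultdict

# Values quoted earlier in the thread, plus one name the API has no data for.
records = [
    {"name": "peter", "gender": "male", "probability": 1.0},
    {"name": "jamie", "gender": "male", "probability": 0.53},
    {"name": "pat", "gender": "female", "probability": 0.6},
    {"name": "cthulhu", "gender": None, "probability": None},
]

weighted = defaultdict(float)
for r in records:
    if r["probability"] is None:
        continue  # no weight available, so the record is left out
    weighted[r["gender"]] += r["probability"]

for gender, total in sorted(weighted.items()):
    print(gender, round(total, 2))
```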
