  • Making best guesses on race/ethnicity/gender

    Does anybody know of any programs in Stata (or other software) that will make best guesses as to race/ethnicity/gender based on a person's first and last names? Limited geographic info is also available, e.g. zipcode, county. I'm primarily interested in identifying Hispanics.
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    Stata Version: 17.0 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam

  • #2
    Perhaps making use of the Census Bureau list of Spanish Surnames would be a starting point. Here's a 10-year-old paper that apparently makes use of the then-current version of the list. I couldn't bear to try to find a link to the list itself on the Census Bureau web site.

    http://www.thecre.com/insurance/wp-c...CFPB-Paper.pdf
    Last edited by William Lisowski; 25 Nov 2017, 10:19. Reason: Forgot to include the link.

    • #3
      Thanks! This looks promising:

      https://www.census.gov/data/develope.../surnames.html

      • #4
        Richard, this is a very interesting question! I have a feeling that it may be hard to find free software or Stata code for this task.

        The U.S. Census Bureau appears to have made lists of Spanish surnames, probably to help conduct tasks like these. Some links are below - the last link showed up on a Google search, but I can't verify that it is genuinely from the Census Bureau. The links may help you with resources to identify individuals of Hispanic descent, but additional programming on your end would be required.

        https://www.census.gov/population/ww.../twps0004.html
        https://www.census.gov/population/do...on/twpno13.pdf
        https://fcds.med.miami.edu/downloads...20Surnames.pdf

        Of note, the second source above lists the 639 most common Hispanic surnames in the US, which the authors think to be acceptable for some uses by itself (even if it won't capture every person of Hispanic ancestry in a list).
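The "additional programming" for a surname list can be fairly small. Here is a minimal sketch, assuming the list has already been saved as a Stata dataset with one row per surname. The two surnames, the variable names (lastname, surname, hispanic_guess) and the tempfile are placeholders for illustration, not entries or names taken from the Census files.

```stata
* Toy surname list; in practice this would be the Census list (hypothetical here)
clear
input str20 surname
"GARCIA"
"RODRIGUEZ"
end
tempfile surnamelist
save `surnamelist'

* Toy person-level data with inconsistent case and spacing
clear
input str20 lastname
"Garcia"
"smith"
" Rodriguez"
"Williams"
end

* Normalize case and whitespace so the keys match the list
generate str20 surname = upper(strtrim(lastname))

* Many people can share one surname, so this is a many-to-one merge
merge m:1 surname using `surnamelist', keep(master match)
generate byte hispanic_guess = (_merge == 3)
drop _merge
list lastname hispanic_guess
```

A flag built this way only says that the surname is on the list, so it is better suited to aggregate comparisons than to statements about any one person.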

        In Medicare data, the race/ethnicity information we have in the beneficiary enrollment files is not always correct, due to past limitations in how the Social Security Administration asked individuals to report their race/ethnicity. Research Triangle Institute does supply an 'enhanced' race/ethnicity classification based on surname lists, and their approach is described here. A review paper that describes the use of both geocoding and surname lists is here. Given the amount of work that's involved apart from getting the surname list, I don't

        Last, you no doubt are aware that using surnames and geographic location to infer individual race/ethnicity may be useful for aggregate analyses, but it gets dicier when it applies to individuals. I'll relate an anecdote. One of my former co-workers was a woman with the surname Tan. She was White, which surprised me when she came to work. Turns out her husband was Malaysian Chinese - Tan is the Hokkien romanization of the Chinese surname Chen, and it's probably the most common southeast Asian Chinese surname. One of my wife's cousins married a Korean American guy and took on the surname Kim, which is the most common Korean surname. My wife, by mutual agreement, didn't take on my surname (which I usually just spell out, but after doing so, I frequently get asked "could you spell that?").
        Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

        When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

        • #5
          See here Richard: https://genderize.io/

          It's a pretty neat product that you can access through their API, and it should be quite possible to write a Stata program that calls it. A cool feature is that it assigns a probability of how accurate it thinks its 'guess' is. So for Peter you get male with probability 1, and for Jamie you get male with probability 0.53.

          • #6
            Chris Larkin, this looks cool. But at the risk of sounding really stupid, how would I actually give a command? Do I have to be running JSON (whatever that is)? Or do I download some program? Or what? Thanks.

            ADDED: OK, I tried typing this in a browser window

            https://api.genderize.io/?name=pat

            and it returned

            {"name":"pat","gender":"female","probability":0.6, "count":1242}

            Is there some easier way, or some program that makes it easier?

            • #7
              The code below works. Sometimes it returns errors I'm not familiar with, and my guess is that they have to do with the limits on the API. If you get an error, wait ten seconds and try again (that sorted it out for me).

              If you have a long list of names, it could be worth building in some kind of rest (e.g. sleep 1000) on each pass of the loop.


              Code:
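              clear all
              tempfile genderized
              save `genderized', emptyok
              
              local names Jeff Bob Alan Georgina Mary Hannah
              foreach name of local names {
                  * Fetch the JSON response for one name and save it as a text file
                  copy "https://api.genderize.io/?name=`name'" "`name'.csv", text replace
                  * Read the one-line response as comma-separated fields v1-v4
                  import delimited "`name'.csv", varnames(nonames) encoding(ISO-8859-1) clear
                  rename (v1 v2 v3 v4) (name gender probability count)
                  * Stack this name on top of the results collected so far
                  append using `genderized'
                  save `genderized', replace
                  * Rest for one second between requests
                  sleep 1000
              }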
              And this cleans up the strings:

              Code:
              replace name = trim(substr(name, strpos(name, ":")+2, strlen(name)-(strpos(name, ":")+2)))
              replace gender = trim(subinstr(gender, `"gender":""', "",.))
              replace probability = trim(substr(probability, strpos(probability, ":")+1,.))
              replace count = trim(substr(count, strpos(count, ":")+1, strlen(count)-(strpos(count, ":")+1)))
              compress _all
              Last edited by Chris Larkin; 25 Nov 2017, 20:14. Reason: Adding string cleaning
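An alternative to the substr()/strpos() arithmetic is to pull each field out with a regular expression. This is only a sketch under assumptions: it takes the response as one string (here, the line quoted earlier in the thread), it assumes no field value contains a comma, and the variable names raw and clean are placeholders. The quotes and braces are stripped first so the patterns themselves need no quoting.

```stata
* One genderize.io-style response held in a string variable (sample from the thread)
clear
set obs 1
generate str100 raw = `"{"name":"pat","gender":"female","probability":0.6, "count":1242}"'

* Strip double quotes (char(34)) and braces, leaving key:value pairs
generate str100 clean = subinstr(raw, char(34), "", .)
replace clean = subinstr(subinstr(clean, "{", "", .), "}", "", .)

* Capture everything after each key up to the next comma
generate str20 name = ustrregexs(1) if ustrregexm(clean, "name:([^,]*)")
generate str10 gender = ustrregexs(1) if ustrregexm(clean, "gender:([^,]*)")
generate double probability = real(ustrregexs(1)) if ustrregexm(clean, "probability:([^,]*)")
generate long count = real(ustrregexs(1)) if ustrregexm(clean, "count:([^,]*)")

list name gender probability count
```

With this approach a response whose gender is null would leave the text "null" in gender, so that case would still need its own handling.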

              • #8
                Fantastic. My data set has 78000 cases, but I assume there are far fewer first names than that.

                It would be nice if I could just extract a data set of first names for the US, and then just do a regular merge with first name as the matching variable. But it seems a bit more difficult than that.
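Something close to the "regular merge" idea can be sketched as follows: reduce the data to one row per distinct first name, look up only those names, then merge the results back onto every person. The toy data, the variable names (id, firstname, name_key, n_people) and the stand-in gender column are all hypothetical. In real use the gender column would come from the API loop rather than being filled in by hand.

```stata
* Toy person-level data; variable names are placeholders
clear
input int id str20 firstname
1 "Mary"
2 "Jeff"
3 "mary"
4 "Jeff"
5 "Hannah"
end
generate str20 name_key = lower(strtrim(firstname))
tempfile people
save `people'

* One row per distinct first name: this is the list to look up
contract name_key, freq(n_people)
count
display "Distinct names to look up: " r(N)

* Stand-in for the lookup result (would come from the API in practice)
generate str10 gender = "unknown"
tempfile lookup
save `lookup'

* Attach the per-name result back onto every person
use `people', clear
merge m:1 name_key using `lookup', assert(match) nogenerate
list id firstname gender
```

The number of API calls then depends on the number of distinct names rather than on the number of cases.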

                Thanks to everyone for their advice. I haven't tried these ideas yet, but will hopefully do so this week. I will get back with results or questions. The results don't have to be super-accurate, but I am hoping to see if there appear to be differences in treatment by ethnicity and/or gender. If so, we can dig deeper.

                • #9
                  Hi Richard,
                  I see this is from last year, but I figured I'd chime in in case anyone else comes across this.
                  The CFPB puts out Stata code that can use surnames AND zipcode-level demographics from the Census Bureau to assign racial probabilities. In my experience of running it against self-reported data, it works very well (assuming that the racial characteristics of the population you're looking at are similar to those of the overall population).
                  You can find the code here:
                  https://github.com/cfpb/proxy-methodology
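To give a feel for how surname and geography can be combined, here is a toy Bayes-rule update for a single person. It is not taken from the CFPB repository and makes no claim about how that code is organized. The two groups, the probabilities and the variable names are invented for illustration only.

```stata
* Hypothetical inputs for one person:
*   p_surname = P(group | surname), from a surname table
*   p_geo     = P(lives in this ZIP | group), from Census counts
clear
input str12 group double p_surname double p_geo
"hispanic"     0.90 0.002
"nonhispanic"  0.10 0.001
end

* Bayes' rule: multiply the two pieces, then rescale to sum to 1
generate double joint = p_surname * p_geo
egen double total = total(joint)
generate double posterior = joint / total

list group p_surname posterior
```

In this made-up case the geography nudges the surname-based probability for the first group upward, because that group is relatively more concentrated in the person's ZIP code.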

                  • #10
                    I tweaked Chris Larkin's code to handle cases where the API has no data:

                    Code:
                    clear all
                    tempfile genderized
                    save `genderized', emptyok
                    
                    local names "Alice Bob Cthulhu"
                    
                    foreach name of local names{
                        copy "https://api.genderize.io/?name=`name'" "`name'.csv", text replace
                        import delimited "`name'.csv", varnames(nonames) encoding(ISO-8859-1) clear
                        
                        /* Deal with unknowns */
                        if v2==`"gender":null}"' {
                            replace v2="unknown"
                            gen v3=""
                        }
                            
                        capture drop v4
                            
                        rename (v1 v2 v3) (name gender probability)
                        append using `genderized'
                        save `genderized', replace
                    }
                    
                    replace name = trim(substr(name, strpos(name, ":")+2, strlen(name)-(strpos(name, ":")+2)))
                    replace gender = trim(subinstr(gender, `"gender":""', "",.))
                    replace probability = trim(substr(probability, strpos(probability, ":")+1,.))
                    compress _all
                    
                    * strrec is a user-written command; install it with: ssc install strrec
                    strrec gender ("male" = 0 "Male") ("female" = 1 "Female") ("unknown" = 2 "Unknown"), replace
                    
                    destring probability, replace
                    
                    tab gender [iw = probability]
