Identification of gender from names through machine learning methods

Ataullah Khan

Join Date: Jun 2017
Posts: 41

23 Jul 2019, 01:02

Hi,

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str37 name str6 gender
"Miss Najam Asma"     ""      
"Abdul Jabar"         ""      
"Kalsoom Begum"       ""      
"Noor Saeed Badshah"  ""      
"Sumaira"             ""      
"Beena Gul"           ""      
"Farkhanda Jabeen"    ""      
"Maqsoom Shah"        "Female"
"Shabeena"            ""      
"Ghani Rehman"        ""      
"Faizan Khan"         ""      
"Sami Ullah Khan"     "Male"  
"Mir Atta Ullah Khan" "Male"  
"Rana Begum"          "Female"
"Shafqat Ullah Khan"  "Male"  
"Nihada Begum"        "Female"
"Qaisar Zaman"        ""      
"Nijatullah"          "Male"  
"Nizakat Begum"       "Female"
"Nazir Ullah Khan"    "Male"  
"Wahab Ullah"         "Male"  
"Nadeem Khan"         "Male"  
"Rafia Khattak"       "Female"
"Abid Shah"           "Male"  
"Anwar Ullah Khan"    "Male"  
"Nasreen  Akhtar"     ""      
"Dilshada Bibi"       "Female"
"Waqar Un Nisa"       "Female"
"Nawab Khan"          "Male"  
"Sania Sattar"        "Female"
end

Above is a snapshot of my data set. The gender column has missing values. I want to predict whether a person is male or female based on the name given in the left column.
Is there a way to implement a machine learning method for this, and how should I go about it?

Grateful

Tags: machine learning, regex, stata

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#2

23 Jul 2019, 01:18

Ataullah:
this sounds more like a cultural than a statistical question, as one has to know which names in your dataset are given to boys and girls (admittedly, I cannot say).
Moreover, a name may be given to boys only in some countries but to both boys and girls in others (I do not know whether this applies to your example, though). An example that springs to mind is the name Andrea, which in Italy is given to boys (although some exceptions to this rule have emerged in the recent past), whereas in Germany it can be given to boys and girls.

Kind regards,
Carlo
(Stata 19.0)
Ataullah Khan

Join Date: Jun 2017

Posts: 41
#3

23 Jul 2019, 01:30

Thanks Carlo Lazzaro. I understood what you said.

Indeed, I know which names are given to boys and which to girls, and about 60% of my observations have gender recorded against the name.
Now, how do I use that 60% of the data, combined with my own understanding of the naming system, to train a model in Stata that predicts the remaining 40%?

One way is to do it manually, but that would take a huge amount of time: I have millions of observations.

Last edited by Ataullah Khan; 23 Jul 2019, 01:32.
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#4

23 Jul 2019, 02:12

Ataullah: knowledge of how language is used to indicate male/female is going to be hugely relevant, as Carlo says. In essence, you appear to have a missing data problem and wish to make imputations. What information do you have? Answer: what is embedded in (a) your str37 name variable + (b) cultural knowledge. Working with (a) alone, it appears that there are potentially 3 pieces of information there: (i) salutation (Miss, Mr, etc.; may not be present); (ii) first name; (iii) second name. [Well, that's all you've told us about: see paragraph at end.]

In this case, a first step would be to split your str37 name variable into these components and work from there. It seems that if (i) is present, then you know gender. A second step would be to use your existing cases with no missing values on gender to do the imputation. Imagine you write out those cases to a new file (and assume there are no measurement errors). Then merge that complete-case file back onto the records with missing gender, using (say) first name as the merge key. [I say first name because I am assuming this provides you the most gender-specific information, but that may differ by cultural context.] A similar idea would be to try to find an external database that records specific names and gender and use that for the merge.
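The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the variable names clean, title, firstname and known_gender and the tempfile lookup are hypothetical, the list of salutations is a guess that needs adapting, and the toy rows "Rana Gul" and "Nadeem Shah" are invented so that the merge has something to match. It uses strtrim() and stritrim(), so it assumes Stata 14 or later, and it must be run as a do-file because of the local macro.

```stata
* Toy data: rows 1-3 and 6 are from post #1; rows 4-5 are INVENTED for illustration
clear
input str37 name str6 gender
"Miss Najam Asma" ""
"Rana Begum"      "Female"
"Nadeem Khan"     "Male"
"Rana Gul"        ""
"Nadeem Shah"     ""
"Sumaira"         ""
end

* Split the name: (i) salutation if present, (ii) first name
generate str37 clean = lower(strtrim(stritrim(name)))
generate str4 title = ""
* Assumed list of salutations; extend it to whatever occurs in the real data
replace title = word(clean, 1) if inlist(word(clean, 1), "miss", "mrs", "ms", "mr")
generate str37 firstname = cond(title == "", word(clean, 1), word(clean, 2))

* Step 1: a salutation, when present, settles gender directly
replace gender = "Female" if gender == "" & inlist(title, "miss", "mrs", "ms")
replace gender = "Male"   if gender == "" & title == "mr"

* Step 2: build a lookup file from the complete cases
preserve
keep if gender != ""
bysort firstname gender: keep if _n == 1   // one row per first name and gender
bysort firstname: keep if _N == 1          // drop first names seen with both genders
keep firstname gender
rename gender known_gender
tempfile lookup
save `lookup'
restore

* Merge the lookup back on, using first name as the key
merge m:1 firstname using `lookup', keep(master match) nogenerate
replace gender = known_gender if gender == ""
drop known_gender clean
```

First names that never occur among the complete cases (here "Sumaira") stay missing, and first names observed with both genders are deliberately left out of the lookup rather than guessed.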

None of this involves machine learning! If you want to go further and use machine learning, I suspect you will have to do some reading first. Your situation is complicated because it's unclear (to me) what the nature of the imputation model would be. You want to predict gender using what information as predictors, and how? Also, do you have other information in your data set that might be used as a predictor? [Hypothetical example: suppose that in this cultural context, girls never leave home before age 25, but boys have to leave home at age 20. Then information about age and household composition might help predict gender.]
Ataullah Khan

Join Date: Jun 2017

Posts: 41
#5

23 Jul 2019, 02:32

Thank you. I do not have any other predictors.
daniel klein

Join Date: Mar 2014

Posts: 3860
#6

23 Jul 2019, 05:55

Originally posted by Carlo Lazzaro:

An example that springs to mind is the name Andrea, which in Italy is given to boys (although some exceptions to this rule have emerged in the recent past), whereas in Germany it can be given to boys and girls.

Actually, Andrea would usually be a female name in Germany; the male form would be Andreas. Nothing stops parents with an Italian background from naming their son Andrea, though.

This thread might be interesting.

Best
Daniel
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#7

23 Jul 2019, 09:14

Thanks Daniel for the clarification.

Kind regards,
Carlo
(Stata 19.0)
paulvonhippel

Join Date: Apr 2014

Posts: 502
#8

08 Aug 2019, 19:47

This has been done by the gender() package in R, though it may not cover the Muslim country that you shared names from:
https://cran.r-project.org/web/packa...ng-gender.html

The simple, brute-force strategy is to figure out the % male for each first name in the data, then use that to estimate the probability that someone with one of those names is male. Most names are unambiguous (Paul, Jane); some are ambiguous (Pat); some change genders over time (Hillary, Vivian), so you need to know the birth year as well as the name.

None of this involves any machine learning. If you have enough data, it's typically enough to assign gender to 95% of new cases.
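A minimal sketch of the brute-force strategy in Stata might look like this. The names firstname, male, p_male and guess are hypothetical, the 0.9 cutoff is an arbitrary assumption rather than a recommendation, and the toy rows and their labels are partly invented to show an unambiguous name, an ambiguous one and an unlisted one. It takes the first word of the name as the first name, so any salutation would have to be stripped beforehand. Run it as a do-file because of the local macro.

```stata
* Toy data: names and labels are partly INVENTED for illustration only
clear
input str37 name str6 gender
"Rana Begum"         "Female"
"Rana Bibi"          "Female"
"Nadeem Khan"        "Male"
"Noor Khan"          "Male"
"Noor Bibi"          "Female"
"Rana Gul"           ""
"Nadeem Shah"        ""
"Noor Saeed Badshah" ""
"Sumaira"            ""
end

* First word of the name, taken here as the first name
generate str37 firstname = lower(word(name, 1))

* 1 = male, 0 = female, missing when gender is unknown
generate byte male = gender == "Male" if gender != ""

* Share male among the labelled cases that carry each first name
bysort firstname: egen double p_male = mean(male)

* Assign only when the share is decisive; the cutoff is an assumed value
local cut 0.9
generate str6 guess = gender
replace guess = "Male"   if gender == "" & p_male >= `cut' & !missing(p_male)
replace guess = "Female" if gender == "" & p_male <= 1 - `cut'
```

The !missing(p_male) check matters because Stata treats missing as larger than any number, so a first name with no labelled cases would otherwise be assigned "Male". With birth year available, the same share could be computed by first name and birth cohort.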

The question arises what to do if a name isn't on the list. Sometimes there's nothing you can do, but in some languages, certain characteristics of names are associated with gender. In Spanish, for example, first names ending in -o are typically male. That's a strong predictor. In English, male names tend to be shorter, though that's a weak predictor.
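For names that are not on the list, one way to look for such characteristics is to let the labelled cases show which name features go with gender. The sketch below uses the last word of the name as the feature, which is my assumption about what might be informative in this data, not something established in this thread; substr(firstname, -1, 1) would give the final letter instead, as in the Spanish example. The rows are taken from post #1; lastword, p_male, n_known and guess are hypothetical names, and requiring at least 2 labelled cases is an arbitrary choice.

```stata
* Rows taken from the example in post #1
clear
input str37 name str6 gender
"Kalsoom Begum" ""
"Faizan Khan"   ""
"Qaisar Zaman"  ""
"Maqsoom Shah"  "Female"
"Rana Begum"    "Female"
"Nihada Begum"  "Female"
"Nadeem Khan"   "Male"
"Abid Shah"     "Male"
"Nawab Khan"    "Male"
end

* Candidate feature: last word of the name (left blank for one-word names)
generate str37 clean = lower(strtrim(stritrim(name)))
generate str37 lastword = word(clean, -1)
replace lastword = "" if wordcount(clean) < 2

* Share male and number of labelled cases for each last word
generate byte male = gender == "Male" if gender != ""
bysort lastword: egen double p_male = mean(male)
bysort lastword: egen long n_known = count(male)

* Assign only when every labelled case with that last word agrees
generate str6 guess = gender
replace guess = "Male"   if gender == "" & lastword != "" & n_known >= 2 & p_male == 1
replace guess = "Female" if gender == "" & lastword != "" & n_known >= 2 & p_male == 0
```

In this small sample "Shah" appears with both genders, so it would not be used, and "Zaman" has no labelled cases at all. How well any such feature predicts gender can be checked by holding out part of the labelled 60% and comparing guess with the recorded gender.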

Let us know what you come up with!
Ataullah Khan

Join Date: Jun 2017

Posts: 41
#9

22 Sep 2019, 22:43

Thanks. I will try it and let you all know.