Create a dichotomous variable based on the titles of individuals

Chris Boulis

Join Date: Feb 2019
Posts: 368

Create a dichotomous variable based on the titles of individuals

26 May 2025, 22:45

Hi Statalist.

I'd like to create a dichotomous variable based on the title in the name of each individual in my dataset. The first category will contain all the people who have 'Associate Professor" or 'Professor' as the title of their name, and the second category will contain all the people who have 'Dr' (PhD) as the title of their name. I then wish to total all grant money (-grant2-) by the cat1 (e.g., professors) compared to that in cat2 (doctors/dr) for each year. Here's a sample of my data

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int grant_year strL grant_id long grant2 str1224 lead_investigator
2022 "10461" 5053012 "Prof an oh"                
2021 "11608" 5050741 "A/Prof even att"           
2020 "12773" 5037948 "Prof te ith-les"         
2022 "10468" 5032188 "A/Prof tthias eusler"      
2022 "10467" 5022681 "Prof eng ang"              
2022 "10469" 5012184 "Prof ett ull"              
2022 "10463" 4992796 "Prof dia rawska"           
2017 "16317" 4991057 "Prof vid utens"            
2022 "10462" 4983515 "A/Prof phie terme"         
2020 "12776" 4971151 "Prof trick xton"           
2008 "30025" 4901791 "Prof iguo an"              
2014 "18887" 4894110 "Prof ngai ng"              
2021 "11606" 4689777 "A/Prof en ng Lim"           
2022 "10470" 4560617 "Prof rk askovich"          
2016 "17626" 4552526 "Prof ent nro"              
2013 "20275" 4541209 "A/Prof uart oy"             
2017 "16306" 4486773 "Prof an oh"                
2003 "39248" 4484676 "Prof mond rvis"           
2017 "16302" 4336895 "Dr ean stone"             
2019 "14925" 4320000 "Prof an ith"               
2012 "22804" 4298513 "Dr von ue"                 
2015 "18644" 3795360 "Prof ianne illy-stra"
2023 "40792" 3776112 "A/Prof llip sey"         
2023 "40788" 3772703 "Prof ander ilton"       
2018 "15197" 3765289 "Prof mina bawa"               
2019 "13952" 3458344 "Prof cela lek"            
end

I understand how to create a dichotomous variable, although I am not sure of the function I need to use to search titles from which to create the new variable - I tried using the following code[ - although Stata noted this is "invalid syntax".CODE]gen group = .
replace group = 1 if strstr(lead_investigator, "(Prof|A\/Prof|Asst Prof|Adj A/Prof|Adj/Prof|Em/Prof|Hon A/Prof|associate professor|professor)")
replace group = 2 if strstr(lead_investigator, "(dr|Dr|doctor)")[/CODE]I appreciate help with this. Regards, Chris (Stata SE 17.0).

Tags: None

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17704

27 May 2025, 00:36

Chris:
I would go as follows:

Code:

split lead_investigator, p()
egen first_categorical=group( lead_investigator1 )
label define first_categorical 3 "Prof" 2 "Dr" 1 "A/Prof"
label val first_categorical first_categorical
. list

     +--------------------------------------------------------------------------------------------------------------+
     | grant_~r   grant_id    grant2      lead_investigator   lead_i~1   lead_i~2   lead_in~3   lead_i~4   first_~l |
     |--------------------------------------------------------------------------------------------------------------|
  1. |     2022      10461   5053012             Prof an oh       Prof         an          oh                  Prof |
  2. |     2021      11608   5050741        A/Prof even att     A/Prof       even         att                A/Prof |
  3. |     2020      12773   5037948        Prof te ith-les       Prof         te     ith-les                  Prof |
  4. |     2022      10468   5032188   A/Prof tthias eusler     A/Prof     tthias      eusler                A/Prof |
  5. |     2022      10467   5022681           Prof eng ang       Prof        eng         ang                  Prof |
     |--------------------------------------------------------------------------------------------------------------|
  6. |     2022      10469   5012184           Prof ett ull       Prof        ett         ull                  Prof |
  7. |     2022      10463   4992796        Prof dia rawska       Prof        dia      rawska                  Prof |
  8. |     2017      16317   4991057         Prof vid utens       Prof        vid       utens                  Prof |
  9. |     2022      10462   4983515      A/Prof phie terme     A/Prof       phie       terme                A/Prof |
 10. |     2020      12776   4971151        Prof trick xton       Prof      trick        xton                  Prof |
     |--------------------------------------------------------------------------------------------------------------|
 11. |     2008      30025   4901791           Prof iguo an       Prof       iguo          an                  Prof |
 12. |     2014      18887   4894110           Prof ngai ng       Prof       ngai          ng                  Prof |
 13. |     2021      11606   4689777       A/Prof en ng Lim     A/Prof         en          ng        Lim     A/Prof |
 14. |     2022      10470   4560617       Prof rk askovich       Prof         rk    askovich                  Prof |
 15. |     2016      17626   4552526           Prof ent nro       Prof        ent         nro                  Prof |
     |--------------------------------------------------------------------------------------------------------------|
 16. |     2013      20275   4541209         A/Prof uart oy     A/Prof       uart          oy                A/Prof |
 17. |     2017      16306   4486773             Prof an oh       Prof         an          oh                  Prof |
 18. |     2003      39248   4484676         Prof mond rvis       Prof       mond        rvis                  Prof |
 19. |     2017      16302   4336895           Dr ean stone         Dr        ean       stone                    Dr |
 20. |     2019      14925   4320000            Prof an ith       Prof         an         ith                  Prof |
     |--------------------------------------------------------------------------------------------------------------|
 21. |     2012      22804   4298513              Dr von ue         Dr        von          ue                    Dr |
 22. |     2015      18644   3795360   Prof ianne illy-stra       Prof      ianne   illy-stra                  Prof |
 23. |     2023      40792   3776112        A/Prof llip sey     A/Prof       llip         sey                A/Prof |
 24. |     2023      40788   3772703       Prof ander ilton       Prof      ander       ilton                  Prof |
 25. |     2018      15197   3765289         Prof mina bawa       Prof       mina        bawa                  Prof |
     |--------------------------------------------------------------------------------------------------------------|
 26. |     2019      13952   3458344          Prof cela lek       Prof       cela         lek                  Prof |
     +--------------------------------------------------------------------------------------------------------------+

.

Kind regards,
Carlo
(Stata 19.0)

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35580
#3

27 May 2025, 02:32

Dichotomous (aka binary, indicator, dummy, Boolean, quantal, logical, one-hot) variables have two distinct values, which are often best coded 0 and 1.

I don't know where the idea to call up strstr() -- supposedly a Stata function with some kind of regular expression functionality -- comes from, but it looks like an LLM hallucination to me. Otherwise where did you come across it?

At its simplest your problem looks like

Code:

. gen title = word(lead_investigator, 1) . gen wanted = inlist(lower(title), "dr", "doctor") . . tab title wanted | wanted title | 0 1 | Total -----------+----------------------+---------- A/Prof | 6 0 | 6 Dr | 0 2 | 2 Prof | 18 0 | 18 -----------+----------------------+---------- Total | 24 2 | 26

but in a large data base, I wouldn't be surprised at all sorts of inconsistencies and variations in spelling and punctuation. Naturally you may reverse the coding by negation.

More at https://journals.sagepub.com/doi/pdf...36867X19830921 and in its references.
1 like
Comment

Chris Boulis

Join Date: Feb 2019
Posts: 368

27 May 2025, 05:22

Thank you for the code and links - they are very helpful Nick Cox. One issue is that there are other non-academic titles - as shown below

Code:

tab title EMCR 
           |         EMCR
     title |         0          1 |     Total
-----------+----------------------+----------
    A/Prof |     3,779          0 |     3,779 
       Adj |        15          0 |        15 
  Adj/Prof |        95          0 |        95 
      Asst |        28          0 |        28 
        Dr |         0      4,898 |     4,898 
   Em/Prof |     1,156          0 |     1,156 
       Hon |        66          0 |        66 
        Mr |        68          0 |        68 
       Mrs |         2          0 |         2 
        Ms |        35          0 |        35 
      Prof |    21,245          0 |    21,245 
       Rev |         1          0 |         1 
       and |         1          0 |         1 
      with |         1          0 |         1 
-----------+----------------------+----------
     Total |    26,492      4,898 |    31,390

The code reversed

Code:

gen title2 = word(lead_investigator, 1)
gen SCR = inlist(lower(title2), "prof", "a/prof", "adj", "adj/prof", "em/prof", "hon a/prof", "associate professor", "professor") 
tab title2 SCR

provided

Code:

. tab title2 SCR 
           |          SCR
    title2 |         0          1 |     Total
-----------+----------------------+----------
    A/Prof |         0      3,779 |     3,779 
       Adj |         0         15 |        15 
  Adj/Prof |         0         95 |        95 
      Asst |         28         0 |        28 
        Dr |     4,898          0 |     4,898 
   Em/Prof |         0      1,156 |     1,156 
       Hon |        66          0 |        66 
        Mr |        68          0 |        68 
       Mrs |         2          0 |         2 
        Ms |        35          0 |        35 
      Prof |         0     21,245 |    21,245 
       Rev |         1          0 |         1 
       and |         1          0 |         1 
      with |         1          0 |         1 
-----------+----------------------+----------
     Total |     5,072     26,318 |    31,390

Would I need to add a -replace- to limit titles to those that are academic - either by listing those that are (Prof Dr, Prof, Dr, A\/Prof, Adj, A/Prof, Adj/Prof, Em/Prof, Hon A/Prof, Associate Professor, Professor) or those that are not (Sir, Md, Mr, Ms, Miss, Mrs, Mr/s)?

(Checking my understanding of the use of -lower- in -inlist- makes all text lower case, thereby reducing the need to specify 'Dr' and 'dr'? Very nice). Your help is appreciated.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35580
#5

27 May 2025, 05:34

Let's look at the question again from #1.

The first category will contain all the people who have 'Associate Professor" or 'Professor' as the title of their name, and the second category will contain all the people who have 'Dr' (PhD) as the title of their name

My answer is that given this question it's sufficient to look for "Dr" or variants -- and that works fine with your data example.

Now you're confirming that the real, much larger dataset is more complicated, as anyone would fear from experience.

But it's your call what to do. Clearly the few instances of "and" and "with" need specific surgery, while perhaps you need something other than a dichotomous variable.
Comment
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1364
#6

27 May 2025, 06:42

I'd be interested in seeing the original strings and the code that led to the cases with "and" and "with" as the title -- there may be some upstream issues which should be fixed.

Also, note that just isolating the first word will not pick up two-word titles like "Prof Dr" or "Hon A/Prof" or "Associate Professor" (are there three word titles too? like say "Honorary Associate Professor"?), so you may want to use more robust strategies to figure out the title.
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 368
#7

27 May 2025, 19:56

Thanks Nick Cox and Hemanshu Kumar for your responses. Point taken, I should have made it clearer there were other titles in #1 besides those of interest. Further to the point in #5, I take this meant that I need a categorical variable - I used

Code:

gen title = word(lead_investigator, 1) gen group = . replace group = 1 if inlist(lower(title), "dr", "doctor", "asst prof") replace group = 2 if inlist(lower(title), "prof", "a/prof", "adj", "adj/prof", "em/prof", "hon a/prof", "associate professor", "professor") tab title group

providing the following, which looks like what I want

Code:

tab title group | group title | 1 2 | Total -----------+----------------------+---------- A/Prof | 0 3,779 | 3,779 Adj | 0 15 | 15 Adj/Prof | 0 95 | 95 Dr | 4,898 0 | 4,898 Em/Prof | 0 1,156 | 1,156 Prof | 0 21,245 | 21,245 -----------+----------------------+---------- Total | 4,898 26,290 | 31,188

I located the two instances of "and" and "with" appearing in "investigator_name" - these stem from an error that occurred when importing the file from Excel - I'd appreciate code that could determine if there are other oddities.

Yes Hemanshu Kumar there are some three-word titles - I tried amending the first line in #3 from 1 to 3 though this led to 'no observations' so I'd appreciate code to help me search for multiple word titles.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35580
#8

28 May 2025, 00:56

It's, as said, your call on what categorisation you need. You need

Code:

tab title group, missing

to check up on what happened to the titles not covered by 1 and 2.
1 like
Comment
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1364
#9

Yesterday, 09:53

I would want to make sure we're starting with an exhaustive set of titles -- start by not worrying about errors of inclusion, but only those of exclusion. You might, for instance, want to start with a manual list of core title words (or parts of words) (like Prof, Assoc, Adj, Dr, Hon, Rev, Em) and use functions like strpos() to check for their presence in a name. Then look at the full strings in which they appear and develop strategies to isolate them. For instance, "Prof" may only ever appear in a title, while "Em" might appear in a title or as part of a name like Emily, so you'll need different ways to tackle those two cases. Then see if some of them only appear at certain word positions or as part of certain word combinations, and use that to fine-tune the strategy. It's going to take some manual work, I don't think there is a trivial automated way to do this.
1 like
Comment

Announcement

Create a dichotomous variable based on the titles of individuals

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment