Calculate total research grant income by investigator

Hemanshu Kumar

Join Date: Mar 2015

Posts: 1391
#16

24 Apr 2025, 04:04

That code looks fine to me, other than your use of tag_type in the list command, instead of the variable tag_nos that you created.
Chris Boulis

Join Date: Feb 2019

Posts: 368
#17

24 Apr 2025, 06:37

Yes that worked - thank you Hemanshu Kumar - that was a silly error.
Chris Boulis

Join Date: Feb 2019
Posts: 368

#18

15 May 2025, 06:07

All of the above provides me with really useful summary statistics for totals (in terms of total grants and total number of grants), Hemanshu Kumar. However, given the complexity of the long time frame that I am analysing (2001-2023), as noted by Nick Cox in #3, I thought it would provide more context to view these data by year or by period (e.g., five-yearly periods). I created the following for this purpose:

Code:

gen byte yr5 = 0 if grant_year == 2001 & !missing(grant_year)
replace yr5 = 1 if inrange(grant_year, 2002, 2006)
replace yr5 = 2 if inrange(grant_year, 2007, 2011)
replace yr5 = 3 if inrange(grant_year, 2012, 2017)
replace yr5 = 4 if inrange(grant_year, 2018, 2023)

label define yr5 1 "[1] 2002 to 2006" 2 "[2] 2007 to 2011" 3 "[3] 2012 to 2017" 4 "[4] 2018 to 2023", modify
label values yr5 yr5

tab yr5

             yr5 |      Freq.     Percent        Cum.
-----------------+-----------------------------------
               0 |         20        0.02        0.02
[1] 2002 to 2006 |     28,056       26.27       26.29
[2] 2007 to 2011 |     26,101       24.44       50.73
[3] 2012 to 2017 |     27,787       26.02       76.75
[4] 2018 to 2023 |     24,831       23.25      100.00
-----------------+-----------------------------------
           Total |    106,795      100.00

(I note that I excluded 2001 due to very few observations).

I have made a couple of attempts to do this without the desired result.

Code:

egen grant_name_yr = total(grant2), by(grant_year investigator_name)
egen tag_name_yr = tag(investigator_name) // I could easily then swap between 'grant_year' and 'yr5'

gsort -tag_name_yr -grant_name_yr
format %16.0gc grant2 grant_name_yr

list grant_year grant_name_yr investigator_name in 1/20 if tag_name_yr, abbrev(20) noobs

which provides the following:

Code:

  +------------------------------------------------+
  | grant_year   grant_name_yr   investigator_name |
  |------------------------------------------------|
  |       2003     300,987,008             Adr Tur |
  |       2020      72,452,928             Col Jac |
  |       2020      71,822,240             Bra She |
  |       2023      70,350,000             Nat Cur |
  |       2023      70,298,584             Sco Fos |
  |------------------------------------------------|
  |       2023      70,298,584             Iri Kab |
  |       2017      69,831,296             Mat Dav |
  |       2017      69,782,288              Jar Co |
  |       2020      68,816,312             Sim Smi |
  |       2017      68,211,824             Kir Mck |
  |------------------------------------------------|
  |       2010      54,195,532             Tor Leh |
  |       2010      53,833,512             Rob Guy |
  |       2010      53,586,600             Jon Cro |
  |       2010      53,586,600             Joh Mor |
  |       2010      53,586,600             Jul Qui |
  |------------------------------------------------|
  |       2003      50,000,000             Dav Col |
  |       2011      44,070,168             Dav Pan |
  |       2020      42,067,244             Elh Dor |
  |       2023      41,636,032             Shi Qia |
  |       2023      41,613,760             Hei Ebe |
  +------------------------------------------------+

I would like to see the years in order, with each year displaying the investigator who receives the most grant money (or the most grants) in that year from 2002 to 2023 (or in each of the four periods in 'yr5' if I use that). I would appreciate some guidance on how to do this. Kind regards, Chris.
(Stata SE 17.0)

Last edited by Chris Boulis; 15 May 2025, 06:10.

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1391

#19

16 May 2025, 21:24

To list the investigator with the highest number of grants in each year, you could do:

Code:

gsort grant_year -nos_grant
bysort grant_year: gen top_one = (_n == 1)

which produces:

Code:

. list grant_year investigator_name nos_grant if top_one, abbrev(20) sep(0) noobs

  +--------------------------------------------+
  | grant_year   investigator_name   nos_grant |
  |--------------------------------------------|
  |       2019             Joh Car           3 |
  |       2020             Joh Car           3 |
  |       2021             Bry Bor           2 |
  |       2022             Man Che           2 |
  |       2023             Ale Com           2 |
  +--------------------------------------------+
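
A possible one-step variant of the same idea, offered only as a sketch: it assumes the variables nos_grant, grant_year and investigator_name exist as in the listing above, and the name top_one_alt is purely illustrative. Putting the sort keys in parentheses after bysort makes the within-year ordering explicit, and adding investigator_name as a last key makes the choice among tied investigators reproducible (though still arbitrary).

```stata
* Sketch, not a replacement for the code above; top_one_alt is a made-up name.
* Within each year, sort by nos_grant and then by name, and flag the last row,
* which is the one with the largest nos_grant.
* Caution: missing values of nos_grant sort last and would be flagged.
bysort grant_year (nos_grant investigator_name): gen byte top_one_alt = (_n == _N)

list grant_year investigator_name nos_grant if top_one_alt, abbrev(20) sep(0) noobs
```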

As before though, I would worry about ties. If you want to list all the investigators who had the highest number of grants in a year, I would instead do:

Code:

egen top_one = rank(nos_grant), field by(grant_year)

to produce:

Code:

. list grant_year investigator_name nos_grant if top_one == 1, abbrev(20) sep(0) noobs

  +--------------------------------------------+
  | grant_year   investigator_name   nos_grant |
  |--------------------------------------------|
  |       2019             Joh Car           3 |
  |       2020             Joh Car           3 |
  |       2020             Joh Car           3 |
  |       2021             Bry Bor           2 |
  |       2021             Ale Com           2 |
  |       2022             Man Che           2 |
  |       2022             Ana Roz           2 |
  |       2023             Ale Com           2 |
  |       2023             Ana Roz           2 |
  |       2023             Bry Bor           2 |
  +--------------------------------------------+
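
The same ranking approach could be adapted to grant money rather than grant counts. The following is a minimal sketch, assuming grant2 holds the grant amount and that the data have one row per grant and investigator, as in #18. The names grant_inv_yr, tag_inv_yr and rank_inv_yr are illustrative and not from the thread. Tagging by year and investigator together (rather than by investigator alone) keeps one row per investigator in each year, which should also avoid repeated rows for the same investigator within a year.

```stata
* Sketch under the assumptions stated above; new variable names are illustrative.
* Yearly grant total per investigator, stored as double to limit rounding.
egen double grant_inv_yr = total(grant2), by(grant_year investigator_name)

* Tag exactly one observation per (year, investigator) pair.
egen byte tag_inv_yr = tag(grant_year investigator_name)

* Rank the yearly totals within each year, using tagged rows only.
* With the field option the largest total gets rank 1 and ties share a rank.
egen rank_inv_yr = rank(grant_inv_yr) if tag_inv_yr, field by(grant_year)

format %16.0gc grant_inv_yr
sort grant_year investigator_name
list grant_year investigator_name grant_inv_yr if rank_inv_yr == 1, abbrev(20) sep(0) noobs
```

Replacing grant_year with yr5 throughout should give the equivalent listing by period.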

Last edited by Hemanshu Kumar; 16 May 2025, 21:30.

Chris Boulis

Join Date: Feb 2019

Posts: 368
#20

21 May 2025, 08:41

Thank you for the code help, Hemanshu Kumar. I do want to list one entry per year, though. I understand this may be an issue with ties, as you note for our small sample here, although my dataset is quite large so it may be less of an issue. That said, I'm specifically interested in estimating the largest grant-earning investigators and institutions. As such, I ran the code in #19, swapping out nos_grant for 'grant_total', to find the top grant-earning investigator for each year between 2001 and 2023:

Code:

gsort grant_year -grant_investigator
bysort grant_year: gen top_one = (_n == 1)
list grant_year grant_total investigator_name if top_one, abbrev(20) sep(0) noobs

The problem is that one of the investigators (in my full dataset) appeared for 14 different years. I compared this with the results for the top 20 grant-earning investigators over the whole period, and that investigator appeared once with the same total grant value, so I am confident there is a duplication issue that may be linked to my taking investigator names from the 'lead' and 'other' columns and treating them equally in the analysis. As such, I think I should just analyse this information based on the 'lead_investigator' names. To do so, I would just leave out this code:

Code:

split other_investigator, gen(investigator) p(;)
drop other_investigator
rename lead_investigator investigator0
gen `c(obs_t)' x = _n // added - create temp var to uniquely id each obs.
reshape long investigator, i(x id) j(num) // i(x id grant2) 'x' added to account for new code
replace investigator = trim(itrim(investigator))
drop if missing(investigator)
drop x

and revise references to "investigator" in the code back to "lead_investigator". Is there any other code that will need amending? Or is there another solution to this issue? Kind regards, Chris. (Stata SE 17.0)
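
One check that might help narrow this down, offered as a sketch only: it assumes grant_total and investigator_name exist as used above, and the names total_constant and tag_inv are illustrative. If grant_total turns out to be identical across all of an investigator's rows, it is a whole-period total rather than a yearly one. That alone could explain the same investigator and value appearing in many years, independently of the lead/other question.

```stata
* Hypothetical diagnostic; total_constant and tag_inv are made-up names.
* Within each investigator, compare the smallest and largest grant_total.
bysort investigator_name (grant_total): gen byte total_constant = (grant_total[1] == grant_total[_N])

* Count investigators (one tagged row each) by whether the total varies.
egen byte tag_inv = tag(investigator_name)
tab total_constant if tag_inv
```

If total_constant is 1 for every investigator, a total computed by both grant_year and investigator_name may be the quantity to sort on within each year.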