Generating a yearly count variable based on a string identifier

Wali Ullah

Join Date: Mar 2019

Posts: 51
#1

Generating a yearly count variable based on a string identifier

03 Apr 2021, 21:36

Hi Everyone,

I am looking to create a yearly count and cumulative sum variable based on a string identifier, CUSIP. Following is the example generated by dataex:

input long patent_num int cites str8 CUSIP str4 year
3915372 16 "00003210" "1975"
3946915 37 "00003210" "1976"
3935990 20 "00003210" "1976"
3935991 17 "00003210" "1976"
4039119 12 "00003210" "1977"
4039121 19 "00003210" "1977"
5839505 34 "00036020" "1998"
5826641 24 "00036020" "1998"
5738167 10 "00036020" "1998"

So for example, for the CUSIP code 00036020, the count for patent would be 3 and the cumulative cite would be 68 (34+24+10). What should be the appropiate code to generate such count variables for inputs like patent, cites etc?

Thanks in advance!
Tags: None

Joseph Coveney

Join Date: Apr 2014
Posts: 4409

03 Apr 2021, 22:06

You could either

Code:

collapse (count) citation_tally = cites (sum) citation_total = cites, by(CUSIP)

or if you wanted to clutter up the current dataset, then

Code:

bysort CUSIP: generate int citation_tally = _N
by CUSIP: generate int citation_total = sum(cites)
by CUSIP: replace citation_total = citation_total[_N]

Comment

Joseph Coveney

Join Date: Apr 2014

Posts: 4409
#3

04 Apr 2021, 00:52

To make it yearly, add year to each of the -by-s.

Code:

collapse . . ., by(CUSIP year) bysort CUSIP year: . . . by CUSIP year: . . .
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35685
#4

04 Apr 2021, 03:26

You'll need to

Code:

destring year, replace

sooner or later.
Comment

Joro Kolev

Join Date: Aug 2018
Posts: 3050

04 Apr 2021, 06:59

What Joseph shows in #2 is how an advanced Stata user would do it. The sooner one learns these techniques the better.

However a less advanced user, or a user who does not pay close attention, might easily hurt himself with such techniques.

It is generally safer and requiring less attention to use the -egen- set of functions, because they are designed so that we do not hurt ourselves when we do not really know what we are doing, or we are not very attentive.

Code:

. egen count = count(cites), by(CUSIP)

. bysort CUSIP: generate int citation_tally = _N

. egen total = total(patent), by(CUSIP)

. dataex

----------------------- copy starting from the next line -----------------------


	Code:
	* Example generated by -dataex-. To install: ssc install dataex
clear
input long patent_num int cites str8 CUSIP str4 year float count int citation_tally float total
3915372 16 "00003210" "1975" 5 6 23812508
3946915 37 "00003210" "1976" 5 6 23812508
3935990 20 "00003210" "1976" 5 6 23812508
3935991 17 "00003210" "1976" 5 6 23812508
4039119  . "00003210" "1977" 5 6 23812508
4039121 19 "00003210" "1977" 5 6 23812508
5839505 34 "00036020" "1998" 3 3 17404312
5826641 24 "00036020" "1998" 3 3 17404312
5738167 10 "00036020" "1998" 3 3 17404312
end
------------------ copy up to and including the previous line ------------------

Listed 9 out of 9 observations

. summ count citation

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       count |          9    4.333333           1          3          5
citation_t~y |          9           5         1.5          3          6

.

I introduced one missing value in the cites to show that the native way of counting by referring to _N, which works totally fine for the original data presented, backfires if we have missing values among the variable we are counting.

Announcement