Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generating a yearly count variable based on a string identifier

    Hi Everyone,

    I am looking to create a yearly count and cumulative sum variable based on a string identifier, CUSIP. Following is the example generated by dataex:

    input long patent_num int cites str8 CUSIP str4 year
    3915372 16 "00003210" "1975"
    3946915 37 "00003210" "1976"
    3935990 20 "00003210" "1976"
    3935991 17 "00003210" "1976"
    4039119 12 "00003210" "1977"
    4039121 19 "00003210" "1977"
    5839505 34 "00036020" "1998"
    5826641 24 "00036020" "1998"
    5738167 10 "00036020" "1998"

    So for example, for the CUSIP code 00036020, the count for patent would be 3 and the cumulative cite would be 68 (34+24+10). What should be the appropiate code to generate such count variables for inputs like patent, cites etc?

    Thanks in advance!

  • #2
    You could either
    Code:
    collapse (count) citation_tally = cites (sum) citation_total = cites, by(CUSIP)
    or if you wanted to clutter up the current dataset, then
    Code:
    bysort CUSIP: generate int citation_tally = _N
    by CUSIP: generate int citation_total = sum(cites)
    by CUSIP: replace citation_total = citation_total[_N]

    Comment


    • #3
      To make it yearly, add year to each of the -by-s.
      Code:
      collapse . . ., by(CUSIP year)
      bysort CUSIP year: . . .
      by CUSIP year: . . .

      Comment


      • #4
        You'll need to

        Code:
        destring year, replace
        sooner or later.

        Comment


        • #5
          What Joseph shows in #2 is how an advanced Stata user would do it. The sooner one learns these techniques the better.

          However a less advanced user, or a user who does not pay close attention, might easily hurt himself with such techniques.

          It is generally safer and requiring less attention to use the -egen- set of functions, because they are designed so that we do not hurt ourselves when we do not really know what we are doing, or we are not very attentive.

          Code:
          . egen count = count(cites), by(CUSIP)
          
          . bysort CUSIP: generate int citation_tally = _N
          
          . egen total = total(patent), by(CUSIP)
          
          . dataex
          
          ----------------------- copy starting from the next line -----------------------
          
          
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input long patent_num int cites str8 CUSIP str4 year float count int citation_tally float total
          3915372 16 "00003210" "1975" 5 6 23812508
          3946915 37 "00003210" "1976" 5 6 23812508
          3935990 20 "00003210" "1976" 5 6 23812508
          3935991 17 "00003210" "1976" 5 6 23812508
          4039119  . "00003210" "1977" 5 6 23812508
          4039121 19 "00003210" "1977" 5 6 23812508
          5839505 34 "00036020" "1998" 3 3 17404312
          5826641 24 "00036020" "1998" 3 3 17404312
          5738167 10 "00036020" "1998" 3 3 17404312
          end
          ------------------ copy up to and including the previous line ------------------ Listed 9 out of 9 observations . summ count citation Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- count | 9 4.333333 1 3 5 citation_t~y | 9 5 1.5 3 6 .
          I introduced one missing value in the cites to show that the native way of counting by referring to _N, which works totally fine for the original data presented, backfires if we have missing values among the variable we are counting.

          Comment

          Working...
          X