Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to count firm number in panel data?

    Hello,

    I have some troubles in counting the firm number in panel data. I have 3741 fiscal year observations during 1994-2006. And I want to count how many firms are there in my sample.

    I used the command below:
    bysort fyear gvkey
    count gvkey

    "fyear" is the fiscal year and "gvkey" is the company ID.

    But there always shows an error as ": required r(100);" in Stata.

    So could you please tell me what I am doing wrong? Thank you very much!




  • #2
    If you want the total number of unique company IDs in your entire dataset, then
    Code:
    preserve
    contract gvkey
    count
    restore
    and if you want the total unique company IDs in each fiscal year, then
    Code:
    preserve
    contract fyear gvkey, freq(discard)
    contract fyear, freq(total_gvkeys)
    list
    restore

    Comment


    • #3
      Boli:
      if you're interested in the number of companies in your dataset, an alternative to Joseph's helpful advice is:
      Code:
      use "https://www.stata-press.com/data/r17/nlswork.dta"
      . egen uni_idcode=tag(idcode)
      
      . tab uni_idcode
      
      tag(idcode) |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                0 |     23,823       83.49       83.49
                1 |      4,711       16.51      100.00
      ------------+-----------------------------------
            Total |     28,534      100.00
      
      .
      The "https://www.stata-press.com/data/r17/nlswork.dta" dataset collected (longitudinal) data on 4,711 employees.
      Kind regards,
      Carlo
      (Stata 19.0)

      Comment


      • #4
        Answering these questions can be done while keeping the same data layout. distinct from the Stata Journal provides one handle. Here is its default in action.


        Code:
        . webuse nlswork , clear
        (National Longitudinal Survey of Young Women, 14-24 years old in 1968)
        
        .
        .
        . distinct
        
        ---------------------------------
                  |     total   distinct
        ----------+----------------------
           idcode |     28534       4711
             year |     28534         15
         birth_yr |     28534         14
              age |     28510         33
             race |     28534          3
              msp |     28518          2
          nev_mar |     28518          2
            grade |     28532         19
         collgrad |     28534          2
         not_smsa |     28526          2
           c_city |     28526          2
            south |     28526          2
         ind_code |     28193         12
         occ_code |     28413         13
            union |     19238          2
           wks_ue |     22830         61
          ttl_exp |     28534       4744
           tenure |     28101        270
            hours |     28467         85
         wks_work |     27831        105
          ln_wage |     28534       8173
        ---------------------------------
        .
        To find out more, do this. You will see the latest update at the time of writing and the original 2008 paper

        Code:
        . search distinct, sj
        
        Search of official help files, FAQs, Examples, and Stata Journals
        
        SJ-20-4 dm0042_3  . . . . . . . . . . . . . . . . Software update for distinct
                (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
                Q4/20   SJ 20(4):1028--1030
                sort() option has been added
        
        SJ-20-2 pr0046_1  . . . . . . . . . . .  Speaking Stata: More ways for rowwise
                . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
                Q2/20   SJ 20(2):481--488                                (no commands)
                focuses on returning which variable or variables are equal
                to the maximum or minimum in a row
        
        SJ-15-3 dm0042_2  . . . . . . . . . . . . . . . . Software update for distinct
                (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
                Q3/15   SJ 15(3):899
                improved table format and display of large numbers of
                observations
        
        SJ-12-2 dm0042_1  . . . . . . . . . . . . . . . . Software update for distinct
                (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
                Q2/12   SJ 12(2):352
                options added to restrict output to variables with a minimum
                or maximum of distinct values
        
        SJ-11-2 dm0057  . . . . . . . . .  Stata tip 99: Taking extra care with encode
                . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C. Schechter
                Q2/11   SJ 11(2):321--322                                (no commands)
                tip on safely using encode across datasets
        
        SJ-9-1  pr0046  . . . . . . . . . . . . . . . . . . .  Speaking Stata: Rowwise
                (help rowsort, rowranks if installed) . . . . . . . . . . .  N. J. Cox
                Q1/09   SJ 9(1):137--157
                shows how to exploit functions, egen functions, and Mata
                for working rowwise; rowsort and rowranks are introduced
        
        SJ-8-4  dm0042  . . . . . . . . . . . .  Speaking Stata: Distinct observations
                (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
                Q4/08   SJ 8(4):557--568
                shows how to answer questions about distinct observations
                from first principles; provides a convenience command
        Although distinct does what it does, questions here on Statalist often have this two-step (or three-step) answer in the spirit of @Carlo Lazzaro's suggestion. See the 2008 paper for more discussion.

        Code:
        . egen tag = tag(ind_code idcode)
        
        . egen distinct = total(tag) , by(ind_code)
        
        . tabdisp ind_code, c(distinct) stubwidth(12)
        
        -------------------------
        Industry of  |
        employment   |   distinct
        -------------+-----------
                   1 |        129
                   2 |         25
                   3 |        136
                   4 |       1563
                   5 |        418
                   6 |       1768
                   7 |        717
                   8 |        495
                   9 |        800
                  10 |        128
                  11 |       2318
                  12 |        556
                   . |          0
        -------------------------
        In the spirit of precision, or if you prefer of pedantry, I note that unique still carries the main flavour of occurring once only, and as implied by the command name I recommend the word distinct in this context. Again, this is discussed at greater length in the 2008 paper.

        The tagging idea is an old one. I guess it became most vivid to me in reading about APL in the middle 1970s: APL was a language which made very heavy use of 0. 1 logical or Boolean variables, although just creating such variables is easy enough so long as numeric values are supported. In a Stata context egen, tag() goes back to 1999, and perhaps before that grew out of Statalist exchanges, although the archives from that time disappeared long ago, as have most of my memories of what was discussed.

        Here is the nub of the matter.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str3 answer
        "cat"
        "cat"
        "cat"
        "dog"
        "dog"
        end
        
         bysort answer : gen tag = _n == 1
        
        list
        
             +--------------+
             | answer   tag |
             |--------------|
          1. |    cat     1 |
          2. |    cat     0 |
          3. |    cat     0 |
          4. |    dog     1 |
          5. |    dog     0 |
             +--------------+
        
        
        su tag,  meanonly
        
        di r(sum)
        2

        In the variable answer we want to count distinct values. We can do that by first assigning scores 1 just once and 0 otherwise to each distinct answer. There are two systematic ways of doing that: to tag the first occurrence or to tag the last occurrence, which will be the same if there is only one occurrence. (Manifestly, other rules we might think up like tagging the second occurrence or the seventh occurrence will fail if there are fewer than 2 or 7 occurrences; more positively put, our rule is determined by what will work with singleton answers that occur just once.)

        Once we have such a score, we just look for the sum of the scores (which adds up 1s and 0s, but clearly the 0s don't count, in more ways than one). In our toy problem summarize will give it to us.

        In the very common case with some grouping variable G and wanting to know how many distinct values of V there are for each distinct value of G, it is convenient (but not essential) to reach for egen functions tag() and total() to do the work.

        Omniscient readers will know that a dedicated nvals() function has been lurking within egenmore on SSC for about 20 years, but this way of doing it helps underscore the logic.
        Last edited by Nick Cox; 27 May 2022, 02:54.

        Comment


        • #5
          Thank you very much for your replies, Joseph, Carlo and Nick. The methods above are very useful and I have learned a lot by reading your replies and operating them into my study. I have got total number of firms (835) occured at least once in my sample and the number of firms in each fiscal year:


          +------------------+
          | fyear total_~s |
          |------------------|
          1. | 1994 263 |
          2. | 1995 302 |
          3. | 1996 308 |
          4. | 1997 307 |
          5. | 1998 305 |
          |------------------|
          6. | 1999 310 |
          7. | 2000 301 |
          8. | 2001 280 |
          9. | 2002 272 |
          10. | 2003 241 |
          |------------------|
          11. | 2004 289 |
          12. | 2005 292 |
          13. | 2006 271 |
          +------------------+


          And now I would like to ask one more question. How to get the number of firms recurred in all fiscal years?



          Thank you very much. Hope you all have a good day!

          Comment


          • #6
            Your period of record spans 13 years, So you want an indicator for being present for all 13 years. Pick any year to count the number of such firms.

            Code:
            bysort gvkey : gen ishighest = _N == 13 
            count if ishighest & fyear == 2006

            Comment


            • #7
              Boli:
              in terms of efficiency, Nick's code has an edge over what follows (tongue-in-cheek: strange indeed! ), that can be considered another take on the very same issue (the panel dataset stretches over 15 years):
              Code:
              . use "https://www.stata-press.com/data/r17/nlswork.dta"
              (National Longitudinal Survey of Young Women, 14-24 years old in 1968)
              
              . tab year
              
                Interview |
                     year |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                       68 |      1,375        4.82        4.82
                       69 |      1,232        4.32        9.14
                       70 |      1,686        5.91       15.05
                       71 |      1,851        6.49       21.53
                       72 |      1,693        5.93       27.47
                       73 |      1,981        6.94       34.41
                       75 |      2,141        7.50       41.91
                       77 |      2,171        7.61       49.52
                       78 |      1,964        6.88       56.40
                       80 |      1,847        6.47       62.88
                       82 |      2,085        7.31       70.18
                       83 |      1,987        6.96       77.15
                       85 |      2,085        7.31       84.45
                       87 |      2,164        7.58       92.04
                       88 |      2,272        7.96      100.00
              ------------+-----------------------------------
                    Total |     28,534      100.00
              
              
              . bysort idcode: egen wanted=count(idcode)
              
              
              . tab wanted
              
                   wanted |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                        1 |        547        1.92        1.92
                        2 |        996        3.49        5.41
                        3 |      1,452        5.09       10.50
                        4 |      1,644        5.76       16.26
                        5 |      2,105        7.38       23.63
                        6 |      2,388        8.37       32.00
                        7 |      2,415        8.46       40.47
                        8 |      2,584        9.06       49.52
                        9 |      2,718        9.53       59.05
                       10 |      2,700        9.46       68.51
                       11 |      2,222        7.79       76.30
                       12 |      1,896        6.64       82.94
                       13 |      1,911        6.70       89.64
                       14 |      1,666        5.84       95.48
                       15 |      1,290        4.52      100.00
              ------------+-----------------------------------
                    Total |     28,534      100.00
              
              .
              Kind regards,
              Carlo
              (Stata 19.0)

              Comment


              • #8
                The identifier that occurs most frequently will often but not necessarily be one present in all years.

                In this case, Carlo has checked that it's the same number, so fine.

                Code:
                bysort idcode : gen wanted = _N 
                is even more direct.


                EDIT: The code had too many colons.
                Last edited by Nick Cox; 29 May 2022, 06:21.

                Comment


                • #9
                  Nick is clearly right.
                  I'll use the -colon- in excess at the first -bysort- occasion .
                  Last edited by Carlo Lazzaro; 29 May 2022, 05:56.
                  Kind regards,
                  Carlo
                  (Stata 19.0)

                  Comment


                  • #10
                    Everyone needs their software to DWIM. http://www.catb.org/jargon/html/D/DWIM.html

                    Comment


                    • #11
                      Great indeed, Nick!
                      I was not aware of that link.
                      Kind regards,
                      Carlo
                      (Stata 19.0)

                      Comment


                      • #12
                        I am sorry for the late reply. Thank you very much, Nick and Carlo. Your advices are really useful. I will apply them into my study. Thanks again!

                        Comment

                        Working...
                        X