How to count firm number in panel data?

Boli Yi

Join Date: May 2022

Posts: 3
#1

How to count firm number in panel data?

27 May 2022, 00:11

Hello,

I have some troubles in counting the firm number in panel data. I have 3741 fiscal year observations during 1994-2006. And I want to count how many firms are there in my sample.

I used the command below:
bysort fyear gvkey
count gvkey

"fyear" is the fiscal year and "gvkey" is the company ID.

But there always shows an error as ": required r(100);" in Stata.

So could you please tell me what I am doing wrong? Thank you very much!
Tags: firm number, panel data
Joseph Coveney

Join Date: Apr 2014

Posts: 4420
#2

27 May 2022, 00:47

If you want the total number of unique company IDs in your entire dataset, then

Code:

preserve contract gvkey count restore

and if you want the total unique company IDs in each fiscal year, then

Code:

preserve contract fyear gvkey, freq(discard) contract fyear, freq(total_gvkeys) list restore
1 like
Comment

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17712

27 May 2022, 01:02

Boli:
if you're interested in the number of companies in your dataset, an alternative to Joseph's helpful advice is:

Code:

use "https://www.stata-press.com/data/r17/nlswork.dta"
. egen uni_idcode=tag(idcode)

. tab uni_idcode

tag(idcode) |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     23,823       83.49       83.49
          1 |      4,711       16.51      100.00
------------+-----------------------------------
      Total |     28,534      100.00

.

The "https://www.stata-press.com/data/r17/nlswork.dta" dataset collected (longitudinal) data on 4,711 employees.

Kind regards,
Carlo
(Stata 19.0)

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35724

27 May 2022, 02:22

Answering these questions can be done while keeping the same data layout. distinct from the Stata Journal provides one handle. Here is its default in action.

Code:

. webuse nlswork , clear
(National Longitudinal Survey of Young Women, 14-24 years old in 1968)

.
.
. distinct

---------------------------------
          |     total   distinct
----------+----------------------
   idcode |     28534       4711
     year |     28534         15
 birth_yr |     28534         14
      age |     28510         33
     race |     28534          3
      msp |     28518          2
  nev_mar |     28518          2
    grade |     28532         19
 collgrad |     28534          2
 not_smsa |     28526          2
   c_city |     28526          2
    south |     28526          2
 ind_code |     28193         12
 occ_code |     28413         13
    union |     19238          2
   wks_ue |     22830         61
  ttl_exp |     28534       4744
   tenure |     28101        270
    hours |     28467         85
 wks_work |     27831        105
  ln_wage |     28534       8173
---------------------------------

.
To find out more, do this. You will see the latest update at the time of writing and the original 2008 paper

Code:

. search distinct, sj

Search of official help files, FAQs, Examples, and Stata Journals

SJ-20-4 dm0042_3  . . . . . . . . . . . . . . . . Software update for distinct
        (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
        Q4/20   SJ 20(4):1028--1030
        sort() option has been added

SJ-20-2 pr0046_1  . . . . . . . . . . .  Speaking Stata: More ways for rowwise
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        Q2/20   SJ 20(2):481--488                                (no commands)
        focuses on returning which variable or variables are equal
        to the maximum or minimum in a row

SJ-15-3 dm0042_2  . . . . . . . . . . . . . . . . Software update for distinct
        (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
        Q3/15   SJ 15(3):899
        improved table format and display of large numbers of
        observations

SJ-12-2 dm0042_1  . . . . . . . . . . . . . . . . Software update for distinct
        (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
        Q2/12   SJ 12(2):352
        options added to restrict output to variables with a minimum
        or maximum of distinct values

SJ-11-2 dm0057  . . . . . . . . .  Stata tip 99: Taking extra care with encode
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C. Schechter
        Q2/11   SJ 11(2):321--322                                (no commands)
        tip on safely using encode across datasets

SJ-9-1  pr0046  . . . . . . . . . . . . . . . . . . .  Speaking Stata: Rowwise
        (help rowsort, rowranks if installed) . . . . . . . . . . .  N. J. Cox
        Q1/09   SJ 9(1):137--157
        shows how to exploit functions, egen functions, and Mata
        for working rowwise; rowsort and rowranks are introduced

SJ-8-4  dm0042  . . . . . . . . . . . .  Speaking Stata: Distinct observations
        (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
        Q4/08   SJ 8(4):557--568
        shows how to answer questions about distinct observations
        from first principles; provides a convenience command

Although distinct does what it does, questions here on Statalist often have this two-step (or three-step) answer in the spirit of @Carlo Lazzaro's suggestion. See the 2008 paper for more discussion.

Code:

. egen tag = tag(ind_code idcode)

. egen distinct = total(tag) , by(ind_code)

. tabdisp ind_code, c(distinct) stubwidth(12)

-------------------------
Industry of  |
employment   |   distinct
-------------+-----------
           1 |        129
           2 |         25
           3 |        136
           4 |       1563
           5 |        418
           6 |       1768
           7 |        717
           8 |        495
           9 |        800
          10 |        128
          11 |       2318
          12 |        556
           . |          0
-------------------------

In the spirit of precision, or if you prefer of pedantry, I note that unique still carries the main flavour of occurring once only, and as implied by the command name I recommend the word distinct in this context. Again, this is discussed at greater length in the 2008 paper.

The tagging idea is an old one. I guess it became most vivid to me in reading about APL in the middle 1970s: APL was a language which made very heavy use of 0. 1 logical or Boolean variables, although just creating such variables is easy enough so long as numeric values are supported. In a Stata context egen, tag() goes back to 1999, and perhaps before that grew out of Statalist exchanges, although the archives from that time disappeared long ago, as have most of my memories of what was discussed.

Here is the nub of the matter.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str3 answer
"cat"
"cat"
"cat"
"dog"
"dog"
end

 bysort answer : gen tag = _n == 1

list

     +--------------+
     | answer   tag |
     |--------------|
  1. |    cat     1 |
  2. |    cat     0 |
  3. |    cat     0 |
  4. |    dog     1 |
  5. |    dog     0 |
     +--------------+


su tag,  meanonly

di r(sum)
2

In the variable answer we want to count distinct values. We can do that by first assigning scores 1 just once and 0 otherwise to each distinct answer. There are two systematic ways of doing that: to tag the first occurrence or to tag the last occurrence, which will be the same if there is only one occurrence. (Manifestly, other rules we might think up like tagging the second occurrence or the seventh occurrence will fail if there are fewer than 2 or 7 occurrences; more positively put, our rule is determined by what will work with singleton answers that occur just once.)

Once we have such a score, we just look for the sum of the scores (which adds up 1s and 0s, but clearly the 0s don't count, in more ways than one). In our toy problem summarize will give it to us.

In the very common case with some grouping variable G and wanting to know how many distinct values of V there are for each distinct value of G, it is convenient (but not essential) to reach for egen functions tag() and total() to do the work.

Omniscient readers will know that a dedicated nvals() function has been lurking within egenmore on SSC for about 20 years, but this way of doing it helps underscore the logic.

Last edited by Nick Cox; 27 May 2022, 02:54.

Comment

Boli Yi

Join Date: May 2022

Posts: 3
#5

27 May 2022, 22:11

Thank you very much for your replies, Joseph, Carlo and Nick. The methods above are very useful and I have learned a lot by reading your replies and operating them into my study. I have got total number of firms (835) occured at least once in my sample and the number of firms in each fiscal year:

+------------------+
| fyear total_~s |
|------------------|
1. | 1994 263 |
2. | 1995 302 |
3. | 1996 308 |
4. | 1997 307 |
5. | 1998 305 |
|------------------|
6. | 1999 310 |
7. | 2000 301 |
8. | 2001 280 |
9. | 2002 272 |
10. | 2003 241 |
|------------------|
11. | 2004 289 |
12. | 2005 292 |
13. | 2006 271 |
+------------------+

And now I would like to ask one more question. How to get the number of firms recurred in all fiscal years?

Thank you very much. Hope you all have a good day!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#6

28 May 2022, 01:30

Your period of record spans 13 years, So you want an indicator for being present for all 13 years. Pick any year to count the number of such firms.

Code:

bysort gvkey : gen ishighest = _N == 13 count if ishighest & fyear == 2006
2 likes
Comment

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17712

29 May 2022, 05:25

Boli:
in terms of efficiency, Nick's code has an edge over what follows (tongue-in-cheek: strange indeed!

), that can be considered another take on the very same issue (the panel dataset stretches over 15 years):

Code:

. use "https://www.stata-press.com/data/r17/nlswork.dta"
(National Longitudinal Survey of Young Women, 14-24 years old in 1968)

. tab year

  Interview |
       year |      Freq.     Percent        Cum.
------------+-----------------------------------
         68 |      1,375        4.82        4.82
         69 |      1,232        4.32        9.14
         70 |      1,686        5.91       15.05
         71 |      1,851        6.49       21.53
         72 |      1,693        5.93       27.47
         73 |      1,981        6.94       34.41
         75 |      2,141        7.50       41.91
         77 |      2,171        7.61       49.52
         78 |      1,964        6.88       56.40
         80 |      1,847        6.47       62.88
         82 |      2,085        7.31       70.18
         83 |      1,987        6.96       77.15
         85 |      2,085        7.31       84.45
         87 |      2,164        7.58       92.04
         88 |      2,272        7.96      100.00
------------+-----------------------------------
      Total |     28,534      100.00


. bysort idcode: egen wanted=count(idcode)


. tab wanted

     wanted |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        547        1.92        1.92
          2 |        996        3.49        5.41
          3 |      1,452        5.09       10.50
          4 |      1,644        5.76       16.26
          5 |      2,105        7.38       23.63
          6 |      2,388        8.37       32.00
          7 |      2,415        8.46       40.47
          8 |      2,584        9.06       49.52
          9 |      2,718        9.53       59.05
         10 |      2,700        9.46       68.51
         11 |      2,222        7.79       76.30
         12 |      1,896        6.64       82.94
         13 |      1,911        6.70       89.64
         14 |      1,666        5.84       95.48
         15 |      1,290        4.52      100.00
------------+-----------------------------------
      Total |     28,534      100.00

.

Kind regards,
Carlo
(Stata 19.0)

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35724
#8

29 May 2022, 05:35

The identifier that occurs most frequently will often but not necessarily be one present in all years.

In this case, Carlo has checked that it's the same number, so fine.

Code:

bysort idcode : gen wanted = _N

is even more direct.

EDIT: The code had too many colons.

Last edited by Nick Cox; 29 May 2022, 06:21.
1 like
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#9

29 May 2022, 05:54

Nick is clearly right.
I'll use the -colon- in excess at the first -bysort- occasion .

Last edited by Carlo Lazzaro; 29 May 2022, 05:56.

Kind regards,
Carlo
(Stata 19.0)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#10

29 May 2022, 06:23

Everyone needs their software to DWIM. http://www.catb.org/jargon/html/D/DWIM.html
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#11

29 May 2022, 06:49

Great indeed, Nick!
I was not aware of that link.

Kind regards,
Carlo
(Stata 19.0)
Comment
Boli Yi

Join Date: May 2022

Posts: 3
#12

30 May 2022, 00:20

I am sorry for the late reply. Thank you very much, Nick and Carlo. Your advices are really useful. I will apply them into my study. Thanks again!
Comment

Announcement