Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cumulative frequency of unique values within groups

    Hello statalist,

    My dataset is an unbalanced time series (with repeated years). Each company can have several classifications per year, ranging from 1 to * (i.e., there can be hundreds). Below is a snapshot of the dataset for 1 company with classifications A, B, C

    company year classification
    1 1985 A
    1 1991 A
    1 1993 B
    1 1993 B
    1 1993 B
    1 1994 A
    1 1994 A
    1 1994 B
    1 1994 B
    1 1994 B
    1 1994 C
    1 1994 C
    1 1994 C
    1 1994 C
    1 1994 C

    I would like to create cumulative frequency for the classifications by company and year, however, I would like the count to only sum the same values of classification.

    I started by creating a frequency var:
    . bysort company year classification: gen freq = _N
    Also created a tag:
    . egen tag = tag(company year classification)

    company Year classification freq tag
    1 1985 A 1 1
    1 1991 A 1 1
    1 1993 B 3 1
    1 1993 B 3 0
    1 1993 B 3 0
    1 1994 A 2 1
    1 1994 A 2 0
    1 1994 B 3 1
    1 1994 B 3 0
    1 1994 B 3 0
    1 1994 C 5 1
    1 1994 C 5 0
    1 1994 C 5 0
    1 1994 C 5 0
    1 1994 C 5 0

    Now, summing the freq var by(company year classification) with egen does not produce the cumulative frequency I'm looking for. I would like the cumulative count to refer to the same type of class only. If I call this var 'cum', then this is what I am looking for:

    company Year classification freq tag cum
    1 1985 A 1 1 1
    1 1991 A 1 1 2
    1 1993 B 3 1 3
    1 1993 B 3 0 3
    1 1993 B 3 0 3
    1 1994 A 2 1 4
    1 1994 A 2 0 4
    1 1994 B 3 1 6
    1 1994 B 3 0 6
    1 1994 B 3 0 6
    1 1994 C 5 1 5
    1 1994 C 5 0 5
    1 1994 C 5 0 5
    1 1994 C 5 0 5
    1 1994 C 5 0 5

    How can I create such cumulative var?

    Thank you.

  • #2
    Does this help? Note the use of dataex.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte company int year str1 classification byte(freq tag cum)
    1 1985 "A" 1 1 1
    1 1991 "A" 1 1 2
    1 1993 "B" 3 1 3
    1 1993 "B" 3 0 3
    1 1993 "B" 3 0 3
    1 1994 "A" 2 1 4
    1 1994 "A" 2 0 4
    1 1994 "B" 3 1 6
    1 1994 "B" 3 0 6
    1 1994 "B" 3 0 6
    1 1994 "C" 5 1 5
    1 1994 "C" 5 0 5
    1 1994 "C" 5 0 5
    1 1994 "C" 5 0 5
    1 1994 "C" 5 0 5
    end
    
    bysort tag company class (year) : gen wanted = sum(freq) if tag
    bysort  company class year (tag) : replace wanted = wanted[_N] if missing(wanted)

    Comment


    • #3
      Thank you very much Nick.

      The first line of code created what I needed.

      A follow-up question if I may. If I wanted to create the same cumulative variable (i.e., 'wanted'), but in a 5-year moving window, how should I go about it? The dataset is an unbalanced time series with repeated time values, where (different) year-gaps exist for many companies, sometimes gaps larger than 5 years, such as the example in my original post (1985 and then 1991).

      Can my dataset be used as is? Is there maybe a loop I can write to create the variable?

      Thanks again.

      Comment


      • #4
        If you want sums in 5 year windows, use (e.g.) rangestat from SSC. The same principles apply, use if tag to ensure that distinct (*) values are only used once in the calculation.

        (*) For why "distinct" is a much better term than "unique" see Section 2 of https://www.stata-journal.com/sjpdf....iclenum=dm0042

        Comment


        • #5
          Thank you for the reference.
          I will look into rangestat.

          Comment


          • #6
            Hello,

            I hope it is ok to add another question as it relates to my original one.

            After using the code you suggested (the var 'cum_class' in the table below, is the 'wanted' var in the code:
            Code:
            bysort tag company class (year) : gen wanted = sum(freq) if tag
            ), and adding a cumulative count of total classes per year ('cum_total'), I get:

            company year class freq tag cum_class cum_total
            1 1985 A 1 1 1 1
            1 1991 A 1 1 2 2
            1 1993 B 3 1 3 5
            1 1993 B 3 0 3 5
            1 1993 B 3 0 3 5
            1 1994 A 2 1 4 15
            1 1994 A 2 0 4 15
            1 1994 B 3 1 6 15
            1 1994 B 3 0 6 15
            1 1994 B 3 0 6 15
            1 1994 C 5 1 5 15
            1 1994 C 5 0 5 15
            1 1994 C 5 0 5 15
            1 1994 C 5 0 5 15
            1 1994 C 5 0 5 15

            I would like to calculate a cumulative frequency percent that includes all classes up to each year (and then square it). My problem is that in years where not all types of class appear, simple cumulative percent calculation does not take into account absent classes.

            If we take year 1993 for example, cumulative frequency% (squared) will be 3/5 (0.62), however, what I'm looking for is for all classes up to that year to be taken into account. In which case, for year 1993 I will get: (3/5)2 (for class B) plus (2/5)2 (for class A, from previous years). In year 1994 where all classes appear, this isn't a problem and I can simply sum the values with:

            Code:
            bysort company year: gen sum_pc = sum(cum_pc) if tag==1
            Company year class freq Tag cum_class cum_total cum_pc sum_pc
            1 1985 A 1 1 1 1 1 1
            1 1991 A 1 1 2 2 1 1
            1 1993 B 3 1 3 5 0.36 0.36
            1 1993 B 3 0 3 5 0.36 0.36
            1 1993 B 3 0 3 5 0.36 0.36
            1 1994 A 2 1 4 15 0.0711 0.342
            1 1994 A 2 0 4 15 0.0711 0.342
            1 1994 B 3 1 6 15 0.16 0.342
            1 1994 B 3 0 6 15 0.16 0.342
            1 1994 B 3 0 6 15 0.16 0.342
            1 1994 C 5 1 5 15 0.111 0.342
            1 1994 C 5 0 5 15 0.111 0.342
            1 1994 C 5 0 5 15 0.111 0.342
            1 1994 C 5 0 5 15 0.111 0.342
            1 1994 C 5 0 5 15 0.111 0.342

            I thought maybe I needed to add rows for years such as 1993 that will display classes from previous years with zero frequency for that year:

            company year class freq Tag cum_perclass cum_total cum% sum_%
            1 1985 A 1 1 1 1 1 1
            1 1991 A 1 1 2 2 1 1
            1 1993 A 0 1 2 5 0.16 0.52
            1 1993 B 3 1 3 5 0.36 0.52
            1 1993 B 3 0 3 5 0.36 0.52
            1 1993 B 3 0 3 5 0.36 0.52

            However, I'm not sure how to do that, and maybe there is a simpler more efficient way to achieve this?
            (especially if the number of classes is much larger than 3 presented here)

            Thank you again for your help.

            Comment

            Working...
            X