Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting distinct values cumulatively by group

    Dear Statalist,

    I have a question about constructing count variable. Please refer to my sample data below.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str1 id int year str1 class byte cumulative_count
    "a" 2015 "A" 3
    "a" 2015 "A" 3
    "a" 2015 "B" 3
    "a" 2015 "C" 3
    "a" 2016 "A" 3
    "a" 2016 "A" 3
    "a" 2016 "B" 3
    "a" 2017 "B" 4
    "a" 2017 "D" 4
    "a" 2018 "C" 5
    "a" 2018 "E" 5
    "b" 2015 "A" 1
    "b" 2016 "B" 2
    "b" 2017 "A" 2
    "b" 2017 "B" 2
    "b" 2018 "C" 3
    end
    I want to have a variable 'cumulative_count', which cumulatively counts distinct value of alphabets in 'class' grouped by id.

    For example, id "a" has three distinct alphabets "A, B, C" until 2015. So the value of "cumulative_count" in 2015 should be 3.
    Until 2018, id "a" has five different alphabets "A, B, C, D, E". So the value of "cumulative_count" in 2018 should be 5.

    This might be covered in statalist before. I tried to find the solution from previous threads, but I could not find one.

    Thanks a lot for any advice in advance.
    Last edited by Andrew Choi; 05 Mar 2019, 09:09.

  • #2
    Thanks for the data example.

    Code:
    search distinct
    in Stata throws up (together with some sources not relevant here)

    FAQ . . . . . . . . . . . . . . Calculating the number of distinct values
    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
    9/06 How do I calculate the number of distinct
    values seen so far?
    http://www.stata.com/support/faqs/data-management/calculating-number-of-distinct-values/

    SJ-8-4 dm0042 . . . . . . . . . . . . Speaking Stata: Distinct observations
    (help distinct if installed) . . . . . . N. J. Cox and G. M. Longton
    Q4/08 SJ 8(4):557--568
    shows how to answer questions about distinct observations
    from first principles; provides a convenience command

    The latter is at https://www.stata-journal.com/sjpdf....iclenum=dm0042 (especially Section 5).

    Here is one way to get what you want. The easy principle is to tag each class with 1 the first time it is seen and then cumulate those. Other observations are tagged with 0 (or missing if you prefer) and do not affect the cumulative sum. Getting the sort order right is the awkward part.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str1 id int year str1 class byte cumulative_count
    "a" 2015 "A" 3
    "a" 2015 "A" 3
    "a" 2015 "B" 3
    "a" 2015 "C" 3
    "a" 2016 "A" 3
    "a" 2016 "A" 3
    "a" 2016 "B" 3
    "a" 2017 "B" 4
    "a" 2017 "D" 4
    "a" 2018 "C" 5
    "a" 2018 "E" 5
    "b" 2015 "A" 1
    "b" 2016 "B" 2
    "b" 2017 "A" 2
    "b" 2017 "B" 2
    "b" 2018 "C" 3
    end
    
    bysort id class (year) : gen wanted = _n == 1
    bysort id (year class): replace wanted = sum(wanted)
    by id year: replace wanted = wanted[_N]
    assert wanted == cumulative_count
    list, sepby(id year)
    
        +---------------------------------------+
         | id   year   class   cumula~t   wanted |
         |---------------------------------------|
      1. |  a   2015       A          3        3 |
      2. |  a   2015       A          3        3 |
      3. |  a   2015       B          3        3 |
      4. |  a   2015       C          3        3 |
         |---------------------------------------|
      5. |  a   2016       A          3        3 |
      6. |  a   2016       A          3        3 |
      7. |  a   2016       B          3        3 |
         |---------------------------------------|
      8. |  a   2017       B          4        4 |
      9. |  a   2017       D          4        4 |
         |---------------------------------------|
     10. |  a   2018       C          5        5 |
     11. |  a   2018       E          5        5 |
         |---------------------------------------|
     12. |  b   2015       A          1        1 |
         |---------------------------------------|
     13. |  b   2016       B          2        2 |
         |---------------------------------------|
     14. |  b   2017       A          2        2 |
     15. |  b   2017       B          2        2 |
         |---------------------------------------|
     16. |  b   2018       C          3        3 |
         +---------------------------------------+
    Last edited by Nick Cox; 05 Mar 2019, 10:26.

    Comment


    • #3
      Thanks a lot for your kind reply, Nick!

      Comment

      Working...
      X