
  • Building a fraction variable cumulatively

    Dear all users of Statalist,

    I have a question about how to build a variable of cumulative fractions. Below is my sample data:

    Code:
    clear
    input byte id int year str1 class
    1 2015 "a"
    1 2015 "a"
    1 2015 "a"
    1 2015 "b"
    1 2015 "b"
    1 2015 "c"
    1 2016 "a"
    1 2016 "a"
    1 2016 "b"
    1 2016 "b"
    1 2016 "c"
    1 2016 "d"
    1 2016 "d"
    1 2017 "a"
    1 2017 "a"
    1 2017 "a"
    1 2017 "d"
    1 2017 "d"
    1 2017 "e"
    1 2018 "b"
    1 2018 "e"
    1 2018 "f"
    end
    Below is the result that I want to get from the sample data.

    Code:
    clear
    input byte id int year str1 class byte(cum_appearance cum_total) float cumulative_share
    1 2015 "a" 3  6        .5
    1 2015 "b" 2  6  .3333333
    1 2015 "c" 1  6 .16666667
    1 2016 "a" 5 13  .3846154
    1 2016 "b" 4 13  .3076923
    1 2016 "c" 2 13 .15384616
    1 2016 "d" 2 13 .15384616
    1 2017 "a" 8 19  .4210526
    1 2017 "b" 4 19  .2105263
    1 2017 "c" 2 19 .10526316
    1 2017 "d" 4 19  .2105263
    1 2017 "e" 1 19 .05263158
    1 2018 "a" 8 22  .3636364
    1 2018 "b" 5 22 .22727273
    1 2018 "c" 2 22  .0909091
    1 2018 "d" 4 22  .1818182
    1 2018 "e" 2 22  .0909091
    1 2018 "f" 1 22 .04545455
    end
    That is, for each "id", my goal is to build a variable "cumulative_share", which is the cumulative fraction of each letter in "class".

    For example, there are 19 observations from 2015 to 2017, and "a" appears 8 times over the same period, so the "share" of "a" in 2017 must be 8/19.
    By the same logic, the share of "a" in 2018 must be 8/22.

    The problem is that some letters can be missing in a given year.
    For example, "a" and "c" do not appear in 2018, but their cumulative fractions should still be calculated for 2018, so that cumulative_share sums to 1 within each year.
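    For concreteness, the target numbers above can be reproduced outside Stata; here is the same computation in plain Python (using the sample data above):

```python
# Cross-check of the desired cumulative_share values: count class
# appearances cumulatively across years and divide by the cumulative
# number of observations.
from collections import Counter

# sample data: (year, class) pairs for id == 1, copied from the example
data = [
    (2015, "a"), (2015, "a"), (2015, "a"), (2015, "b"), (2015, "b"), (2015, "c"),
    (2016, "a"), (2016, "a"), (2016, "b"), (2016, "b"), (2016, "c"),
    (2016, "d"), (2016, "d"),
    (2017, "a"), (2017, "a"), (2017, "a"), (2017, "d"), (2017, "d"), (2017, "e"),
    (2018, "b"), (2018, "e"), (2018, "f"),
]

counts = Counter()   # cumulative appearances per class
total = 0            # cumulative number of observations
result = {}          # (year, class) -> cumulative_share
for year in sorted({y for y, _ in data}):
    rows = [c for y, c in data if y == year]
    counts.update(rows)
    total += len(rows)
    # every class seen so far gets a share this year, even if absent now
    for cls in sorted(counts):
        result[(year, cls)] = counts[cls] / total

print(result[(2017, "a")])  # 8/19
print(result[(2018, "a")])  # 8/22
```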

    I have tried to solve this with a for-loop but could not find a solution.
    I would really appreciate any sage advice on my Stata issue.

  • #2
    No loops needed:

    Code:
    by id year, sort: gen cum_total = _N if _n == 1
    by id: replace cum_total = sum(cum_total)
    
    by id class year, sort: gen cum_appearances = _N if _n == 1
    by id class: replace cum_appearances = sum(cum_appearances)
    
    gen share = cum_appearances/cum_total
    
    by id year class, sort: keep if _n == 1
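    The key trick above is that Stata's sum() is a running sum: each group's size is recorded once on its first observation, then carried forward cumulatively within id. A minimal plain-Python sketch of that idea (year sizes taken from the sample data; not Stata):

```python
# Per-year group sizes (_N for each id-year group in the sample data),
# then a running sum to get the cumulative total, mirroring
#   by id: replace cum_total = sum(cum_total)
year_sizes = {2015: 6, 2016: 7, 2017: 6, 2018: 3}

cum_total = {}
running = 0
for year in sorted(year_sizes):
    running += year_sizes[year]   # Stata's sum() is a running sum
    cum_total[year] = running

print(cum_total)  # {2015: 6, 2016: 13, 2017: 19, 2018: 22}
```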

    • #3
      You can use the fillin command to create observations for years in which a class does not appear. You can then use rangestat (from SSC) to calculate the number of appearances over a cumulative window of time (all observations up to and including the current year). You then drop the filled-in observations that did not contribute to the count and reduce the data to one observation per group of id year class.

      Code:
      * create filler obs for when class is missing
      gen touse = 1
      fillin id class year
      
      * the cumulative total
      rangestat (count) ctot = touse, interval(year . 0) by(id)
      
      * the cumulative total per class
      rangestat (count) ctotc = touse, interval(year . 0) by(id class)
      
      * reduce to one observation per -id class year-
      drop if ctotc == 0
      bysort id year class: keep if _n == 1
      
      * calculate the desired measure
      gen cshare = ctotc / ctot
      
      list, sepby(id year)
      and the list results:
      Code:
      . list, sepby(id year)
      
           +---------------------------------------------------------------+
           | id   year   class   touse   _fillin   ctot   ctotc     cshare |
           |---------------------------------------------------------------|
        1. |  1   2015       a       1         0      6       3         .5 |
        2. |  1   2015       b       1         0      6       2   .3333333 |
        3. |  1   2015       c       1         0      6       1   .1666667 |
           |---------------------------------------------------------------|
        4. |  1   2016       a       1         0     13       5   .3846154 |
        5. |  1   2016       b       1         0     13       4   .3076923 |
        6. |  1   2016       c       1         0     13       2   .1538462 |
        7. |  1   2016       d       1         0     13       2   .1538462 |
           |---------------------------------------------------------------|
        8. |  1   2017       a       1         0     19       8   .4210526 |
        9. |  1   2017       b       .         1     19       4   .2105263 |
       10. |  1   2017       c       .         1     19       2   .1052632 |
       11. |  1   2017       d       1         0     19       4   .2105263 |
       12. |  1   2017       e       1         0     19       1   .0526316 |
           |---------------------------------------------------------------|
       13. |  1   2018       a       .         1     22       8   .3636364 |
       14. |  1   2018       b       1         0     22       5   .2272727 |
       15. |  1   2018       c       .         1     22       2   .0909091 |
       16. |  1   2018       d       .         1     22       4   .1818182 |
       17. |  1   2018       e       1         0     22       2   .0909091 |
       18. |  1   2018       f       1         0     22       1   .0454545 |
           +---------------------------------------------------------------+
      
      .
      I'm sure that Clyde's approach can be retrofitted to the filled-in data.
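      For anyone reading along, the fillin-plus-cumulative-count logic can also be sketched with pandas (the variable names are my own; this mirrors the idea rather than the exact Stata commands):

```python
# Sketch of the fillin / rangestat logic with pandas: pivot to a full
# id x year x class grid (zeros where a class is absent), take
# cumulative counts, then drop cells whose class has not appeared yet.
import pandas as pd

df = pd.DataFrame(
    {"year": [2015]*6 + [2016]*7 + [2017]*6 + [2018]*3,
     "class": list("aaabbc") + list("aabbcdd") + list("aaadde") + list("bef")}
)
df["id"] = 1

# like -fillin-: unstack rectangularizes, filling gaps with 0
counts = (df.groupby(["id", "year", "class"]).size()
            .unstack(fill_value=0))          # rows: (id, year); cols: class

cum = counts.groupby(level="id").cumsum()    # cumulative appearances
ctot = cum.sum(axis=1)                       # cumulative total per (id, year)
share = cum.div(ctot, axis=0)

# like -drop if ctotc == 0-: keep only classes that have appeared so far
long = share.stack()
long = long[cum.stack() > 0]
print(long.loc[(1, 2018)])
```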


      • #4
        Thanks a lot for your helpful solutions!
