
  • Building a fraction variable cumulatively

    Dear all users of Statalist,

    I have a question about how to build a variable of cumulative fractions. Below is my sample data:

    Code:
    clear
    input byte id int year str1 class
    1 2015 "a"
    1 2015 "a"
    1 2015 "a"
    1 2015 "b"
    1 2015 "b"
    1 2015 "c"
    1 2016 "a"
    1 2016 "a"
    1 2016 "b"
    1 2016 "b"
    1 2016 "c"
    1 2016 "d"
    1 2016 "d"
    1 2017 "a"
    1 2017 "a"
    1 2017 "a"
    1 2017 "d"
    1 2017 "d"
    1 2017 "e"
    1 2018 "b"
    1 2018 "e"
    1 2018 "f"
    end
    Below is the result that I want to get from the sample data.

    Code:
    clear
    input byte id int year str1 class byte(cum_appearance cum_total) float cumulative_share
    1 2015 "a" 3  6        .5
    1 2015 "b" 2  6  .3333333
    1 2015 "c" 1  6 .16666667
    1 2016 "a" 5 13  .3846154
    1 2016 "b" 4 13  .3076923
    1 2016 "c" 2 13 .15384616
    1 2016 "d" 2 13 .15384616
    1 2017 "a" 8 19  .4210526
    1 2017 "b" 4 19  .2105263
    1 2017 "c" 2 19 .10526316
    1 2017 "d" 4 19  .2105263
    1 2017 "e" 1 19 .05263158
    1 2018 "a" 8 22  .3636364
    1 2018 "b" 5 22 .22727273
    1 2018 "c" 2 22  .0909091
    1 2018 "d" 4 22  .1818182
    1 2018 "e" 2 22  .0909091
    1 2018 "f" 1 22 .04545455
    end
    That is, for each "id", my goal is to build a variable "cumulative_share", which is the cumulative fraction of each letter in "class".

    For example, there are 19 observations from 2015 to 2017, and "a" appears 8 times over the same period, so the "share" of "a" in 2017 must be 8/19.
    By the same logic, the share of "a" in 2018 must be 8/22.

    The problem is that some letters can be missing in a given year.
    For example, "a" and "c" do not appear in 2018, but their cumulative fractions should still be calculated for 2018, so that cumulative_share sums to 1 within each year.
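    For concreteness, the target numbers above can be reproduced outside Stata; here is the same computation in plain Python (using the sample data above):

```python
# Cross-check of the desired cumulative_share values: count class
# appearances cumulatively across years and divide by the cumulative
# number of observations.
from collections import Counter

# sample data: (year, class) pairs for id == 1, copied from the example
data = [
    (2015, "a"), (2015, "a"), (2015, "a"), (2015, "b"), (2015, "b"), (2015, "c"),
    (2016, "a"), (2016, "a"), (2016, "b"), (2016, "b"), (2016, "c"),
    (2016, "d"), (2016, "d"),
    (2017, "a"), (2017, "a"), (2017, "a"), (2017, "d"), (2017, "d"), (2017, "e"),
    (2018, "b"), (2018, "e"), (2018, "f"),
]

counts = Counter()   # cumulative appearances per class
total = 0            # cumulative number of observations
result = {}          # (year, class) -> cumulative_share
for year in sorted({y for y, _ in data}):
    rows = [c for y, c in data if y == year]
    counts.update(rows)
    total += len(rows)
    # every class seen so far gets a share this year, even if absent now
    for cls in sorted(counts):
        result[(year, cls)] = counts[cls] / total

print(result[(2017, "a")])  # 8/19
print(result[(2018, "a")])  # 8/22
```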

    I have tried to solve this with a for-loop but could not find a solution.
    I would really appreciate any sage advice on my Stata issue.

  • #2
    No loops needed:

    Code:
    by id year, sort: gen cum_total = _N if _n == 1
    by id: replace cum_total = sum(cum_total)
    
    by id class year, sort: gen cum_appearances = _N if _n == 1
    by id class: replace cum_appearances = sum(cum_appearances)
    
    gen share = cum_appearances/cum_total
    
    by id year class, sort: keep if _n == 1
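    The key trick above is that Stata's sum() is a running sum: each group's size is recorded once on its first observation, then carried forward cumulatively within id. A minimal plain-Python sketch of that idea (year sizes taken from the sample data; not Stata):

```python
# Per-year group sizes (_N for each id-year group in the sample data),
# then a running sum to get the cumulative total, mirroring
#   by id: replace cum_total = sum(cum_total)
year_sizes = {2015: 6, 2016: 7, 2017: 6, 2018: 3}

cum_total = {}
running = 0
for year in sorted(year_sizes):
    running += year_sizes[year]   # Stata's sum() is a running sum
    cum_total[year] = running

print(cum_total)  # {2015: 6, 2016: 13, 2017: 19, 2018: 22}
```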

    • #3
      You can use the fillin command to create observations for years in which a class does not appear. You can then use rangestat (from SSC) to calculate the number of appearances over a cumulative window of time (all observations up to and including the current year). You then drop the filled-in observations that did not contribute to the count and reduce the data to one observation per group of id year class.

      Code:
      * create filler obs for when class is missing
      gen touse = 1
      fillin id class year
      
      * the cumulative total
      rangestat (count) ctot = touse, interval(year . 0) by(id)
      
      * the cumulative total per class
      rangestat (count) ctotc = touse, interval(year . 0) by(id class)
      
      * reduce to one observation per -id class year-
      drop if ctotc == 0
      bysort id year class: keep if _n == 1
      
      * calculate the desired measure
      gen cshare = ctotc / ctot
      
      list, sepby(id year)
      and the list results:
      Code:
      . list, sepby(id year)
      
           +---------------------------------------------------------------+
           | id   year   class   touse   _fillin   ctot   ctotc     cshare |
           |---------------------------------------------------------------|
        1. |  1   2015       a       1         0      6       3         .5 |
        2. |  1   2015       b       1         0      6       2   .3333333 |
        3. |  1   2015       c       1         0      6       1   .1666667 |
           |---------------------------------------------------------------|
        4. |  1   2016       a       1         0     13       5   .3846154 |
        5. |  1   2016       b       1         0     13       4   .3076923 |
        6. |  1   2016       c       1         0     13       2   .1538462 |
        7. |  1   2016       d       1         0     13       2   .1538462 |
           |---------------------------------------------------------------|
        8. |  1   2017       a       1         0     19       8   .4210526 |
        9. |  1   2017       b       .         1     19       4   .2105263 |
       10. |  1   2017       c       .         1     19       2   .1052632 |
       11. |  1   2017       d       1         0     19       4   .2105263 |
       12. |  1   2017       e       1         0     19       1   .0526316 |
           |---------------------------------------------------------------|
       13. |  1   2018       a       .         1     22       8   .3636364 |
       14. |  1   2018       b       1         0     22       5   .2272727 |
       15. |  1   2018       c       .         1     22       2   .0909091 |
       16. |  1   2018       d       .         1     22       4   .1818182 |
       17. |  1   2018       e       1         0     22       2   .0909091 |
       18. |  1   2018       f       1         0     22       1   .0454545 |
           +---------------------------------------------------------------+
      
      .
      I'm sure that Clyde's approach can be retrofitted to the filled-in data.
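      For anyone reading along, the fillin-plus-cumulative-count logic can also be sketched with pandas (the variable names are my own; this mirrors the idea rather than the exact Stata commands):

```python
# Sketch of the fillin / rangestat logic with pandas: pivot to a full
# id x year x class grid (zeros where a class is absent), take
# cumulative counts, then drop cells whose class has not appeared yet.
import pandas as pd

df = pd.DataFrame(
    {"year": [2015]*6 + [2016]*7 + [2017]*6 + [2018]*3,
     "class": list("aaabbc") + list("aabbcdd") + list("aaadde") + list("bef")}
)
df["id"] = 1

# like -fillin-: unstack rectangularizes, filling gaps with 0
counts = (df.groupby(["id", "year", "class"]).size()
            .unstack(fill_value=0))          # rows: (id, year); cols: class

cum = counts.groupby(level="id").cumsum()    # cumulative appearances
ctot = cum.sum(axis=1)                       # cumulative total per (id, year)
share = cum.div(ctot, axis=0)

# like -drop if ctotc == 0-: keep only classes that have appeared so far
long = share.stack()
long = long[cum.stack() > 0]
print(long.loc[(1, 2018)])
```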


      • #4
        Thanks a lot for your helpful solutions!
