Creating cumulative variables in long format using by / bysort

Jonathan Horowitz

Join Date: Apr 2015

Posts: 102
#1

Creating cumulative variables in long format using by / bysort

04 May 2021, 12:20

I recently read Nick Cox's article on Identifying Spells (https://www.stata-journal.com/articl...article=dm0029). It's really interesting. I figured that you could alter it pretty easily to do cumulative variables.

I think you can capture total the amount of times something happened over the entire study using by modifying this code in section 7.2:

Code:

by id spell (year), sort: gen length = year[_N] - year[1] + 1

To the following:

Code:

by id (year), sort: gen cumulative_x = sum(X)

This would also, I think, be equivalent to using -egen, rowtotal in wide format.

Let's say that we wanted to sum the values of X up until the present time. In other words, to create a cumulative variable, summing the amount of X for each person up to the current time. So for example, imagine if a study started in 1990 and ended in 2010. At year=1999, we would sum the values from 1990 to 1998. At 2001, you would sum from 1990 to 2000. At 2005, you would sum from 1990 to 2004.

To put it a little more formally, I'd want to sum from year=1990 to year=_n-1, by id. I think this is what the sum() function was designed for, but maybe there's a better way to do this.

Thanks for all of your help.

EDIT: I think it's a rule that I find a relevant thread the moment I post something to Statalist. Here it is https://www.statalist.org/forums/for...ative-variable.

This suggests that sum(X) is the right way to generate the variable from the beginning to _n, but not creating a constant for each ID? Is that correct?

Last edited by Jonathan Horowitz; 04 May 2021, 12:31.
Tags: Generate, spells
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#2

04 May 2021, 12:36

Try

Code:

by id (year), sort: gen cumulative_x = sum(X) - X
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#3

04 May 2021, 12:38

This would also, I think, be equivalent to using -egen, rowtotal in wide format.

No. -egen rowtotal- creates only a single result which is the grand total of the variables specified; it does not create a running (cumulative) sum. To my knowledge, there is no -egen- function that is analogous to -gen ... sum()-.
Comment
Jonathan Horowitz

Join Date: Apr 2015

Posts: 102
#4

04 May 2021, 12:50

Originally posted by Clyde Schechter View Post

No. -egen rowtotal- creates only a single result which is the grand total of the variables specified; it does not create a running (cumulative) sum. To my knowledge, there is no -egen- function that is analogous to -gen ... sum()-.

So the issue is that I appear to have misunderstood what gen, sum() does. I thought it created something equivalent to egen, rowtotal. But it appears that -gen, sum()- is what I want after all.

The reason for my confusion is that, as far as I can tell, gen, sum() does something different than egen, sum(). Some internet sleuthing suggests that egen, sum() isn't used anymore specifically because it got confused with gen, sum().

So, for running / cumulative totals, use gen, sum(). For constants, use egen, rowtotal in wide format or egen, sum (and perhaps, egen, total) in long format. As far as I can tell.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#5

04 May 2021, 12:56

So, for running / cumulative totals, use gen, sum(). For constants, use egen, rowtotal in wide format or egen, sum (and perhaps, egen, total) in long format.

Yes for running/cumulative totals, the data should be in long layout and you use -gen ... sum()-.

-egen, total()- produces a grand total (constant within -by()- groups) from long layout data. -egen, sum()- is identical in functioning to -egen, total()- but should be avoided because the name lends itself to confusion with -gen...sum()-, which does something different. If it were up to me, I would eliminate -egen, sum()- altogether (although maintain it under version control so as to not break legacy code.)

-egen, rowtotal()- works with wide data and produces a grand total. It is, in a sense, analogous to -egen, total()- with long data. But there are important differences. The argument specified in rowtotal() can be only a list of variables, where as the argument specified in total() can be any valid Stata expression. For example, with long data you can use -egen, total()- to calculate the total of something like var1² + var2². -egen, rowtotal()- cannot do that: you would first have to create new variables to hold the squares of var1 and var2 and then apply -egen, rowtotal()- to them. So -egen, total()- offers you greater flexibility. This is just one of the smallest ways in which working with long data is better than working with wide data for most purposes in Stata.

Last edited by Clyde Schechter; 04 May 2021, 13:00.
1 like
Comment
Jonathan Horowitz

Join Date: Apr 2015

Posts: 102
#6

04 May 2021, 13:15

I do have one more question. Let's say I wanted to lag it by a year or two. Is there a way to get -gen, sum()- up until _n-1 or _n-2 and not _n? Would it just be gen, sum(X[_n-1])?

Thanks for your time.

Last edited by Jonathan Horowitz; 04 May 2021, 13:18.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#7

04 May 2021, 13:55

There are a few things you can do. You can create the usual running sum and then subtract X (and for a two year lag X[_n-1]) from the result.

Or you can -gen wanted = sum(x[_n-1]) for a 1 year lag, or sum(x[_n-2]) for a 2 year lag. Better still, since gaps in the data can make x[_n-1] an incorrect proxy for the actual lagged value, a safer way would be to first -xtset- the data with a group variable and a time variable, and then -gen wanted = sum(L1.x)- or -gen wanted = sum(L2.x)-.
1 like
Comment
Jonathan Horowitz

Join Date: Apr 2015

Posts: 102
#8

04 May 2021, 13:58

sum(L1.x) is perfect. Thank you!
Comment

Announcement

Creating cumulative variables in long format using by / bysort

Comment

Comment

Comment

Comment

Comment

Comment

Comment