
  • Alternative to -collapse- command

    I am using Stata 14.0 with 25 million observations, and I'd like to collapse 300 "x" variables (mostly sums, but some means) for about 12 different combinations of "v" by-variables, but a single -collapse- command takes several hours. I realize I could run the loop below for one combination of "v" by-variables, but the bysorts take a long time as well. I'd appreciate any suggestions.

    Code:
    foreach x in varlist x1-x300 {
        bysort v1 v2: gen temp = sum(`x')
        bysort v1 v2: egen `x'_sum = max(temp)
        drop temp
    }
    egen tag_1_sum = tag(v1 v2)
    keep if tag_1_sum == 1
    sav "data1.dta", replace

    Thanks,
    Brent Fulton
    UC Berkeley

  • #2
    egen is convenient but moderately slow. Turning to it is likely to worsen your problem, not to improve it.

    Try

    Code:
    viewsource egen.ado
    viewsource _gmax.ado
    viewsource _gtag.ado

    to see that -egen- alone consists of several commands, and that the max() and tag() functions also entail calling several commands.

    Your -foreach- call uses in where of is needed: it should be foreach x of varlist x1-x300.

    I suspect that this would be faster than your draft, but I can't promise that it would be faster than collapse.

    Code:
    sort v1 v2
    forval j = 1/300 {
       by v1 v2: gen double x`j'sum = sum(x`j')
    }
    keep v1 v2 x*sum
    bysort v1 v2: keep if _n == _N
    Note that there is no need to fire up any command, or create any extra variables, to hold the sums. They are already there in the last observation in each group. Similarly the tagging is unnecessary.
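    A toy example with made-up data shows the point:

    Code:
    * made-up data: the last running-sum value in each group is the group total
    clear
    input v1 x
    1 10
    1 20
    2  5
    2  7
    end
    sort v1
    by v1: gen double xsum = sum(x)
    by v1: keep if _n == _N
    list    // xsum is 30 for v1 == 1 and 12 for v1 == 2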

    EDIT: I took a key improvement, deleting an unnecessary line, from Clyde's post.
    Last edited by Nick Cox; 02 Dec 2015, 18:10.



    • #3
      With 25 million observations and 300 x variables, I think you're in for a long haul no matter what you do. Nonetheless, there are some marginal efficiencies to be gained by modifying your code.

      But first, let me question something. The narrative in your paragraph says you want to calculate sums, but that is not what your -foreach- loop does. Or, more accurately, it computes sums only if all non-missing values of the x variables are >= 0. I'm going to disregard your code and assume you really want to calculate sums. The key here is that the -gen- function sum() calculates a running sum, so the total for the by-group appears in the last observation of the by-group. That being so, there is no need to do any more calculation after the -gen ... sum()- command. You just need to keep the last observation, which also does not require creating a tag.

      Code:
      sort v1 v2
      foreach x of varlist x1-x300 {
          by v1 v2: gen `x'_sum = sum(`x')
      }
      by v1 v2: keep if _n == _N
      save data1, replace
      NOTE: While I normally prefer the transparency of the -egen- functions, they do contain a fair amount of overhead. So when the data set is huge and time is at a premium, it makes sense to go back to the underlying -gen- statements that -egen- is a wrapper for. Also, sorting only once on v1 and v2 should be sufficient. I'm not sure it makes a time difference because I would guess that bysort is smart enough to not re-sort if the data are already sorted, but I don't really know.

      That said, I don't know if this will save you a whole lot of time, but it will be somewhat faster. I can't see any way to do it faster than this, but perhaps somebody else who does will chime in.

      Added after seeing Nick's post: my code is almost identical to his. There may be better performance with -forvalues- than with -foreach-, so in that respect his code is better than mine. And casting the sum variables as doubles is a very good idea, too--it guards against rounding errors during cumulative summation. On the other hand, the -replace- command inside Nick's loop isn't necessary, because only the _N'th observation is going to be retained anyway.
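      To illustrate the doubles point with a minimal check: a float cannot store integers above 2^24 = 16,777,216 exactly, and with 25 million observations a sum can easily pass that.

      Code:
      * minimal check: float storage rounds large integers, double does not
      clear
      set obs 1
      gen double big = 2^24 + 1    // 16,777,217
      gen float  f   = big         // rounded to the nearest representable float
      gen double d   = big         // stored exactly
      format f d %12.0f
      list                         // f = 16777216, d = 16777217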
      Last edited by Clyde Schechter; 02 Dec 2015, 18:11.



      • #4
        Hi Nick and Clyde,

        Thank you for your recommendations. Clyde, thank you for catching my sum error, which may lead to another possibility. My x1-x300 variables are mostly 0-1 binary variables. For the time being, let's assume all are. I think the following would work, but it may take longer with -egen-. Any way to avoid -egen-?

        Code:
        foreach x of varlist x1-x300 {
            bysort v1 v2 `x': gen temp = `x' * _N    /* so temp is 0 wherever `x' is 0 */
            bysort v1 v2: egen `x'_sum = max(temp)
            drop temp
        }
        by v1 v2: keep if _n == _N


        Best,
        Brent



        • #5
          I don't think your calculation works as you intend. It is true that for x = 0, `x'*_N = 0 as well. But when `x' = 1, `x'*_N is _N. So temp will always be either 0 or _N. Then when you take the max value of temp within the by-group, it will always be _N, unless `x' = 0 throughout the by-group.

          I don't think there is any way to use the fact that `x' is always 0 or 1 to speed up the calculation of a sum beyond what Nick and I came up with in #2 and #3. The fact that `x' = 0 or 1 does open up other ways to calculate the sum, but all the ones I can think of are actually slower; one is sketched below.
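          For instance, -egen- with its total() function gets the sum in one line, but it brings back the overhead discussed in #2 (a sketch for a single variable):

          Code:
          * simple but slower alternative for one variable: egen carries extra overhead
          bysort v1 v2: egen x1_sum = total(x1)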



          • #6
            Thank you, Clyde, I will use your and Nick's original recommendations.
            Best,
            Brent



            • #7
              It was late my time when I last contributed to this thread, and now it's morning where I am.

              I don't understand the point Clyde raised about the sum. If there are negative values, then their absolute values are subtracted from the sum so far. That's what a sum is. The sum of 42, 0, -42 is 0. If you want something else, e.g. the sum of absolute values, you need to spell that out and the code will change.



              • #8
                @Nick. In the code in #1, if the values of x were 42, 0, and -42, then the corresponding values of temp would be 42, 42, and 0. Then x_sum, chosen as the max of temp, would be 42, which is different from the sum, 0. But if all values of x were non-negative, then the maximum value of the running total would indeed be the final value and would equal the sum.
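                The same check in code, for a single group:

                Code:
                * the running-sum-then-max approach from #1 vs the true sum
                clear
                input x
                 42
                  0
                -42
                end
                gen runsum = sum(x)         // 42, 42, 0
                egen x_max = max(runsum)    // 42: what the code in #1 keeps
                display runsum[_N]          // 0: the actual sum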



                • #9
                  Clyde: Thanks for that. It's a comment on the tacit assumption in Brent's code. That really was clear in your original post, as I now see. Sorry for the needless diversion.



                  • #10
                    Super-kludgy, but maybe worth trying -Rsource- (from SSC). And just to note, -Rsource- does require a little setup; a setup sketch follows the timings below.

                    Code:
                    * clear all
                    rsource, terminator(END_OF_R) roptions(--slave)
                    
                    ######################################################################
                    
                        # Start R code
                        # install.packages("nycflights13")
                        # install.packages("dplyr")
                        # install.packages("foreign")
                        library(nycflights13)
                        library(dplyr)
                        ptm <- proc.time()
                        summarise(flights, delay = median(distance, na.rm = TRUE))
                        proc.time() - ptm
                        fligths <- flights
                        write.csv(flights, file="nyflights13.csv")
                        
                        # End R code
                    
                    ######################################################################
                    
                    END_OF_R
                    
                    import delimited "/User/XYXY/nyflights13.csv", encoding(ISO-8859-1) clear
                    
                    timer on 1
                    collapse (median) distance
                    timer off 1
                    
                    timer list
                    So it looks like ca. 0.440 vs 0.7950 seconds; but for (sum) and (mean) Stata is faster than R... so your results may vary!
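                    On that setup: -Rsource- has to be pointed at the R executable, which is done with globals (the path below is hypothetical; adjust for your machine):

                    Code:
                    * one-time setup for -Rsource-: tell it where R lives (hypothetical path)
                    global Rterm_path    "/usr/local/bin/R"
                    global Rterm_options "--vanilla"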
                    Last edited by Andrew Lover; 03 Dec 2015, 20:48.
                    __________________________________________________
                    Assistant Professor, Department of Biostatistics and Epidemiology
                    School of Public Health and Health Sciences
                    University of Massachusetts- Amherst



                    • #11
                      The line fligths <- flights in #10 should be:

                      Code:
                      flights <- flights
                      __________________________________________________
                      Assistant Professor, Department of Biostatistics and Epidemiology
                      School of Public Health and Health Sciences
                      University of Massachusetts- Amherst

