
  • Alternative to -collapse- command

    I am using Stata 14.0 with 25 million observations, and I'd like to collapse 300 "x" variables (mostly sums, but some means) for about 12 different combinations of "v" by-variables, but a single -collapse- command takes several hours. I realize I could run the loop below for one combination of "v" by-variables, but the bysorts take a long time as well. I'd appreciate any suggestions.

    Code:
    foreach x in varlist x1-x300 {
        bysort v1 v2: gen temp = sum(`x')
        bysort v1 v2: egen `x'_sum = max(temp)
        drop temp
    }
    egen tag_1_sum = tag(v1 v2)
    keep if tag_1_sum == 1
    sav "data1.dta", replace

    Thanks,
    Brent Fulton
    UC Berkeley

  • #2
    egen is convenient but moderately slow. Turning to it is likely to worsen your problem, not to improve it.

    Try

    Code:
    viewsource egen.ado
    viewsource _gmax.ado
    viewsource _gtag.ado

    to see that -egen- alone consists of several commands, and that the max() and tag() functions also entail calling several commands.

    Your -foreach- call uses in where of is needed: it should be foreach x of varlist x1-x300.

    I suspect that this would be faster than your draft, but I can't promise that it would be faster than collapse.

    Code:
    sort v1 v2
    forval j = 1/300 {
       by v1 v2: gen double x`j'sum = sum(x`j')
    }
    keep v1 v2 x*sum
    bysort v1 v2: keep if _n == _N
    Note that there is no need to fire up any command, or create any extra variables, to hold the sums. They are already there in the last observation in each group. Similarly the tagging is unnecessary.
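    A toy example with made-up data shows the point:

    Code:
    * made-up data: the last running-sum value in each group is the group total
    clear
    input v1 x
    1 10
    1 20
    2  5
    2  7
    end
    sort v1
    by v1: gen double xsum = sum(x)
    by v1: keep if _n == _N
    list    // xsum is 30 for v1 == 1 and 12 for v1 == 2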

    EDIT: I took a key improvement, deleting an unnecessary line, from Clyde's post.
    Last edited by Nick Cox; 02 Dec 2015, 18:10.



    • #3
      With 25 million observations and 300 x variables, I think you're in for a long haul no matter what you do. Nonetheless, there are some marginal efficiencies to be gained by modifying your code.

      But first, let me question something. The narrative in your paragraph says you want to calculate sums, but that is not what your -foreach- loop does. Or, more accurately, it computes sums only if all non-missing values of the x variables are >= 0. I'm going to disregard your code and assume you really want to calculate sums. The key here is that the -gen- function sum() calculates a running sum, so the total for the by-group appears in the last observation of the by-group. That being so, there is no need to do any more calculation after the -gen ... sum()- command. You just need to keep the last observation, which also does not require creating a tag.

      Code:
      sort v1 v2
      foreach x of varlist x1-x300 {
          by v1 v2: gen `x'_sum = sum(`x')
      }
      by v1 v2: keep if _n == _N
      save data1, replace
      NOTE: While I normally prefer the transparency of the -egen- functions, they do contain a fair amount of overhead. So when the data set is huge and time is at a premium, it makes sense to go back to the underlying -gen- statements that -egen- is a wrapper for. Also, sorting only once on v1 and v2 should be sufficient. I'm not sure it makes a time difference because I would guess that bysort is smart enough to not re-sort if the data are already sorted, but I don't really know.

      That said, I don't know if this will save you a whole lot of time, but it will be somewhat faster. I can't see any way to do it faster than this, but perhaps somebody else who does will chime in.

      Added after seeing Nick's post: my code is almost identical to his. There may be better performance with -forvalues- than with -foreach-, so in that respect his code is better than mine. And casting the sum variables as doubles is a very good idea, too--it guards against rounding errors during cumulative summation. On the other hand, the -replace- command inside Nick's loop isn't necessary, because only the _N'th observation is going to be retained anyway.
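      To illustrate the doubles point with a minimal check: a float cannot store integers above 2^24 = 16,777,216 exactly, and with 25 million observations a sum can easily pass that.

      Code:
      * minimal check: float storage rounds large integers, double does not
      clear
      set obs 1
      gen double big = 2^24 + 1    // 16,777,217
      gen float  f   = big         // rounded to the nearest representable float
      gen double d   = big         // stored exactly
      format f d %12.0f
      list                         // f = 16777216, d = 16777217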
      Last edited by Clyde Schechter; 02 Dec 2015, 18:11.



      • #4
        Hi Nick and Clyde,

        Thank you for your recommendations. Clyde, thank you for catching my sum error, which may lead to another possibility. My x1-x300 variables are mostly 0-1 binary variables. For the time being, let's assume all are. I think the following would work, but it may take longer with -egen-. Any way to avoid -egen-?

        Code:
        foreach x of varlist x1-x300 {
            bysort v1 v2 `x': gen temp = `x' * _N    /* so temp is 0 wherever `x' is 0 */
            bysort v1 v2: egen `x'_sum = max(temp)
            drop temp
        }
        by v1 v2: keep if _n == _N


        Best,
        Brent



        • #5
          I don't think your calculation works as you intend. It is true that for x = 0, `x'*_N = 0 as well. But when `x' = 1, `x'*_N is _N. So temp will always be either 0 or _N. Then when you take the max value of temp within the by-group, it will always be _N, unless `x' = 0 throughout the by-group.

          I don't think there is any way to use the fact that `x' is always 0 or 1 to speed up the calculation of a sum beyond what Nick and I came up with in #2 and #3. The fact that `x' = 0 or 1 does open up other ways to calculate the sum, but all the ones I can think of are actually slower; one is sketched below.
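          For instance, -egen- with its total() function gets the sum in one line, but it brings back the overhead discussed in #2 (a sketch for a single variable):

          Code:
          * simple but slower alternative for one variable: egen carries extra overhead
          bysort v1 v2: egen x1_sum = total(x1)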



          • #6
            Thank you, Clyde, I will use your and Nick's original recommendations.
            Best,
            Brent



            • #7
              It was late my time when I last contributed to this thread, and now it's morning where I am.

              I don't understand the point Clyde raised about the sum. If there are negative values, then their absolute values are subtracted from the sum so far. That's what a sum is. The sum of 42, 0, -42 is 0. If you want something else, e.g. the sum of absolute values, you need to spell that out and the code will change.



              • #8
                @Nick. In the code in #1, if the values of x were 42, 0, and -42, then the corresponding values of temp would be 42, 42, and 0. Then x_sum, chosen as the max of temp, would be 42, which is different from the sum, 0. But if all values of x were non-negative, then the maximum value of the running total would indeed be the final value and would equal the sum.
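                The same check in code, for a single group:

                Code:
                * the running-sum-then-max approach from #1 vs the true sum
                clear
                input x
                 42
                  0
                -42
                end
                gen runsum = sum(x)         // 42, 42, 0
                egen x_max = max(runsum)    // 42: what the code in #1 keeps
                display runsum[_N]          // 0: the actual sum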



                • #9
                  Clyde: Thanks for that. It's a comment on the tacit assumption in Brent's code. That really was clear in your original post, as I now see. Sorry for the needless diversion.



                  • #10
                    Super-kludgy, but maybe worth trying -Rsource- (from SSC). And just to note, -Rsource- does require a little setup; a setup sketch follows the timings below.

                    Code:
                    * clear all
                    rsource, terminator(END_OF_R) roptions(--slave)
                    
                    ######################################################################
                    
                        # Start R code
                        # install.packages("nycflights13")
                        # install.packages("dplyr")
                        # install.packages("foreign")
                        library(nycflights13)
                        library(dplyr)
                        ptm <- proc.time()
                        summarise(flights, delay = median(distance, na.rm = TRUE))
                        proc.time() - ptm
                        fligths <- flights
                        write.csv(flights, file="nyflights13.csv")
                        
                        # End R code
                    
                    ######################################################################
                    
                    END_OF_R
                    
                    import delimited "/User/XYXY/nyflights13.csv", encoding(ISO-8859-1) clear
                    
                    timer on 1
                    collapse (median) distance
                    timer off 1
                    
                    timer list
                    So it looks like ca. 0.440 vs 0.7950 seconds; but for (sum) and (mean) Stata is faster than R... so your results may vary!
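                    On that setup: -Rsource- has to be pointed at the R executable, which is done with globals (the path below is hypothetical; adjust for your machine):

                    Code:
                    * one-time setup for -Rsource-: tell it where R lives (hypothetical path)
                    global Rterm_path    "/usr/local/bin/R"
                    global Rterm_options "--vanilla"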
                    Last edited by Andrew Lover; 03 Dec 2015, 20:48.
                    __________________________________________________
                    Assistant Professor, Department of Biostatistics and Epidemiology
                    School of Public Health and Health Sciences
                    University of Massachusetts- Amherst



                    • #11
                      The line fligths <- flights in #10 should be:

                      Code:
                      flights <- flights
                      __________________________________________________
                      Assistant Professor, Department of Biostatistics and Epidemiology
                      School of Public Health and Health Sciences
                      University of Massachusetts- Amherst

