  • Collapsing and setting/preserving variable types

    I'm trying to optimize some data analysis in Stata 19.5. I have some large files that I am collapsing - a thousand variables and several million rows.

    Code:
    collapse (sum) c_*, by(p t)
    The full output has about 1.2M observations and takes 155s to process.

    Each variable in c_* is a byte. When I create the collapsed file, each variable should become an int. However, collapse creates each output variable as a double, which makes the collapsed file significantly larger than the original file. I can, of course, simply call:

    Code:
    compress
    or:

    Code:
    foreach v of varlist c_* {
        recast int `v'
    }
    The issue, though, is that recasting gets progressively slower as the number of variables in the dataset grows.

    Here is some sample code:

    Code:
    quietly foreach varcount in 10 50 100 250 {
        clear all
        set seed 1
        set obs 3500000
        gen p = round(runiform()*10000)
        gen t = round(runiform()*125)
        forvalues i = 1/`varcount' {
            gen byte c_`i' = round(runiform()*100)
        }
    
        timer on 1
        collapse (sum) c_*, by(p t)
        timer off 1
    
        timer on 2
        compress c_*
        timer off 2
    
        timer list
    
        noisily disp "Collapse time per variable (`varcount' variables): `=`r(t1)'/`varcount''"
        noisily disp "Compress time per variable (`varcount' variables): `=`r(t2)'/`varcount''"
    }
    Note how the times change as the variable count increases: the collapse time per variable stays roughly flat, while compressing/recasting gets much slower. With 1000 variables, it is approximately 1.8s per variable (about 30 minutes in total) to recast as int. Since I will have to do this many dozens of times, just the recasting could take days of processing.



    Collapse time per variable (10 variables): .1587
    Compress time per variable (10 variables): .0521
    Collapse time per variable (50 variables): .10822
    Compress time per variable (50 variables): .20896
    Collapse time per variable (100 variables): .10996
    Compress time per variable (100 variables): .29461
    Collapse time per variable (250 variables): .12168
    Compress time per variable (250 variables): .484288

    Is there any way to define the output type with collapse to constrain it or to somehow speed this up? Or is there another approach that I could use to accomplish the same thing (using collapse or something else)?

    I tried gcollapse, but it has the same output datatypes as collapse (and is a little slower).

    I tried fcollapse, and it creates the ideal datatype (int) because of its default compress option, but it sometimes returns incorrect results (I'm looking into this separately). It also has a similar, but worse, issue than my Mata command: it uses a very large amount of memory during processing (about 20x the file size).
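
    For reference, the calls I tried were along these lines (a sketch; gcollapse is from the gtools package and fcollapse from ftools, both on SSC, and both mirror collapse's syntax):

    Code:
    * gtools version: same double output types as collapse
    gcollapse (sum) c_*, by(p t)

    * ftools version: compresses the output types (to int here) by default
    fcollapse (sum) c_*, by(p t)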


    I also wrote a command in Mata (I am not a good Mata programmer, so this had AI help), and it was able to generate the output variables directly as type int. While it collapses a little faster than collapse, it has the major disadvantage of using 5 times as much RAM as collapse while processing (10x the file size compared to 2x).

    Code:
    void group_sum_optimized(|string rowvector group_vars)    // | marks the argument as optional so the default below can apply
    {
        // Declare variables
        string rowvector c_vars, sort_cmd_parts
        string scalar varname, sort_cmd
        real scalar n_c_vars, n_groups, i, j, start_row, end_row, n_vars, n_group_vars
        real colvector unique_group_data
        real matrix c_data, result, info, group_data, unique_groups
        
        // Default to provider_group_id if no group vars specified
        if (args() == 0 || cols(group_vars) == 0) {
            group_vars = "provider_group_id"
        }
        n_group_vars = cols(group_vars)
        
        // Verify grouping variables exist
        for (i = 1; i <= n_group_vars; i++) {
            if (st_varindex(group_vars[i]) == .) {
                errprintf("Variable %s not found\n", group_vars[i])
                return
            }
        }
        
        // Get all variable names and find c_ variables
        n_vars = st_nvar()
        c_vars = J(1, 0, "")
        
        for (i = 1; i <= n_vars; i++) {
            varname = st_varname(i)
            if (substr(varname, 1, 2) == "c_") {
                c_vars = c_vars, varname
            }
        }
        
        n_c_vars = cols(c_vars)
        
        if (n_c_vars == 0) {
            errprintf("No variables starting with 'c_' found\n")
            return
        }
        
        printf("Found %g c_ variables\n", n_c_vars)
        printf("Grouping by: %s\n", invtokens(group_vars, " "))
        
        // Sort data by grouping variables
        sort_cmd_parts = J(1, n_group_vars, "")
        for (i = 1; i <= n_group_vars; i++) {
            sort_cmd_parts[i] = group_vars[i]
        }
        sort_cmd = "sort " + invtokens(sort_cmd_parts, " ")
        stata(sort_cmd)
        
        // Read the grouping variables and c_ variables
        group_data = st_data(., group_vars)
        c_data = st_data(., c_vars)
        
        // Handle missing values - convert to 0
        c_data = editmissing(c_data, 0)
        
        // Use panelsetup for efficient grouping
        if (n_group_vars == 1) {
            info = panelsetup(group_data, 1)
        } else {
            info = panelsetup(group_data, 1..n_group_vars)
        }
        
        n_groups = rows(info)
        
        printf("Processing %g observations into %g groups\n", rows(group_data), n_groups)
        
        // Initialize results
        result = J(n_groups, n_c_vars, 0)
        unique_groups = J(n_groups, n_group_vars, .)
        
        // Sum within groups
        for (i = 1; i <= n_groups; i++) {
            start_row = info[i, 1]
            end_row = info[i, 2]
            
            // Store unique group values
            unique_groups[i, .] = group_data[start_row, .]
            
            // Sum c_ variables for this group
            if (start_row == end_row) {
                result[i, .] = c_data[start_row, .]
            } else {
                result[i, .] = colsum(c_data[start_row..end_row, .])
            }
            
            if (mod(i, 5000) == 0) printf("Processed %g groups\n", i)
        }
        
        // Create new dataset (suppress output)
        stata("capture quietly drop _all")
        stata("quietly set obs " + strofreal(n_groups))
        
        // Add grouping variables
        for (i = 1; i <= n_group_vars; i++) {
            stata("quietly generate long " + group_vars[i] + " = .")
            st_store(., group_vars[i], unique_groups[., i])
        }
        
        // Add summed c_ variables (with progress indicator)
        printf("Creating %g result variables", n_c_vars)
        for (i = 1; i <= n_c_vars; i++) {
            stata("quietly generate int " + c_vars[i] + " = .")
            st_store(., c_vars[i], result[., i])
            if (mod(i, 100) == 0) {
                printf(".")
                displayflush()
            }
        }
        printf(" done\n")
        
        printf("Completed: %g groups with %g c_ variables and %g grouping variables\n",
               n_groups, n_c_vars, n_group_vars)
    }
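
    I call it from Stata roughly like this (a sketch of the call; the function takes the grouping variables as a string rowvector, assumes they are numeric, and has the c_* prefix hard-coded):

    Code:
    mata: group_sum_optimized(("p", "t"))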
    Last edited by David Muhlestein; 27 Jun 2025, 10:40.

  • #2
    If I'm correctly understanding what you need, I think there's a somewhat slower but memory-saving approach using -egen- to do the summation. Instead of -collapse-, what about something like this:
    Code:
    foreach v of varlist c_* {
       egen int sum`v' = total(`v'), by(p t)
       drop `v'
    }
    egen tag = tag(p t)
    keep if tag



    • #3
      At least on my machine, the egen method in #2 seems to be an order of magnitude slower than using collapse + compress.

      I modified the code in #1 to include #2:

      Code:
      quietly foreach varcount in 10 50 100 250 {
          clear all
          set seed 1
          set obs 3500000
          gen p = round(runiform()*10000)
          gen t = round(runiform()*125)
          forvalues i = 1/`varcount' {
              gen byte c_`i' = round(runiform()*100)
          }
      
          preserve
              timer on 1
              collapse (sum) c_*, by(p t)
              timer off 1
      
              timer on 2
              compress c_*
              timer off 2
          restore
          
          timer on 3    
          foreach v of varlist c_* {
             egen int sum`v' = total(`v'), by(p t)
             drop `v'
          }
          egen tag = tag(p t)
          keep if tag
          timer off 3
          
          timer list
      
          noisily disp "Collapse time per variable (`varcount' variables): `=`r(t1)'/`varcount''"
          noisily disp "Compress time per variable (`varcount' variables): `=`r(t2)'/`varcount''"
          noisily disp "Egen time per variable (`varcount' variables): `=`r(t3)'/`varcount''"
          
      }
      and got this:
      Code:
      Collapse time per variable (10 variables): .2484
      Compress time per variable (10 variables): .0554
      Egen time per variable (10 variables): 2.7405
      Collapse time per variable (50 variables): .23966
      Compress time per variable (50 variables): .37028
      Egen time per variable (50 variables): 2.76216
      Collapse time per variable (100 variables): .22078
      Compress time per variable (100 variables): .58587
      Egen time per variable (100 variables): 3.32624
      (I didn't have the patience to wait for the iteration with 250 variables). It's worth mentioning that I'm using Stata 19.5/MP-2 Core on a Mac with an M2 Pro chip.



      • #4
        Originally posted by Mike Lacy:
        If I'm correctly understanding what you need, I think there's a somewhat slower but memory-saving approach using -egen- to do the summation. Instead of -collapse-, what about something like this:
        Thanks, Mike. Yes, this is an approach that works. I ran Hemanshu's code and got this for timing:


        Collapse time per variable (10 variables): .1604
        Compress time per variable (10 variables): .0787
        Egen time per variable (10 variables): 2.2939
        Collapse time per variable (50 variables): .1045
        Compress time per variable (50 variables): .18456
        Egen time per variable (50 variables): 2.2004
        Collapse time per variable (100 variables): .10919
        Compress time per variable (100 variables): .28265
        Egen time per variable (100 variables): 2.32091
        Collapse time per variable (250 variables): .10702
        Compress time per variable (250 variables): .454496
        Egen time per variable (250 variables): 3.016404
        It gets progressively slower with more variables, but it is quite memory efficient - its memory footprint is only about 10% bigger than the base file. My goal is to optimize a combination of speed, final storage space, and maximum RAM utilization. This approach is definitely the best on RAM, and the final file comes out compressed, but it's also the slowest. With 1000 variables, it was taking 6.5s per variable, so that would be about 2 hours to process the whole file.



        • #5
          Also, for some reference, the sample file is only part of what I ultimately process. I have files with 12k variables and 25M observations. I have a machine with 512GB of RAM, but even with that, I've found that Stata can only consistently collapse a file of about 3.5M observations and 1000 variables without intermittently hitting an I/O or memory error, so I have to chunk my larger files, process the chunks, and then combine them. One of my larger files can be 12*7 chunks, leading to 84 collapses plus the splitting and joining, so anything I can do to be more efficient is really valuable. When I'm chunking the files, I have to store the chunks (either on disk or in memory), and when the default numeric type is double, that is problematic.
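
          For context, the chunk-and-recombine pattern looks roughly like this (a simplified sketch; chunk1.dta through chunk7.dta stand in for pre-split pieces of one large file, and the variable-block splits are omitted):

          Code:
          tempfile combined
          quietly forvalues chunk = 1/7 {
              use "chunk`chunk'.dta", clear
              collapse (sum) c_*, by(p t)
              compress c_*
              capture append using `combined'
              save `combined', replace
          }
          * a (p t) group can appear in more than one chunk, so collapse once more
          use `combined', clear
          collapse (sum) c_*, by(p t)
          compress c_*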

          I'm also confused about why it takes Stata so much longer to process a single variable (such as with compress) when there are many variables in memory - if the data are already in memory, a given variable shouldn't take any longer to process. Any insights into that would be welcome.



          • #6
            Here are some other ideas that might help with the -egen- approach.

            1) It occurred to me that the -by- option on -egen- necessitates sorting the data. So I tried putting -sort p t- before the -egen- command, on the idea that -egen- is smart enough to check whether the data are already sorted. It apparently does do that: in the small example I tried (3 variables), run time dropped by 5X when I put the sort in. (A sketch is below, after point 2.)

            2) You might also want to try -runby- here (-ssc describe runby-). I admire it but have never used it; my understanding is that it can offer big improvements for programs involving -by-, and that it would eliminate the use of the -by- option on -egen-, which might help a lot.
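
            For idea 1), a minimal sketch on the same p/t/c_* setup as in #1 and #3:

            Code:
            sort p t                  // sort once up front so each egen ..., by(p t) can skip re-sorting
            foreach v of varlist c_* {
                egen int sum`v' = total(`v'), by(p t)
                drop `v'
            }
            egen tag = tag(p t)
            keep if tag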

