  • Collapsing and setting/preserving variable types

    I'm trying to optimize some data analysis in Stata 19.5. I have some large files that I am collapsing - a thousand variables and several million rows.

    Code:
    collapse (sum) c_*, by(p t)
    The full output has about 1.2M observations and takes 155s to process.

    Each variable in c_* is a byte. When I create the collapsed file, each variable should become an int. However, collapse creates each output variable as a double, which makes the collapsed file significantly larger than the original file. I can, of course, simply call:

    Code:
    compress
    or:

    Code:
    foreach v of varlist c_* {
        recast int `v'
    }
    The issue, though, is that recasting gets progressively slower as the number of variables in the dataset grows.

    Here is some sample code:

    Code:
    quietly foreach varcount in 10 50 100 250 {
        clear all
        set seed 1
        set obs 3500000
        gen p = round(runiform()*10000)
        gen t = round(runiform()*125)
        forvalues i = 1/`varcount' {
            gen byte c_`i' = round(runiform()*100)
        }
    
        timer on 1
        collapse (sum) c_*, by(p t)
        timer off 1
    
        timer on 2
        compress c_*
        timer off 2
    
        timer list
    
        noisily disp "Collapse time per variable (`varcount' variables): `=`r(t1)'/`varcount''"
        noisily disp "Compress time per variable (`varcount' variables): `=`r(t2)'/`varcount''"
    }
    Note how the times change as the variable count increases: the collapse time per variable stays roughly flat, while compressing/recasting gets much slower. With 1000 variables, it is approximately 1.8s per variable (about 30 minutes in total) to recast as int. Since I will have to do this many dozens of times, just the recasting could take days of processing.



    Collapse time per variable (10 variables): .1587
    Compress time per variable (10 variables): .0521
    Collapse time per variable (50 variables): .10822
    Compress time per variable (50 variables): .20896
    Collapse time per variable (100 variables): .10996
    Compress time per variable (100 variables): .29461
    Collapse time per variable (250 variables): .12168
    Compress time per variable (250 variables): .484288

    Is there any way to define the output type with collapse to constrain it or to somehow speed this up? Or is there another approach that I could use to accomplish the same thing (using collapse or something else)?

    I tried gcollapse, but it has the same output datatypes as collapse (and is a little slower).

    I tried fcollapse, and it creates the ideal datatype (int) because of its default compress option, but it sometimes returns incorrect results (I'm looking into this separately). It also has a similar, but worse, issue than my Mata command: it uses a very large amount of memory during processing (about 20x the file size).
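
    For reference, the calls I tried were along these lines (a sketch; gcollapse is from the gtools package and fcollapse from ftools, both on SSC, and both mirror collapse's syntax):

    Code:
    * gtools version: same double output types as collapse
    gcollapse (sum) c_*, by(p t)

    * ftools version: compresses the output types (to int here) by default
    fcollapse (sum) c_*, by(p t)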


    I also wrote a command in Mata (I am not a good Mata programmer, so this had AI help), and it was able to generate the output variables directly as type int. While it collapses a little faster than collapse, it has the major disadvantage of using 5 times as much RAM as collapse while processing (10x the file size compared to 2x).

    Code:
    void group_sum_optimized(|string rowvector group_vars)    // | marks the argument as optional so the default below can apply
    {
        // Declare variables
        string rowvector c_vars, sort_cmd_parts
        string scalar varname, sort_cmd
        real scalar n_c_vars, n_groups, i, j, start_row, end_row, n_vars, n_group_vars
        real colvector unique_group_data
        real matrix c_data, result, info, group_data, unique_groups
        
        // Default to provider_group_id if no group vars specified
        if (args() == 0 || cols(group_vars) == 0) {
            group_vars = "provider_group_id"
        }
        n_group_vars = cols(group_vars)
        
        // Verify grouping variables exist
        for (i = 1; i <= n_group_vars; i++) {
            if (st_varindex(group_vars[i]) == .) {
                errprintf("Variable %s not found\n", group_vars[i])
                return
            }
        }
        
        // Get all variable names and find c_ variables
        n_vars = st_nvar()
        c_vars = J(1, 0, "")
        
        for (i = 1; i <= n_vars; i++) {
            varname = st_varname(i)
            if (substr(varname, 1, 2) == "c_") {
                c_vars = c_vars, varname
            }
        }
        
        n_c_vars = cols(c_vars)
        
        if (n_c_vars == 0) {
            errprintf("No variables starting with 'c_' found\n")
            return
        }
        
        printf("Found %g c_ variables\n", n_c_vars)
        printf("Grouping by: %s\n", invtokens(group_vars, " "))
        
        // Sort data by grouping variables
        sort_cmd_parts = J(1, n_group_vars, "")
        for (i = 1; i <= n_group_vars; i++) {
            sort_cmd_parts[i] = group_vars[i]
        }
        sort_cmd = "sort " + invtokens(sort_cmd_parts, " ")
        stata(sort_cmd)
        
        // Read the grouping variables and c_ variables
        group_data = st_data(., group_vars)
        c_data = st_data(., c_vars)
        
        // Handle missing values - convert to 0
        c_data = editmissing(c_data, 0)
        
        // Use panelsetup for efficient grouping
        if (n_group_vars == 1) {
            info = panelsetup(group_data, 1)
        } else {
            info = panelsetup(group_data, 1..n_group_vars)
        }
        
        n_groups = rows(info)
        
        printf("Processing %g observations into %g groups\n", rows(group_data), n_groups)
        
        // Initialize results
        result = J(n_groups, n_c_vars, 0)
        unique_groups = J(n_groups, n_group_vars, .)
        
        // Sum within groups
        for (i = 1; i <= n_groups; i++) {
            start_row = info[i, 1]
            end_row = info[i, 2]
            
            // Store unique group values
            unique_groups[i, .] = group_data[start_row, .]
            
            // Sum c_ variables for this group
            if (start_row == end_row) {
                result[i, .] = c_data[start_row, .]
            } else {
                result[i, .] = colsum(c_data[start_row..end_row, .])
            }
            
            if (mod(i, 5000) == 0) printf("Processed %g groups\n", i)
        }
        
        // Create new dataset (suppress output)
        stata("capture quietly drop _all")
        stata("quietly set obs " + strofreal(n_groups))
        
        // Add grouping variables
        for (i = 1; i <= n_group_vars; i++) {
            stata("quietly generate long " + group_vars[i] + " = .")
            st_store(., group_vars[i], unique_groups[., i])
        }
        
        // Add summed c_ variables (with progress indicator)
        printf("Creating %g result variables", n_c_vars)
        for (i = 1; i <= n_c_vars; i++) {
            stata("quietly generate int " + c_vars[i] + " = .")
            st_store(., c_vars[i], result[., i])
            if (mod(i, 100) == 0) {
                printf(".")
                displayflush()
            }
        }
        printf(" done\n")
        
        printf("Completed: %g groups with %g c_ variables and %g grouping variables\n",
               n_groups, n_c_vars, n_group_vars)
    }
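
    I call it from Stata roughly like this (a sketch of the call; the function takes the grouping variables as a string rowvector, assumes they are numeric, and has the c_* prefix hard-coded):

    Code:
    mata: group_sum_optimized(("p", "t"))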
    Last edited by David Muhlestein; 27 Jun 2025, 10:40.

  • #2
    If I'm correctly understanding what you need, I think there's a somewhat slower but memory-saving approach using -egen- to do the summation. Instead of -collapse-, what about something like this:
    Code:
    foreach v of varlist c_* {
       egen int sum`v' = total(`v'), by(p t)
       drop `v'
    }
    egen tag = tag(p t)
    keep if tag



    • #3
      At least on my machine, the egen method in #2 seems to be an order of magnitude slower than using collapse + compress.

      I modified the code in #1 to include #2:

      Code:
      quietly foreach varcount in 10 50 100 250 {
          clear all
          set seed 1
          set obs 3500000
          gen p = round(runiform()*10000)
          gen t = round(runiform()*125)
          forvalues i = 1/`varcount' {
              gen byte c_`i' = round(runiform()*100)
          }
      
          preserve
              timer on 1
              collapse (sum) c_*, by(p t)
              timer off 1
      
              timer on 2
              compress c_*
              timer off 2
          restore
          
          timer on 3    
          foreach v of varlist c_* {
             egen int sum`v' = total(`v'), by(p t)
             drop `v'
          }
          egen tag = tag(p t)
          keep if tag
          timer off 3
          
          timer list
      
          noisily disp "Collapse time per variable (`varcount' variables): `=`r(t1)'/`varcount''"
          noisily disp "Compress time per variable (`varcount' variables): `=`r(t2)'/`varcount''"
          noisily disp "Egen time per variable (`varcount' variables): `=`r(t3)'/`varcount''"
          
      }
      and got this:
      Code:
      Collapse time per variable (10 variables): .2484
      Compress time per variable (10 variables): .0554
      Egen time per variable (10 variables): 2.7405
      Collapse time per variable (50 variables): .23966
      Compress time per variable (50 variables): .37028
      Egen time per variable (50 variables): 2.76216
      Collapse time per variable (100 variables): .22078
      Compress time per variable (100 variables): .58587
      Egen time per variable (100 variables): 3.32624
      (I didn't have the patience to wait for the iteration with 250 variables). It's worth mentioning that I'm using Stata 19.5/MP-2 Core on a Mac with an M2 Pro chip.



      • #4
        Originally posted by Mike Lacy:
        If I'm correctly understanding what you need, I think there's a somewhat slower but memory-saving approach using -egen- to do the summation. Instead of -collapse-, what about something like this:
        Thanks, Mike. Yes, this is an approach that works. I ran Hemanshu's code and got this for timing:


        Collapse time per variable (10 variables): .1604
        Compress time per variable (10 variables): .0787
        Egen time per variable (10 variables): 2.2939
        Collapse time per variable (50 variables): .1045
        Compress time per variable (50 variables): .18456
        Egen time per variable (50 variables): 2.2004
        Collapse time per variable (100 variables): .10919
        Compress time per variable (100 variables): .28265
        Egen time per variable (100 variables): 2.32091
        Collapse time per variable (250 variables): .10702
        Compress time per variable (250 variables): .454496
        Egen time per variable (250 variables): 3.016404
        It gets progressively slower with more variables, but it is quite memory efficient - its memory footprint is only about 10% bigger than the base file. My goal is to optimize a combination of speed, final storage space, and maximum RAM utilization. This approach is definitely the best on RAM, and the final file comes out compressed, but it's also the slowest. With 1000 variables, it was taking 6.5s per variable, so that would be about 2 hours to process the whole file.



        • #5
          Also, for some reference, the sample file is only part of what I ultimately process. I have files with 12k variables and 25M observations. I have a machine with 512GB of RAM, but even with that, I've found that Stata can only consistently collapse a file of about 3.5M observations and 1000 variables without intermittently hitting an I/O or memory error, so I have to chunk my larger files, process the chunks, and then combine them. One of my larger files can be 12*7 chunks, leading to 84 collapses plus the splitting and joining, so anything I can do to be more efficient is really valuable. When I'm chunking the files, I have to store the chunks (either on disk or in memory), and when the default numeric type is double, that is problematic.
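
          For context, the chunk-and-recombine pattern looks roughly like this (a simplified sketch; chunk1.dta through chunk7.dta stand in for pre-split pieces of one large file, and the variable-block splits are omitted):

          Code:
          tempfile combined
          quietly forvalues chunk = 1/7 {
              use "chunk`chunk'.dta", clear
              collapse (sum) c_*, by(p t)
              compress c_*
              capture append using `combined'
              save `combined', replace
          }
          * a (p t) group can appear in more than one chunk, so collapse once more
          use `combined', clear
          collapse (sum) c_*, by(p t)
          compress c_*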

          I'm also confused about why it takes Stata so much longer to process a single variable (such as with compress) when there are many variables in memory - if the data are already in memory, a given variable shouldn't take any longer to process. Any insights into that would be welcome.



          • #6
            Here are some other ideas that might help with the -egen- approach.

            1) It occurred to me that the -by- option on -egen- necessitates sorting the data. So I tried putting -sort p t- before the -egen- command, on the idea that -egen- is smart enough to check whether the data are already sorted. It apparently does do that: in the small example I tried (3 variables), run time dropped by 5X when I put the sort in. (A sketch is below, after point 2.)

            2) You might also want to try -runby- here (-ssc describe runby-). I admire it but have never used it; my understanding is that it can offer big improvements for programs involving -by-, and that it would eliminate the use of the -by- option on -egen-, which might help a lot.
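
            For idea 1), a minimal sketch on the same p/t/c_* setup as in #1 and #3:

            Code:
            sort p t                  // sort once up front so each egen ..., by(p t) can skip re-sorting
            foreach v of varlist c_* {
                egen int sum`v' = total(`v'), by(p t)
                drop `v'
            }
            egen tag = tag(p t)
            keep if tag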

