I'm trying to optimize some data analysis in Stata 19.5. I have some large files that I am collapsing - a thousand variables and several million rows.
Code:
collapse (sum) c_*, by(p t)
The full output has about 1.2M observations and takes 155s to process.

Each variable in c_* is a byte. When I create the collapsed file, each variable should become an int. However, collapse creates each output variable as a double, which makes the collapsed file significantly larger than the original file (see the rough size arithmetic after the recast loop below). I can, of course, simply call:
Code:
compress
or:

Code:
foreach v of varlist c_* {
    recast int `v'
}
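To put rough numbers on the size point (a back-of-the-envelope sketch, assuming roughly the dimensions described above: a few million input rows, ~1.2M output rows, and 1,000 c_* variables; in Stata a byte takes 1 byte of storage, an int 2, and a double 8):

Code:
* back-of-the-envelope storage arithmetic (illustrative figures only)
display "input as byte:    " %13.0fc 3500000*1000*1 " bytes"
display "output as double: " %13.0fc 1200000*1000*8 " bytes"
display "output as int:    " %13.0fc 1200000*1000*2 " bytes"

So the double output is nearly three times the size of the byte input despite having roughly a third of the rows, while the int version would be smaller than the original.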
The issue, though, is that recasting gets progressively slower per variable as the number of variables in the data set grows.

Here is some sample code:
Code:
quietly foreach varcount in 10 50 100 250 {
    clear all
    set seed 1
    set obs 3500000
    gen p = round(runiform()*10000)
    gen t = round(runiform()*125)
    forvalues i = 1/`varcount' {
        gen byte c_`i' = round(runiform()*100)
    }
    timer on 1
    collapse (sum) c_*, by(p t)
    timer off 1
    timer on 2
    compress c_*
    timer off 2
    timer list
    noisily disp "Collapse time per variable (`varcount' variables): `=`r(t1)'/`varcount''"
    noisily disp "Compress time per variable (`varcount' variables): `=`r(t2)'/`varcount''"
}
Note the difference in time as the variable count increases - per-variable compress/recast time grows while the collapse time stays roughly flat. With 1000 variables, it takes approximately 1.8s per variable (about 30 minutes in total) to recast to int. Since I will have to do this many dozens of times, just the recasting could theoretically take days of processing.

Collapse time per variable (10 variables): .1587
Compress time per variable (10 variables): .0521
Collapse time per variable (50 variables): .10822
Compress time per variable (50 variables): .20896
Collapse time per variable (100 variables): .10996
Compress time per variable (100 variables): .29461
Collapse time per variable (250 variables): .12168
Compress time per variable (250 variables): .484288
Is there any way to constrain the output type with collapse, or to somehow speed this up? Or is there another approach that would accomplish the same thing (using collapse or something else)?
I tried gcollapse, but it produces the same output data types as collapse (and is a little slower).
I tried fcollapse, and it creates the ideal data type (int) thanks to its default compress option, but it sometimes produces incorrect results (I'm looking into that separately). It also has a similar, but worse, issue to my Mata command: it uses very large amounts of memory during processing (20x the file size).
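For reference, the alternative calls were of this form (a sketch only; both commands are user-written, from the gtools and ftools packages respectively):

Code:
* user-written alternatives (ssc install gtools / ssc install ftools)
gcollapse (sum) c_*, by(p t)    // doubles, like collapse
fcollapse (sum) c_*, by(p t)    // ints, via its default compress behavior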
I also wrote a command in Mata (I am not a good Mata programmer, so this had AI help), and it was able to generate the output variables directly as type int. While it collapses a little faster than collapse, it has the major disadvantage of using five times as much RAM as collapse while processing (10x the file size compared to 2x).
Code:
void group_sum_optimized(string rowvector group_vars)
{
    // Declare variables
    string rowvector c_vars, sort_cmd_parts
    string scalar varname, sort_cmd
    real scalar n_c_vars, n_groups, i, j, start_row, end_row, n_vars, n_group_vars
    real colvector unique_group_data
    real matrix c_data, result, info, group_data, unique_groups

    // Default to provider_group_id if no group vars specified
    if (args() == 0 || cols(group_vars) == 0) {
        group_vars = "provider_group_id"
    }
    n_group_vars = cols(group_vars)

    // Verify grouping variables exist
    for (i = 1; i <= n_group_vars; i++) {
        if (st_varindex(group_vars[i]) == .) {
            errprintf("Variable %s not found\n", group_vars[i])
            return
        }
    }

    // Get all variable names and find c_ variables
    n_vars = st_nvar()
    c_vars = J(1, 0, "")
    for (i = 1; i <= n_vars; i++) {
        varname = st_varname(i)
        if (substr(varname, 1, 2) == "c_") {
            c_vars = c_vars, varname
        }
    }
    n_c_vars = cols(c_vars)
    if (n_c_vars == 0) {
        errprintf("No variables starting with 'c_' found\n")
        return
    }
    printf("Found %g c_ variables\n", n_c_vars)
    printf("Grouping by: %s\n", invtokens(group_vars, " "))

    // Sort data by grouping variables
    sort_cmd_parts = J(1, n_group_vars, "")
    for (i = 1; i <= n_group_vars; i++) {
        sort_cmd_parts[i] = group_vars[i]
    }
    sort_cmd = "sort " + invtokens(sort_cmd_parts, " ")
    stata(sort_cmd)

    // Read the grouping variables and c_ variables
    group_data = st_data(., group_vars)
    c_data = st_data(., c_vars)

    // Handle missing values - convert to 0
    c_data = editmissing(c_data, 0)

    // Use panelsetup for efficient grouping
    if (n_group_vars == 1) {
        info = panelsetup(group_data, 1)
    }
    else {
        info = panelsetup(group_data, 1..n_group_vars)
    }
    n_groups = rows(info)
    printf("Processing %g observations into %g groups\n", rows(group_data), n_groups)

    // Initialize results
    result = J(n_groups, n_c_vars, 0)
    unique_groups = J(n_groups, n_group_vars, .)

    // Sum within groups
    for (i = 1; i <= n_groups; i++) {
        start_row = info[i, 1]
        end_row = info[i, 2]

        // Store unique group values
        unique_groups[i, .] = group_data[start_row, .]

        // Sum c_ variables for this group
        if (start_row == end_row) {
            result[i, .] = c_data[start_row, .]
        }
        else {
            result[i, .] = colsum(c_data[start_row..end_row, .])
        }

        if (mod(i, 5000) == 0) printf("Processed %g groups\n", i)
    }

    // Create new dataset (suppress output)
    stata("capture quietly drop _all")
    stata("quietly set obs " + strofreal(n_groups))

    // Add grouping variables
    for (i = 1; i <= n_group_vars; i++) {
        stata("quietly generate long " + group_vars[i] + " = .")
        st_store(., group_vars[i], unique_groups[., i])
    }

    // Add summed c_ variables (with progress indicator)
    printf("Creating %g result variables", n_c_vars)
    for (i = 1; i <= n_c_vars; i++) {
        stata("quietly generate int " + c_vars[i] + " = .")
        st_store(., c_vars[i], result[., i])
        if (mod(i, 100) == 0) {
            printf(".")
            displayflush()
        }
    }
    printf(" done\n")

    printf("Completed: %g groups with %g c_ variables and %g grouping variables\n", n_groups, n_c_vars, n_group_vars)
}
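For completeness, this is roughly how the function gets called (a sketch, assuming it has been compiled in Mata and that p and t are the grouping variables, as in the example above):

Code:
mata: group_sum_optimized(("p", "t"))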