Putexcel problems with large files and many loops

Zach Conrad

Join Date: Sep 2023

Posts: 5
#1

Putexcel problems with large files and many loops

17 Sep 2023, 10:46

Hi Stata users. I'm using Stata 16.1 on Windows 11 (12 GB RAM), and I've noticed that the putexcel command doesn't reliably transfer the results from Stata commands into Excel spreadsheets when there are many loops and results to write and when the file size is too big (see code below; this code runs for about 24 hours on my computer). I have 33K observations and 4K variables. When I say "reliably transfer the results" I mean that sometimes the data for entire columns are not transferred to the Excel spreadsheets even though the relevant command line is valid and will produce results in the Stata window. In other cases, Stata doesn't transfer the results into some cells in Excel and this appears to be random (as opposed to the situation I noted above in which entire columns are sometimes omitted). Has anyone experienced this before?

foreach impact in TFD_CO2eq TFD_MJ TFD_WSFleq purch_cost {
foreach index in HEI AHEI NRF Med aMed hPDI DASH {
forval quintile=1/5 {
local k=1
foreach food in _tot dairy 101 102 103 104 105 106 ///
protein redmeat 201 202 203 204 205 206 207 plantprot 208 209 211 ///
sandwich 301 302 303 305 ///
soup 401 402 403 soupplant 404 405 406 ///
501 ///
grain grainref 601 602 603 604 605 606 607 608 609 grainwhole 701 702 703 704 705 706 707 708 709 ///
fruit fruitwhole 801 802 803 fruitjuice 804 805 ///
veg 901 902 903 904 905 ///
oil 911 912 913 ///
dessert 921 922 923 924 925 926 ///
bev 931 soda 932 933 935 936 937 938 939 ///
999 {
foreach x in `impact'`food' {
putexcel set "C:\Users\[file path]", sheet ("`impact'_`index'_`quintile'") modify
putexcel A1=("category") B1=("food") C1=("index quintile") D1=("n") E1=("mean") F1=("se") G1=("t") H1=("p") I1=("lb") J1=("ub") K1=("df") L1=("crit") M1=("eform")
local varlabel: var label `x'
local k=`k'+1
capture quietly svy, subpop(if sample==1 & `index'_quant==`quintile'): reg `impact'`food' kcal wave
putexcel D`k'=(e(N_sub))
capture quietly margins, at(kcal=2023) subpop(if sample==1 & `index'_quant==`quintile') vce(unconditional) post
matrix M=r(table)'
capture quietly putexcel E`k'=matrix (M)
capture quietly putexcel A`k'="`impact'" B`k'="`varlabel'" C`k'="`index'_`quintile'"
}
}
}
}
}
Tags: None

Daniel Schaefer

Join Date: Mar 2020
Posts: 822

18 Sep 2023, 15:47

Hi Zack, welcome to the forum.

This is a little hard to debug for a few reasons. First, you don't provide example data, so its difficult to run your code to reproduce the error. Second, the formatting here is a little wild, and the code is difficult to read. The CODE tags really help here. I might format this like so:

Code:

foreach impact in TFD_CO2eq TFD_MJ TFD_WSFleq purch_cost {
    foreach index in HEI AHEI NRF Med aMed hPDI DASH {
        forval quintile=1/5 {
            local k=1
            foreach food in _tot dairy 101 102 103 104 105 106 protein redmeat 201 ///
            202 203 204 205 206 207 plantprot 208 209 211 sandwich 301 302 303 305 ///
            soup 401 402 403 soupplant 404 405 406 501 grain grainref 601 602 603 ///
            604 605 606 607 608 609 grainwhole 701 702 703 704 705 706 707 708 709 ///
            fruit fruitwhole 801 802 803 fruitjuice 804 805 veg 901 902 903 904 905 ///
            oil 911 912 913 dessert 921 922 923 924 925 926 bev 931 soda 932 933 935 ///
            936 937 938 939 999 {
                foreach x in `impact'`food' {
                    putexcel set "C:\Users\[file path]", ///
                        sheet ("`impact'_`index'_`quintile'") modify
                    putexcel A1=("category") B1=("food") C1=("index quintile") ///
                        D1=("n") E1=("mean") F1=("se") G1=("t") H1=("p") I1=("lb") ///
                        J1=("ub") K1=("df") L1=("crit") M1=("eform")
                    local varlabel: var label `x'
                    local k=`k'+1
                    capture quietly svy, subpop(if sample==1 & `index'_quant==`quintile'): ///
                        reg `impact'`food' kcal wave
                    putexcel D`k'=(e(N_sub))
                    capture quietly margins, at(kcal=2023) subpop(if sample==1 & ///
                        `index'_quant==`quintile') vce(unconditional) post
                    matrix M=r(table)'
                    capture quietly putexcel E`k'=matrix (M)
                    capture quietly putexcel A`k'="`impact'" B`k'="`varlabel'" ///
                        C`k'="`index'_`quintile'"
                }
            }
        }
    }
}

I notice your innermost loop appears to have no effect because there is only one `impact'`food' value, except to set x equal to `impact'`food' for a single iteration. Also note that you don't need to preform -putexcel set- on each iteration of the "food" loop. Just put the -putexcel set- command on the fifth line of the loop.

You also don't need to write the file to your hard drive any time you make a change to the excel document. The -putexcel set- open option should fix this. That should make the algorithm much faster.

I'm curious to know why you want to -capture- errors from the innermost loop. Using -capture- on these lines would seem to imply that there are times when you expect that code to raise an error and you don't want to stop execution of the loop when it does. Seems like, as a side effect, this might be hiding the error that causes the column to be missing. I can see why you might want to do some of this -quietly- since it takes extra time to write things to the console, but note that this can also make it difficult to find problems when you are debugging.

Also seems like you aren't writing the value of the food column correctly. Edit: okay, so what you were doing here with the food column should work, I just don't have the same labels you do (or any labels really). My mistake.

I randomly generate some data that I think should look vaguely like your data, then I edit your code for each of these issues to get the following:

Code:

clear
set obs 33000

foreach index in HEI AHEI NRF Med aMed hPDI DASH{
    gen `index'_quant = runiformint(1, 5)
}

foreach impact in TFD_CO2eq TFD_MJ TFD_WSFleq purch_cost {
    foreach food in _tot dairy 101 102 103 104 105 106 protein redmeat ///
        201 202 203 204 205 206 207 plantprot 208 209 211 sandwich 301 ///
        302 303 305 soup 401 402 403 soupplant 404 405 406 501 grain ///
        grainref 601 602 603 604 605 606 607 608 609 grainwhole 701 ///
        702 703 704 705 706 707 708 709 fruit fruitwhole 801 802 803 ///
        fruitjuice 804 805 veg 901 902 903 904 905 oil 911 912 913 ///
        dessert 921 922 923 924 925 926 bev 931 soda 932 933 935 936 ///
        937 938 939 999 {
            gen `impact'`food' = runiform()
        }
}

gen wave = runiformint(1, 8)
gen kcal = runiformint(2000, 2030)
gen psu = runiformint(1, 50)
gen sample = runiformint(0, 1)

svyset psu

Code:

timer clear 1
timer on 1
foreach impact in TFD_CO2eq TFD_MJ TFD_WSFleq purch_cost {
    foreach index in HEI AHEI NRF Med aMed hPDI DASH {
        forval quintile=1/5 {
            local k=1
            di "Setup new putexcel for impact:`impact' index:`index' quintile:`quintile'"
            quietly putexcel set "[file path]", ///
                sheet ("`impact'_`index'_`quintile'") modify open
            foreach food in _tot dairy 101 102 103 104 105 106 protein redmeat ///
            201 202 203 204 205 206 207 plantprot 208 209 211 sandwich 301 ///
            302 303 305 soup 401 402 403 soupplant 404 405 406 501 grain ///
            grainref 601 602 603 604 605 606 607 608 609 grainwhole 701 ///
            702 703 704 705 706 707 708 709 fruit fruitwhole 801 802 803 ///
            fruitjuice 804 805 veg 901 902 903 904 905 oil 911 912 913 ///
            dessert 921 922 923 924 925 926 bev 931 soda 932 933 935 936 ///
            937 938 939 999 {
                putexcel A1=("category") B1=("food") C1=("index quintile") ///
                    D1=("n") E1=("mean") F1=("se") G1=("t") H1=("p") ///
                    I1=("lb") J1=("ub") K1=("df") L1=("crit") M1=("eform")
                //local varlabel: var label `impact'`food'
                local k=`k'+1
                quietly svy, subpop(if sample==1 & `index'_quant==`quintile'): ///
                    reg `impact'`food' kcal wave
                putexcel D`k'=(e(N_sub))
                quietly margins, at(kcal=2023) subpop(if sample==1 & ///
                    `index'_quant==`quintile') vce(unconditional) post
                matrix M=r(table)'
                putexcel E`k'=matrix (M)
                putexcel A`k'="`impact'" B`k'="`impact'`food'" C`k'="`index'_`quintile'"
            }
            putexcel save
        }
    }
}
timer off 1
timer list 1

I don't test your code directly because I believe you when you say it takes 24 hours to run, and I don't want to tie up my Stata installation for that long. I ran the code I provide above and it took about an hour and 35 minutes (4857.99) seconds. However, I was not able to reproduce your issue. All of the columns have data filled in (at least, they do after I fix the issue with the "food" column) and I'm not seeing any missing cells here as I flip through the results.

Last edited by Daniel Schaefer; 18 Sep 2023, 16:40.

Comment

Daniel Schaefer

Join Date: Mar 2020

Posts: 822
#3

18 Sep 2023, 16:33

You know, after giving this a little more thought, my best guess is that some of the models you are estimating turn out to be nonsensical for some idiosyncratic reason, and you are getting missing values for estimated coefficients for that reason. Regardless, the problem is almost certainly not with the -putexcel- command failing to transfer some data, nor should you be anywhere close to running out of application memory (and even if you did, it wouldn't manifest like this).
Comment
Zach Conrad

Join Date: Sep 2023

Posts: 5
#4

23 Sep 2023, 10:52

Thanks for your help and patience, Daniel. This is really helpful feedback. I will make some of the changes you suggested and see what happens. I also used -compress- before I ran the analyses and then the results transferred fine, for what that's worth. But it sounds like the open option will really help reduce processing time, so I'm going to incorporate that for sure.
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 822
#5

23 Sep 2023, 15:41

No problem Zach, I'm sorry I couldn't help you find the source of the original bug. When I got your message earlier today, I went ahead and expanded the dataset to include 4000 additional variables with random double precision floating point numbers in Stata 16. Still couldn't reproduce the issue.

I'm surprised -compress- fixed the issue. If you run out of memory, Stata should just use virtual memory on the disk. That'll slow things down substantially, but shouldn't randomly prevent data from being written to a file without any kind of error message. There's actually some documentation on this you might find interesting here and here. If there is a bug, and it's related to how much memory is allocated, I'd expect there is a segmentation fault that should cause Stata to crash, or Stata should recognize there isn't sufficient memory allocated and the error should get thrown up the call stack as an error message.

Very strange. Glad you managed to fix the issue though.
Comment

Announcement