Memory Issue with User-Written Tuples Command

Michael Gropper

Join Date: Mar 2016

Posts: 5
#1

09 Mar 2016, 15:11

Dear Statalist,

I am using the user-written "tuples" command to loop through all possible pairs of variables in a dataset and am encountering what seems to be a memory issue. The dataset is quite wide: 1,204 variables in all, although it only has 1,650 observations. I am using the tuples command to generate all possible pairs for 1,203 of these variables and store them in local macros. The 1,203 "variables of interest" are all 10 characters in length, meaning the tuples command should return macros of length 21 (the names of each variable in the pair plus a space separating them). In total, tuples should return 723,003 macros (1,203 choose 2).

I am well aware that this is quite a large number of macros for Stata to store. However, this older Statalist response indicates that Stata should not be limited by the number of macros at issue here. That being said, I get the following error after executing the "tuples" command in the following code:

Code:

* Bring all variables of interest into r(varlist)
ds date, not
* Now calculate all possible pairs of prices
quietly tuples `r(varlist)', min(2) max(2) varlist

#: 3900 unable to allocate real <tmp>[1203,1447209]
tuples(): - function returned error
<istmt>: - function returned error

For some context, these 1,203 variables are price series, whereas the other variables hold the date. My goal is to use the macros stored by the tuples command in order to loop over variable pairs and calculate betas from rolling regressions of the price series on time and compare the estimated series of betas between the two products. Based on my Googling, it seems that this is a memory issue, but that is a little surprising to me. I am running Stata/MP 13.1 on a network machine with 370GB of disk space and 8 GB of RAM. Any assistance/advice with this issue would be appreciated.

-Best,
Michael G
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#2

09 Mar 2016, 15:51

You need more memory. For a matrix <tmp>[1203,1447209], (1203 row by 1447209 columns), you need 1203*1447209*8/1024/1024/1024 ~ 13 GB memory. Your machine only has 8GB RAM, which is far from enough considering you might also have a sizable dataset in memory.

I would suggest trying a machine with at least 32 GB of memory.
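
For readers who want to check the arithmetic above, the same numbers can be reproduced with Stata's display command. This is only an illustrative sketch: it assumes 8 bytes per matrix element, as in the calculation above, and uses the built-in comb() function to count the pairs mentioned in the original post.

```stata
* Size in GiB of a 1203 x 1447209 matrix of 8-byte reals
display %6.2f 1203*1447209*8/1024^3

* The column count in the error message is 1203 squared
display 1203^2

* Number of unordered pairs among 1,203 variables (1,203 choose 2)
display comb(1203, 2)
```

The first line should show a value just under 13, matching the estimate of roughly 13 GB, which already exceeds the 8 GB of RAM before the dataset itself is counted.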
Michael Gropper

Join Date: Mar 2016

Posts: 5
#3

09 Mar 2016, 16:05

Hi Hua,

Thank you for your response. I will identify a machine with more RAM.

Just so I understand: from the error message (and your reply), it seems that the tuples command generates a matrix and then places elements of that matrix into local macros, and that the initial creation of the matrix is what causes the memory issue described above. Is that accurate?

-Best,
Michael G
daniel klein

Join Date: Mar 2014

Posts: 3842
#4

09 Mar 2016, 16:08

The original tuples command by Nick Cox was implemented in terms of nested forvalues loops in Stata. This turned out to be pretty slow for a "larger" list of elements (think n > 10) and the code was later moved to Mata by Joseph Luchman in order to create the desired combinations much faster. These improvements are based on creating an indicator matrix of (roughly) size 2^n [Edit: This matrix holds all possible binary combinations and is used to select the desired tuples and later fill the macros, indeed]. This is the memory limit that bites here and Hua Peng has kindly done the math to demonstrate the amount of memory that would be needed for this.

As an alternative to the above suggestion, you can specify the nomata option with tuples and hope that Stata's limit on the number of locals is indeed large enough. If so, I believe this could work. However, be prepared to wait for a long (looong) time for this to finish.

I hope someone comes up with a better approach to achieve what you ultimately want to do.

Best
Daniel

Last edited by daniel klein; 09 Mar 2016, 16:11.
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#5

09 Mar 2016, 16:53

Michael, yes, you are right that the initial creation of the matrix is causing the memory issue and Daniel gave an excellent explanation of the situation.

William Lisowski

Join Date: Dec 2014
Posts: 10150
#6

09 Mar 2016, 17:02

Am I missing something? If the ultimate objective is to loop over all possible pairs of the variables of interest, why the tuples? Will the following not do the job?

Code:

* Bring a list of all variables of interest into r(varlist)
ds date, not
* Save the list into a local macro
local prices `r(varlist)'
* Now do something for all possible pairs of prices
foreach v1 of local prices {
    foreach v2 of local prices {
        // commands using `v1' and `v2' here
    }
}


Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#7

09 Mar 2016, 17:20

You are right. I misread Michael's original post. If all Michael wants is all the pairs, he should not use -tuples-, which produces all combinations.

Michael Gropper

Join Date: Mar 2016
Posts: 5
#8

09 Mar 2016, 18:01

Originally posted by William Lisowski

Am I missing something? If the ultimate objective is to loop over all possible pairs of the variables of interest, why the tuples? Will the following not do the job?

Code:

* Bring a list of all variables of interest into r(varlist)
ds date, not
* Save the list into a local macro
local prices `r(varlist)'
* Now do something for all possible pairs of prices
foreach v1 of local prices {
    foreach v2 of local prices {
        // commands using `v1' and `v2' here
    }
}

William,

Now that I think about it, I believe you are correct: I could use a loop within a loop as you suggest. But I think that would involve performing these calculations twice (and comparing price1 with itself)? As I mentioned, I am interested in collecting the betas from a series of rolling regressions of these price series. What I'm most interested in is whether the sign of the estimated coefficient on the date variable for the regression of price1 matches the sign of the same rolling regression for price2 for the same estimation window. If I'm not mistaken, your suggestion would (at least in the current implementation of my code) involve making the same comparison twice (i.e., I would compare the betas for price1 to price2 in the first iteration, and then compare the betas for price2 with price1 in the second iteration).

I guess my question boils down to whether the initial computational fixed cost of generating unique pairs via -tuples- is worth it to avoid "double counting" what I really want later. Given what Hua and Daniel said about the matrix size, it seems like I might be better off doing a loop within a loop as you suggest. I note that I am looping through all possible estimation windows (or at least, all possible windows of at least size 3). See additional code below.

Code:

* Now calculate all possible pairs of products.
quietly tuples `r(varlist)', min(2) max(2) varlist

local max_pair = `ntuples'
* Note that tuples stores the number of combinations in `ntuples'

* Set up a postfile and loop through all possible combinations
local n = 1
tempfile comparison
postfile handle str50 series_1 str50 series_2 long window_length long total_windows long opposite_sign_count double opp_sign_b1 double opp_sign_b2 using `comparison'
forval i = 1/`max_pair' {
    * Display a tracker
    di "Now on comparison `n' of `max_pair'"
    * Collect the constituents of the pair and store them in local macros
    local var1: word 1 of `tuple`i''
    local var2: word 2 of `tuple`i''

    * Count days where both price series are non-missing.
    quietly count if (`var1' != . & `var2' != .)
    local total_days = `r(N)'
    if `total_days' >= 30 {
        forval days = 3/`total_days' {
            di "Now on window `days' of `total_days'"
            quietly {
                preserve
                    * For each product pair, we want to estimate rolling betas of all
                    * possible time lengths for regressions. Then, for each estimation
                    * window, we compare whether the betas are of opposite sign for each
                    * of the products. Take advantage of the fact that for SLR, covariance
                    * divided by variance is equal to the estimated coefficient.
                    * Saves computation time.
                    cap rolling beta=(r(cov_12)/r(Var_2)), window(`days') clear nodots: corr `var1' date if `var1' != . & `var2' != ., covariance
                    generate window = `days'
                    generate product1 = "`var1'"
                    rename beta beta_`var1'

                    tempfile 1
                    save `1'
                restore
                preserve
                    cap rolling beta=(r(cov_12)/r(Var_2)), window(`days') clear nodots: corr `var2' date if `var1' != . & `var2' != ., covariance
                    generate product2 = "`var2'"
                    rename beta beta_`var2'

                    * Now we merge estimation results
                    quietly merge 1:1 start end using `1', nogen

                    * Create an indicator variable denoting whenever the betas are of opposite signs
                    quietly generate opp_sign = sign(beta_`var1') != sign(beta_`var2')
                    sum opp_sign, meanonly

                    * Store the count of windows that the betas are of opposite sign.
                    local opp_window_count = `r(sum)'

                    * Also would be helpful to record the averages betas when they are of opposite sign.
                    if `opp_window_count' > 0 {
                        sum beta_`var1' if opp_sign == 1, meanonly
                        local b1_opp_sign = `r(mean)'

                        sum beta_`var2' if opp_sign == 1, meanonly
                        local b2_opp_sign = `r(mean)'

                        * Record the total number of regression windows.
                        count

                        * Post these results into the postfile
                        post handle ("`var1'") ("`var2'") (`days') (`r(N)') (`opp_window_count')  (`b1_opp_sign') (`b2_opp_sign')
                    }

                    else if `opp_window_count' == 0 {
                        * Record the total number of regression windows.
                        count

                        * Post these results into the postfile
                        post handle ("`var1'") ("`var2'") (`days') (`r(N)') (`opp_window_count')  (.) (.)
                    }

                restore
            }
        }
    }
    else if `total_days' < 3 {
    }

    local n = `n' + 1
}
postclose handle

Apologies for awkward spacing of the code. And yes, I am all too acutely aware of how long this process is likely to take.

-Michael
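
The shortcut used in the code above rests on the fact that, in a simple linear regression, the slope equals the covariance of the two variables divided by the variance of the regressor. A minimal sketch of that check follows. It assumes the auto dataset shipped with Stata instead of the price data, so price and weight are only stand-ins for a price series and the date variable, and b_cov is an illustrative macro name.

```stata
sysuse auto, clear

* Slope from the covariance matrix: cov(price, weight) / var(weight)
quietly correlate price weight, covariance
local b_cov = r(cov_12)/r(Var_2)

* Slope from an ordinary regression of price on weight
quietly regress price weight

display "slope from cov/var: " `b_cov'
display "slope from regress: " _b[weight]
```

The two displayed slopes should agree up to rounding, which is why the rolling call can use correlate in place of regress.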


Clyde Schechter

Join Date: Apr 2014

Posts: 30061
#9

09 Mar 2016, 18:18

You can eliminate the double-counting from the nested-loop approach with a simple condition:

Code:

ds date, not
local prices `r(varlist)'
foreach v1 of local prices {
    foreach v2 of local prices {
        if "`v1'" > "`v2'" {
            // do your thing with `v1' and `v2' here
        }
    }
}

Note: This will not pair any given v1 with itself. Given what you describe about your goal, pairing v1 with itself would be unnecessary because the answer is obviously positive.

Last edited by Clyde Schechter; 09 Mar 2016, 18:19. Reason: Add bold face to changes from code posted earlier by William Lisowski

William Lisowski

Join Date: Dec 2014
Posts: 10150

#10

09 Mar 2016, 18:37

To avoid two copies of each pair, just select the copy where the first variable's name is less than the second variable's name, lexicographically. The second local command is technically unneeded, but causes the variable pairs to appear in a sensible order, rather than in the order of appearance in the dataset.

Code:

sysuse auto, clear
* Bring a list of all numeric variables into r(varlist)
ds, has(type numeric)
* Save the list into a local macro
local vars `r(varlist)'
local vars : list sort vars
* Now do something for all possible pairs
foreach v1 of local vars {
    foreach v2 of local vars {
        * compare the variable NAMES not their values
        if "`v1'" < "`v2'" {
            display "`v1' and `v2'"
        }
    }
}

Code:

displacement and foreign
displacement and gear_ratio
displacement and headroom
displacement and length
displacement and mpg
displacement and price
displacement and rep78
displacement and trunk
displacement and turn
displacement and weight
foreign and gear_ratio
foreign and headroom
foreign and length
foreign and mpg
foreign and price
foreign and rep78
foreign and trunk
foreign and turn
foreign and weight
gear_ratio and headroom
gear_ratio and length
gear_ratio and mpg
gear_ratio and price
gear_ratio and rep78
gear_ratio and trunk
gear_ratio and turn
gear_ratio and weight
headroom and length
headroom and mpg
headroom and price
headroom and rep78
headroom and trunk
headroom and turn
headroom and weight
length and mpg
length and price
length and rep78
length and trunk
length and turn
length and weight
mpg and price
mpg and rep78
mpg and trunk
mpg and turn
mpg and weight
price and rep78
price and trunk
price and turn
price and weight
rep78 and trunk
rep78 and turn
rep78 and weight
trunk and turn
trunk and weight
turn and weight
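
An alternative sketch, not taken from the posts above: looping over word positions instead of names visits each unordered pair exactly once, so the inner loop never runs for pairs that would otherwise be skipped by the if condition. The macro names k, i, j, and npairs are illustrative choices, and the auto dataset again stands in for the price data.

```stata
sysuse auto, clear
* Bring a list of all numeric variables into r(varlist) and sort it
ds, has(type numeric)
local vars `r(varlist)'
local vars : list sort vars

* Number of variables in the list
local k : word count `vars'
local npairs = 0

* Pair the i-th variable only with variables that come after it in the list
forvalues i = 1/`=`k' - 1' {
    local v1 : word `i' of `vars'
    forvalues j = `=`i' + 1'/`k' {
        local v2 : word `j' of `vars'
        display "`v1' and `v2'"
        local ++npairs
    }
}
display "`npairs' pairs"
```

With the sorted list, this should print the same pairs in the same order as the listing above.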


Michael Gropper

Join Date: Mar 2016

Posts: 5
#11

09 Mar 2016, 20:12

Ah. I see now. I was unaware of this functionality. Thank you both Clyde and William. This is quite helpful.

If you'll permit me one last question, I was under the impression that when using an if command as in the examples above, Stata would compare the values for the two variables in the first observation of the dataset and then execute the following commands. As William's commented code notes, Stata is comparing the names rather than the values. Why is this the case in this code?

-Best,
Michael Gropper
Clyde Schechter

Join Date: Apr 2014

Posts: 30061
#12

09 Mar 2016, 21:57

I was under the impression that when using an if command as in the examples above, Stata would compare the values for the two variables in the first observation of the dataset and then execute the following commands. As William's commented code notes, Stata is comparing the names rather than the values. Why is this the case in this code?

Code:

// BECAUSE IT'S
if "`v1'" > "`v2'" { // COMPARE THE NAMES OF THE VARIABLES
// NOT
if `v1' > `v2' { // COMPARE VALUES OF VARIABLES IN OBSERVATION 1
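
To see the difference in practice, here is a small sketch. The auto dataset and the variable names weight and price are chosen purely for illustration and are not part of the original problem.

```stata
sysuse auto, clear
local v1 weight
local v2 price

* With quotes: the macros expand to the strings "weight" and "price",
* so this compares the variable NAMES alphabetically
display "`v1'" > "`v2'"

* Without quotes: the macros expand to variable names, and Stata
* evaluates them using the VALUES in the first observation
display `v1' > `v2'

* Equivalent to the previous line, with the observation made explicit
display `v1'[1] > `v2'[1]
```

The quoted comparison depends only on the spelling of the names, whereas the unquoted one depends on whatever data happen to sit in observation 1, so the two can give different answers.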
Michael Gropper

Join Date: Mar 2016

Posts: 5
#13

10 Mar 2016, 06:51

Ah. Understood. Thank you both.
