  • Memory Issue with User-Written Tuples Command

    Dear Statalist,

    I am using the user-written "tuples" command to loop through all possible pairs of variables in a dataset and am encountering what seems to be a memory issue. The dataset is quite wide: 1,204 variables in all, although it only has 1,650 observations. I am using the tuples command to generate all possible pairs for 1,203 of these variables and store them in local macros. The 1,203 "variables of interest" are all 10 characters in length, meaning the tuples command should return macros of length 21 (the names of each variable in the pair plus a space separating them). In total, tuples should return 723,003 macros (1,203 choose 2).

    I am well aware that this is quite a large number of macros for Stata to store. However, this older Statalist response indicates that Stata should not be limited by the number of macros at issue here. That being said, I get the following error after executing the "tuples" command in the following code:

    Code:
    * Bring all variables of interest into r(varlist)
    ds date, not
    
    * Now calculate all possible pairs of prices
    quietly tuples `r(varlist)', min(2) max(2) varlist
    
    
     #:  3900  unable to allocate real <tmp>[1203,1447209]
                    tuples():     -  function returned error
                     <istmt>:     -  function returned error

    For some context, these 1,203 variables are price series, whereas the remaining variable holds the date. My goal is to use the macros stored by the tuples command to loop over variable pairs, calculate betas from rolling regressions of each price series on time, and compare the estimated series of betas between the two products. Based on my Googling, it seems that this is a memory issue, but that is a little surprising to me. I am running Stata/MP 13.1 on a network machine with 370 GB of disk space and 8 GB of RAM. Any assistance/advice with this issue would be appreciated.

    -Best,
    Michael G

  • #2
    You need more memory. For a matrix <tmp>[1203,1447209] (1203 rows by 1447209 columns), you need 1203*1447209*8/1024/1024/1024 ~ 13 GB of memory. Your machine only has 8 GB of RAM, which is far from enough considering you might also have a sizable dataset in memory.

    I would suggest trying this on a machine with at least 32 GB of memory.
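
    The arithmetic can be checked in Stata itself. This is a minimal sketch: the dimensions are taken from the error message, and the 8 bytes per element assumes the matrix is stored as doubles, as Mata real matrices are. The local names rows and cols are introduced here only for illustration.

    ```stata
    * Dimensions reported in the error: <tmp>[1203,1447209]
    local rows = 1203
    local cols = 1447209

    * The column count equals the number of variables squared
    display `rows'^2

    * Memory needed at 8 bytes per element, expressed in GB (1024^3 bytes)
    display %6.2f `rows' * `cols' * 8 / 1024^3
    ```

    The second display comes out a little under 13, which is already well beyond 8 GB of RAM before the dataset itself is counted.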


    • #3
      Hi Hua,

      Thank you for your response. I will identify a machine with more RAM.

      Just so I understand: it seems from the error message (and your reply) that the tuples command generates a matrix and then places elements of that matrix into local macros. The initial creation of the matrix is causing the memory issue described above. Is that accurate?

      -Best,
      Michael G


      • #4
        The original tuples command by Nick Cox was implemented in terms of nested forvalues loops in Stata. This turned out to be pretty slow for a "larger" list of elements (think n > 10) and the code was later moved to Mata by Joseph Luchman in order to create the desired combinations much faster. These improvements are based on creating an indicator matrix of (roughly) size 2^n [Edit: This matrix holds all possible binary combinations and is used to select the desired tuples and later fill the macros, indeed]. This is the memory limit that bites here and Hua Peng has kindly done the math to demonstrate the amount of memory that would be needed for this.

        As an alternative to the above suggestion, you can specify the nomata option with tuples and hope that the number of locals Stata can hold is indeed large enough. If so, I believe this could work. However, be prepared to wait for a long (looong) time for this to finish.
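
        For reference, a minimal sketch of such a call on a small example. The auto dataset and the three variables are only illustrative, and this assumes the installed version of tuples accepts nomata as described above. It relies on the `ntuples' and `tuple#' locals that tuples leaves behind.

        ```stata
        * ssc install tuples    // if tuples is not yet installed
        sysuse auto, clear

        * Request pairs only, using the slower non-Mata code path
        tuples price mpg weight, min(2) max(2) nomata

        * List each pair that was stored in a local macro
        forvalues i = 1/`ntuples' {
            display "tuple `i': `tuple`i''"
        }
        ```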

        I hope someone comes up with a better approach to achieve what you ultimately want to.

        Best
        Daniel
        Last edited by daniel klein; 09 Mar 2016, 16:11.


        • #5
          Michael, yes, you are right that the initial creation of the matrix is causing the memory issue and Daniel gave an excellent explanation of the situation.


          • #6
            Am I missing something? If the ultimate objective is to loop over all possible pairs of the variables of interest, why the tuples? Will the following not do the job?
            Code:
            * Bring a list of all variables of interest into r(varlist)
            ds date, not
            * Save the list into a local macro
            local prices `r(varlist)'
            * Now do something for all possible pairs of prices
            foreach v1 of local prices {
                foreach v2 of local prices {
                    // commands using `v1' and `v2' here
                }
            }


            • #7
              You are right. I misread Michael's original post. If all Michael wants is all the pairs, he should not use -tuples- which produces all combinations.


              • #8
                Originally posted by William Lisowski
                Am I missing something? If the ultimate objective is to loop over all possible pairs of the variables of interest, why the tuples? Will the following not do the job?
                Code:
                * Bring a list of all variables of interest into r(varlist)
                ds date, not
                * Save the list into a local macro
                local prices `r(varlist)'
                * Now do something for all possible pairs of prices
                foreach v1 of local prices {
                    foreach v2 of local prices {
                        // commands using `v1' and `v2' here
                    }
                }

                William,

                Now that I think about it, I believe you are correct: I could use a loop within a loop as you suggest. But I think that would involve performing these calculations twice (and comparing price1 with itself)? As I mentioned, I am interested in collecting the betas from a series of rolling regressions of these price series. What I'm most interested in is whether the sign of the estimated coefficient on the date variable for the regression of price1 matches the sign of the same rolling regression for price2 for the same estimation window. If I'm not mistaken, your suggestion would (at least in the current implementation of my code) involve making the same comparison twice (i.e. I would compare the betas for price1 to price2 in the first iteration, and then compare the betas for price2 with price1 in the second iteration).

                I guess my question boils down to whether the initial computational fixed cost of generating unique pairs via -tuples- is worth avoiding the "double counting" later. Given what Hua and Daniel said about the matrix size, it seems like I might be better off doing a loop within a loop as you suggest. I note that I am looping through all possible estimation windows (or at least, all possible windows of at least size 3). See additional code below.

                Code:
                        * Now calculate all possible pairs of products.
                        quietly tuples `r(varlist)', min(2) max(2) varlist
                        
                        local max_pair = `ntuples'
                        * Note that tuples stores the number of combinations in `ntuples'
                        
                        * Set up a postfile and loop through all possible combinations
                            local n = 1
                            tempfile comparison
                            postfile handle str50 series_1 str50 series_2 long window_length long total_windows long opposite_sign_count double opp_sign_b1 double opp_sign_b2 using `comparison'
                            forval i = 1/`max_pair' {
                                * Display a tracker
                                    di "Now on comparison `n' of `max_pair'"
                                * Collect the constituents of the pair and store them in local macros
                                    local var1: word 1 of `tuple`i''
                                    local var2: word 2 of `tuple`i''
                
                                * Count days where both price series are non-missing.
                                        quietly count if (`var1' != . & `var2' != .)
                                        local total_days = `r(N)'
                                            if `total_days' >= 30 {
                                                forval days = 3/`total_days' {
                                                di "Now on window `days' of `total_days'"
                                            quietly {
                                                preserve
                                                    * For each product pair, we want to estimate rolling betas of all possible time lengths for regressions. Then, for each estimation window, we compare whether the betas are of opposite sign for each of the products. Take advantage of the fact that for SLR, covariance divided by variance is equal to the estimated coefficient. Saves computation time.
                                                    cap rolling beta=(r(cov_12)/r(Var_2)), window(`days') clear nodots: corr `var1' date if `var1' != . & `var2' != ., covariance
                                                    generate window = `days'
                                                    generate product1 = "`var1'"
                                                    rename beta beta_`var1'
                
                                                    tempfile 1
                                                    save `1'
                                                restore
                                                preserve
                                                    cap rolling beta=(r(cov_12)/r(Var_2)), window(`days') clear nodots: corr `var2' date if `var1' != . & `var2' != ., covariance
                                                    generate product2 = "`var2'"
                                                    rename beta beta_`var2'
                
                                                    
                                                * Now we merge estimation results
                                                    quietly merge 1:1 start end using `1', nogen
                                                    
                                                * Create an indicator variable denoting whenever the betas are of opposite signs
                                                    quietly generate opp_sign = sign(beta_`var1') != sign(beta_`var2')
                                                    sum opp_sign, meanonly
                                                    
                                                * Store the count of windows that the betas are of opposite sign.
                                                    local opp_window_count = `r(sum)'
                                                    
                                                * Also would be helpful to record the averages betas when they are of opposite sign.
                                                if `opp_window_count' > 0 {
                                                    sum beta_`var1' if opp_sign == 1, meanonly
                                                    local b1_opp_sign = `r(mean)'
                
                                                    sum beta_`var2' if opp_sign == 1, meanonly
                                                    local b2_opp_sign = `r(mean)'        
                                                
                                                
                                                * Record the total number of regression windows.
                                                    count
                                                
                                                * Post these results into the postfile
                                                    post handle ("`var1'") ("`var2'") (`days') (`r(N)') (`opp_window_count')  (`b1_opp_sign') (`b2_opp_sign')
                                                    
                                                }
                                                
                                                else if `opp_window_count' == 0 {
                                                * Record the total number of regression windows.
                                                    count
                                                
                                                * Post these results into the postfile
                                                    post handle ("`var1'") ("`var2'") (`days') (`r(N)') (`opp_window_count')  (.) (.)
                                                }
                                                
                                                restore
                                                }
                                            }
                                        }
                                        else if `total_days' < 3 {
                                        }
                                        
                                local n = `n' + 1
                            }
                            postclose handle
                Apologies for awkward spacing of the code. And yes, I am all too acutely aware of how long this process is likely to take.

                -Michael


                • #9
                  You can eliminate the double-counting from the nested-loop approach with a simple condition:

                  Code:
                  ds date, not
                  local prices `r(varlist)'
                  
                  foreach v1 of local prices {
                      foreach v2 of local prices {
                          if "`v1'" > "`v2'" {
                              // do your thing with `v1' and `v2' here
                          }
                      }
                  }
                  Note: This will not pair any given v1 with itself. Given what you describe about your goal, pairing v1 with itself would be unnecessary because the answer is obviously positive.
                  Last edited by Clyde Schechter; 09 Mar 2016, 18:19. Reason: Add bold face to changes from code posted earlier by William Lisowski


                  • #10
                    To avoid two copies of each pair, just select the copy where the first variable's name is less than the second variable's name, lexicographically. The second local command is technically unneeded, but causes the variable pairs to appear in a sensible order, rather than in the order of appearance in the dataset.

                    Code:
                    sysuse auto, clear
                    * Bring a list of all numeric variables into r(varlist)
                    ds, has(type numeric)
                    * Save the list into a local macro
                    local vars `r(varlist)'
                    local vars : list sort vars
                    * Now do something for all possible pairs
                    foreach v1 of local vars {
                        foreach v2 of local vars {
                            * compare the variable NAMES not their values
                            if "`v1'" < "`v2'" {
                                display "`v1' and `v2'"
                            }
                        }
                    }
                    Code:
                    displacement and foreign
                    displacement and gear_ratio
                    displacement and headroom
                    displacement and length
                    displacement and mpg
                    displacement and price
                    displacement and rep78
                    displacement and trunk
                    displacement and turn
                    displacement and weight
                    foreign and gear_ratio
                    foreign and headroom
                    foreign and length
                    foreign and mpg
                    foreign and price
                    foreign and rep78
                    foreign and trunk
                    foreign and turn
                    foreign and weight
                    gear_ratio and headroom
                    gear_ratio and length
                    gear_ratio and mpg
                    gear_ratio and price
                    gear_ratio and rep78
                    gear_ratio and trunk
                    gear_ratio and turn
                    gear_ratio and weight
                    headroom and length
                    headroom and mpg
                    headroom and price
                    headroom and rep78
                    headroom and trunk
                    headroom and turn
                    headroom and weight
                    length and mpg
                    length and price
                    length and rep78
                    length and trunk
                    length and turn
                    length and weight
                    mpg and price
                    mpg and rep78
                    mpg and trunk
                    mpg and turn
                    mpg and weight
                    price and rep78
                    price and trunk
                    price and turn
                    price and weight
                    rep78 and trunk
                    rep78 and turn
                    rep78 and weight
                    trunk and turn
                    trunk and weight
                    turn and weight
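
                    As an alternative sketch (not taken from the posts above), each unordered pair can be visited exactly once by looping over word positions in the list, so that no name comparison is needed and the inner loop never runs for the discarded half. The locals k, i, j, and npairs are names introduced here for illustration.

                    ```stata
                    sysuse auto, clear
                    ds, has(type numeric)
                    local vars `r(varlist)'
                    local vars : list sort vars

                    * Number of variables in the list
                    local k : word count `vars'

                    local npairs = 0
                    forvalues i = 1/`=`k'-1' {
                        local v1 : word `i' of `vars'
                        * Start the inner loop one position past the outer one
                        forvalues j = `=`i'+1'/`k' {
                            local v2 : word `j' of `vars'
                            display "`v1' and `v2'"
                            local ++npairs
                        }
                    }

                    * The pair count should agree with k choose 2
                    display "`npairs' pairs visited; comb(`k', 2) = " comb(`k', 2)
                    ```

                    With the 11 numeric variables in the auto dataset this visits the same 55 pairs listed above. For 1,203 variables it would run the loop body 723,003 times instead of testing all 1,447,209 ordered combinations.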


                    • #11
                      Ah. I see now. I was unaware of this functionality. Thank you both Clyde and William. This is quite helpful.

                      If you'll permit me one last question, I was under the impression that when using an if command as in the examples above, Stata would compare the values for the two variables in the first observation of the dataset and then execute the following commands. As William's commented code notes, Stata is comparing the names rather than the values. Why is this the case in this code?

                      -Best,
                      Michael Gropper


                      • #12
                        I was under the impression that when using an if command as in the examples above, Stata would compare the values for the two variables in the first observation of the dataset and then execute the following commands. As William's commented code notes, Stata is comparing the names rather than the values. Why is this the case in this code?
                        Code:
                        // BECAUSE IT'S
                        if "`v1'" > "`v2'" { // COMPARE THE NAMES OF THE VARIABLES
                        
                        //  NOT
                        if `v1' > `v2' { // COMPARE VALUES OF VARIABLES IN OBSERVATION 1
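
                        A small runnable sketch of the difference, using the auto dataset purely as an illustration (the choice of weight and price is an assumption made for the example). With the quotes, the macros expand to string literals and the names are compared alphabetically. Without them, the macros expand to bare variable names, and the if command evaluates them in the first observation.

                        ```stata
                        sysuse auto, clear
                        local v1 weight
                        local v2 price

                        * Expands to: if "weight" > "price"  (a string comparison of the names)
                        if "`v1'" > "`v2'" {
                            display "the name `v1' sorts after the name `v2'"
                        }

                        * Expands to: if weight > price  (evaluated as weight[1] > price[1])
                        if `v1' > `v2' {
                            display "`v1'[1] exceeds `v2'[1]"
                        }
                        else {
                            display "`v1'[1] = " `v1'[1] " does not exceed `v2'[1] = " `v2'[1]
                        }
                        ```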


                        • #13
                          Ah. Understood. Thank you both.
