  • Fun Stata fact: You can use scalars instead of locals in while loops, without speed losses in execution.

    Good afternoon,

    I was toying around today with measuring the speed of different forms of loops. Then I realised that if I go for the -while- style of loop, I can do everything that is done with locals, but using only scalars instead of locals.

    I think this Stata feature is not documented anywhere in the Stata literature, and I have never seen it applied anywhere. As far as I can see, this approach does not seem to have any disadvantages, and it has pedagogical value. In the beginning I had a headache understanding what this thing in Stata called a "local" is, and later I had a headache explaining it to students when I was teaching Stata to others. Mostly not because a local is a complicated thing, but rather because StataCorp invented it, and it is alien when you hear of it for the first time. On the other hand, everybody knows what a scalar is, so there is no need for much explaining here.

    Here is a simple example: we calculate the sum of the observations in a variable, while pretending we do not know that -gen, sum()-, -egen, mean()- and -summarize- exist.

    Code:
    clear 
    
    set obs 1000000
    
    gen x = rnormal()
    
    timer clear
    
    timer on 1 // Standard Forvalues loop. 
    sca Ans = 0
    qui forvalues i = 1/`=_N' {
    sca Ans = Ans + x[`i']
    }
    dis Ans
    timer off 1
    
    timer on 2 // Standard While loop but with decrementation. 
    sca Ans = 0
    local i = _N
    qui while `i'>0 {
    sca Ans = Ans + x[`i']
    local --i
    }
    dis Ans
    timer off 2
    
    timer on 3 // Standard While loop but with incrementation.
    sca Ans = 0
    local i = 1
    qui while `i'<=`=_N' {
    sca Ans = Ans + x[`i']
    local ++i
    }
    dis Ans
    timer off 3
    
    timer on 4 // While loop using only scalars, decrementation. 
    sca Ans = 0
    sca I = _N
    qui while I>0 {
    sca Ans = Ans + x[I]
    sca I = I - 1
    }
    dis Ans
    timer off 4
    
    timer on 5 // While loop using only scalars, incrementation. 
    sca Ans = 0
    sca I = 1
    sca N = _N
    qui while I<=N {
    sca Ans = Ans + x[I]
    sca I = I +1
    }
    dis Ans
    timer off 5
    
    timer list
    
    * They all give the same answer and the timings are:
    . timer list
       1:      5.79 /        1 =       5.7880
       2:      7.97 /        1 =       7.9660
       3:      9.60 /        1 =       9.6040
       4:      8.21 /        1 =       8.2060
       5:      8.71 /        1 =       8.7050
    1) As expected the -forvalues- loop is the fastest.

    2) Unexpectedly, the decrementation -while- loops are somewhat faster than the incrementation -while- loops. Does anybody know why that is?

    3) The scalar loops seem to be doing fine. There is no loss of speed of execution from moving from -while- with locals to -while- with scalars.

    Last edited by Joro Kolev; 16 May 2021, 05:50.

  • #2
    Originally posted by Joro Kolev
    [...]
    I realised that if I go for the -while- style of loop, I can do everything that is done with locals, but using only scalars instead of locals.

    I think this Stata feature is not documented anywhere in the Stata literature, and I have never seen it applied anywhere. As far as I can see this approach does not seem to have any disadvantages
    The one (serious) disadvantage with scalars is that they are global in scope and you might accidentally overwrite stuff. You can get (at least halfway) around that by using temporary names; but then there is little gained when compared to the local approach.

    Edit:

    Almost forgot. There is another serious issue. Scalars are not only global in nature but share the same name space with variables. It is all too easy to run into an endless loop:

    Code:
    // !! will produce an endless loop; use -break- Key to interrupt
    sysuse auto
    
    scalar mpg = 10
    
    while (mpg > 0) {
        display "endless loop"
        scalar mpg = mpg - 1
    }
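
    A minimal sketch of one way out, assuming the auto data are loaded as above: wrapping every reference in the pseudofunction -scalar()- forces the scalar interpretation, so the loop below should stop after ten iterations.

    Code:
    sysuse auto, clear
    
    scalar mpg = 10
    
    // scalar() makes Stata look up the scalar, not the variable mpg
    while (scalar(mpg) > 0) {
        display "iteration with scalar mpg = " scalar(mpg)
        scalar mpg = scalar(mpg) - 1
    }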

    Originally posted by Joro Kolev
    I had headache getting in the beginning what is this thing in Stata called a "local", then I had headache explaining to students what is this thing called "local" when I was teaching Stata to others.
    When I explain locals to someone who has no programming experience whatsoever, I just say that a local is a placeholder. For someone with programming experience, I tell them that it is similar to a (transmorphic) scalar, i.e., it is a scalar for which we omit the type declaration (as seems to be quite common in other programming languages). The difference between local and global scope and, more generally, namespaces, is often best explained by examples.
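
    A small sketch of such an example (the program name scopedemo and the names inside and Inside are made up for illustration): a local defined inside a program disappears when the program ends, whereas a scalar defined there survives.

    Code:
    capture program drop scopedemo
    program define scopedemo
        local inside = 1      // visible only while scopedemo runs
        scalar Inside = 1     // global in scope: stays after the program ends
    end
    
    scopedemo
    display "local: [`inside']"          // the local is undefined here, so the brackets are empty
    display "scalar: " scalar(Inside)    // the scalar is still defined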


    Originally posted by Joro Kolev
    2) Unexpectedly, the decrementation -while- loops are somewhat faster than the incrementation -while- loops. Does anybody know why is that?
    Sergio Correia has a great collection of Miscellaneous Mata Tips, in which he points to this for an explanation.


    Last edited by daniel klein; 16 May 2021, 07:28.



    • #3
      Just a note, I wouldn't trust these timings too much, as they are taken from only one run and can also depend on hardware / OS. For me the forvalues loop is actually the slowest, but the differences among the 5 ways are very small.
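
      A minimal sketch of how the timings could be averaged over several runs (the 10 repetitions and the smaller number of observations are arbitrary choices to keep the run short): -timer- accumulates time across repeated on/off calls, and -timer list- then reports the total, the number of runs and the average.

      Code:
      clear
      set obs 100000
      gen x = rnormal()
      
      timer clear
      forvalues rep = 1/10 {
          timer on 1              // timer 1 accumulates over the repetitions
          scalar Ans = 0
          quietly forvalues i = 1/`=_N' {
              scalar Ans = Ans + x[`i']
          }
          timer off 1
      }
      timer list                  // total time / number of runs = average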
      Last edited by Wouter Wakker; 16 May 2021, 09:15.



      • #4
        The difference between the -forvalues- and the standard incrementation -while- is substantial,
        Code:
        . dis 9.60/5.79
        1.6580311
        This is the outcome of another run, this time the laptop is unplugged:
        Code:
        . timer list
           1:      5.53 /        1 =       5.5260
           2:      8.05 /        1 =       8.0480
           3:      9.81 /        1 =       9.8060
           4:      8.43 /        1 =       8.4330
           5:      8.76 /        1 =       8.7550
        
        . dis  9.8060/5.5260
        1.7745204
        and the difference between -forvalues- and -while- is still substantial; the ordering is still the same. Of course there is variation from run to run, but for people who cannot tolerate random variation, theology is a more appropriate field of study and work than statistics :-).

        I am doing the runs on Stata 15.1, processor i5-8250U, Windows 10.

        Can you show the results of your run and the parameters of your system?


        Originally posted by Wouter Wakker
        Just a note, I wouldn't trust these timings too much as they are only taken from one run and can also depend on hardware / OS. For me the forvalues loop is actually the slowest, but the difference among the 5 ways are very small.



        • #5
          The standard forvalues loop is also the slowest for me.

          Code:
          . timer list
             1:     13.61 /        1 =      13.6110
             2:      2.87 /        1 =       2.8710
             3:      3.53 /        1 =       3.5310
             4:      3.18 /        1 =       3.1760
             5:      3.40 /        1 =       3.4040
          Code:
          . about
          
          Stata/SE 16.1 for Mac (Intel 64-bit)
          Revision 06 Apr 2021
          Copyright 1985-2019 StataCorp LLC
          
          Total physical memory: 64.00 GB



          • #6
            Code:
            . timer list
               1:     79.74 /       10 =       7.9738
               2:     39.25 /       10 =       3.9251
               3:     48.02 /       10 =       4.8016
               4:     40.14 /       10 =       4.0141
               5:     42.98 /       10 =       4.2979
            Stata/MP2 17.0 Revision 05 May 2021, MS Windows 10, Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz

            Code:
            . timer list
               1:     36.79 /       10 =       3.6792
               2:     59.56 /       10 =       5.9561
               3:     74.59 /       10 =       7.4595
               4:     62.81 /       10 =       6.2810
               5:     66.06 /       10 =       6.6061
            
            .
            end of do-file
            
            . about
            
            Stata/IC 16.0 for Unix (Linux 64-bit x86-64)
            Also, try:
            Code:
            * While loop with decrementation. 
            
            sca Ans = 0
            local i = _N
            
            qui while `i' {
               sca Ans = Ans + x[`i']
               local --i
            }
            Last edited by Bjarte Aagnes; 16 May 2021, 11:16. Reason: added while variant



            • #7
              Code:
                  Variable |         N      Mean
              -------------+--------------------
                        t1 |        10    3.0417
                        t2 |        10    3.3435
                        t3 |        10    4.0785
                        t4 |        10    3.5424
                        t5 |        10    3.6557
              ----------------------------------
              Stata/MP 17.0 for Windows (64-bit x86-64) Revision 05 May 2021, Windows 10 Pro, Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz.
              Results were nominally the same for Stata/MP 16.1 on the same system.



              • #8
                Thanks Joro, that's interesting to see. Find my stats below (Stata 16.1, Linux Mint 19.3, AMD Ryzen 5 3600X, 16GB RAM).

                Code:
                . timer list
                   1:      3.94 /        1 =       3.9360
                   2:      3.52 /        1 =       3.5240
                   3:      4.50 /        1 =       4.4960
                   4:      3.74 /        1 =       3.7390
                   5:      3.99 /        1 =       3.9930
                Best wishes

                (Stata 16.1 MP)



                • #9
                  I am well aware of the two facts you mention, Daniel. In particular, your "There is another serious issue" is the topic of my first refereed publication to appear in print, see
                  Kolev, Gueorgui I. "Stata tip 31: Scalar or variable? The problem of ambiguous names." The Stata Journal 6, no. 2 (2006): 279-280.

                  For every problem there are solutions. There are different tastes among people, and depending on your taste/distaste for constantly writing `...' and constantly reading `...' , you might or might not be willing to make the approach of using only scalars work.

                  I personally write mostly do-files for myself. In my do-files neither of the two pitfalls that you mention can happen, because I adhere to my own advice in Kolev (2006) and strictly follow a convention: my variables always start with a lowercase letter, and my scalars and matrices always start with a capital letter. You can see this in my code in #1, by the way. The chance of unintentionally overwriting a scalar in my do-file through a scalar loop is exactly the same as the chance of overwriting a scalar by defining a second scalar with the same name, or of overwriting a local by defining a conceptually different local with the same name. Here
                  Code:
                  . local l = 3
                  
                  . local l = 5
                  
                  . dis `l'
                  5
                  we just overwrote a local. If this is not what we intended, yes, there is a problem, but there is no other way out of this risk, except for us paying attention.

                  Originally posted by daniel klein

                  The one (serious) disadvantage with scalars is that they are global in scope and you might accidentally overwrite stuff. You can get (at least halfway) around that by using temporary names; but then there is little gained when compared to the local approach.

                  Edit:

                  Almost forgot. There is another serious issue. Scalars are not only global in nature but share the same name space with variables. It is all too easy to run into an endless loop:

                  Code:
                  // !! will produce an endless loop; use -break- Key to interrupt
                  sysuse auto
                  
                  scalar mpg = 10
                  
                  while (mpg > 0) {
                  display "endless loop"
                  scalar mpg = mpg - 1
                  }





                  • #10
                    My stats (Linux Ubuntu 18.04.5, Intel i7-9700 @ 3.00GHz x 8, 32GB RAM) with Stata 16.1:
                    Code:
                    . timer list
                       1:      3.90 /        1 =       3.9040
                       2:      3.81 /        1 =       3.8090
                       3:      4.77 /        1 =       4.7670
                       4:      3.99 /        1 =       3.9910
                       5:      4.26 /        1 =       4.2560
                    
                    . di r(t3)/r(t1)
                    1.2210553
                    ... with Stata 17.0:
                    Code:
                    . timer list
                       1:      4.17 /        1 =       4.1690
                       2:      3.87 /        1 =       3.8740
                       3:      4.82 /        1 =       4.8190
                       4:      4.09 /        1 =       4.0910
                       5:      4.34 /        1 =       4.3450
                    
                    . di r(t3)/r(t1)
                    1.1559127
                    However, as Daniel wrote in #2, scalars as used by Joro have global scope and share the same name space with variables. You could use -scalar()- and temporary names, but here you can run into problems as well (see my post "Temporary names for a scalar: Dangerous advice").

                    Thus, for a foolproof use of scalars I rewrote Joro's program using temporary names and -scalar()- when referring to scalars:
                    Code:
                    clear
                    set obs 1000000
                    
                    set seed 1
                    gen x = rnormal()
                    timer clear
                    tempname Ans I N
                    
                    qui {
                       timer on 1 // Standard Forvalues loop.
                       sca `Ans' = 0
                       qui forvalues i = 1/`=_N' {
                          sca `Ans' = scalar(`Ans') + x[`i']
                       }
                       dis scalar(`Ans')
                       timer off 1
                      
                       timer on 2 // Standard While loop but with decrementation.
                       sca `Ans' = 0
                       local i = _N
                       qui while `i'>0 {
                          sca `Ans' = scalar(`Ans') + x[`i']
                          local --i
                       }
                       dis scalar(`Ans')
                       timer off 2
                      
                       timer on 3 // Standard While loop but with incrementation.
                       sca `Ans' = 0
                       local i = 1
                       qui while `i'<=`=_N' {
                          sca `Ans' = scalar(`Ans') + x[`i']
                          local ++i
                       }
                       dis scalar(`Ans')
                       timer off 3
                      
                       timer on 4 // While loop using only scalars, decrementation.
                       sca `Ans' = 0
                       sca `I' = _N
                       qui while scalar(`I')>0 {
                          sca `Ans' = scalar(`Ans') + x[scalar(`I')]
                          sca `I' = scalar(`I') - 1
                       }
                       dis scalar(`Ans')
                       timer off 4
                      
                       timer on 5 // While loop using only scalars, incrementation.
                       sca `Ans' = 0
                       sca `I' = 1
                       sca `N' = _N
                       qui while scalar(`I')<=scalar(`N') {
                          sca `Ans' = scalar(`Ans') + x[scalar(`I')]
                          sca `I' = scalar(`I') +1
                       }
                       dis scalar(`Ans')
                       timer off 5
                    }
                    
                    timer list
                    di r(t3)/r(t1)
                    Then the time difference between the use of -while- instead of -forvalues- is not so large anymore. With Stata 16.1:
                    Code:
                    . timer list
                       1:      6.08 /        1 =       6.0750
                       2:      5.46 /        1 =       5.4600
                       3:      6.43 /        1 =       6.4340
                       4:     10.01 /        1 =      10.0090
                       5:     11.75 /        1 =      11.7490
                    
                    . di r(t3)/r(t1)
                    1.0590947
                    ... and with Stata 17.0:
                    Code:
                    . timer list
                       1:      5.77 /        1 =       5.7670
                       2:      5.52 /        1 =       5.5180
                       3:      6.49 /        1 =       6.4880
                       4:     10.38 /        1 =      10.3830
                       5:     12.10 /        1 =      12.1010
                    
                    . di r(t3)/r(t1)
                    1.1250217
                    Last edited by Dirk Enzmann; 16 May 2021, 14:34.



                    • #11
                      Joro,
                      The reason you see this applied in my code in the response to one of your other recent posts (link below) is that it was copied from one of my do-files. If I remember correctly, the reason for using a scalar instead of a local was that I could not find a way to pass a local argument from outside a program (most likely rangerun) to inside the program, or had difficulty changing the extension of a dummy variable (i.e. Dum1, Dum2, Dum3, etc.) to be used in a rolling regression while executing a rangerun program. There could be a way of passing locals from outside the program in a dynamic way, but for the life of me I could not figure out what that solution was. So I experimented with the scalar approach, and it delivered the results without any drawback. So, unless there is a way, or an easier way, of passing locals into and out of a program, using scalars might have an added value not mentioned so far in this thread.


                      https://www.statalist.org/forums/for...49#post1609249



                      • #12
                        Oscar Ozfidan You have correctly observed a limitation of -rangerun- (and also of -runby-). There is no direct way to pass a local macro from the calling program to the program called by -rangerun-/-runby-. Both -rangerun- and -runby- prohibit the program they call from taking arguments, which would be the normal way to pass information contained in a local macro.

                        I am aware of only two solutions. Scalars are one option. The other way is to create a new variable containing the contents of the local macro in the data set. I personally prefer the use of a variable to pass the information because it avoids the problems that can arise from inadvertent name clash between the scalar and an existing variable or existing scalar: Stata won't let you create a new variable with the same name as an existing one. But it is wasteful of memory to use a variable for this purpose, and when working with large data sets, as is typically the case with -runby-, that might override the safety considerations if you are careful and confident of what else is flying around in the program environment.
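
                        A minimal sketch of the scalar route, not using -rangerun- itself: the program name showwindow and the scalar name Window are hypothetical, and the program simply takes no arguments, as a program called by -rangerun- or -runby- must.

                        Code:
                        capture program drop showwindow
                        program define showwindow
                            // no arguments: the information arrives through a scalar
                            display "window length: " scalar(Window)
                        end
                        
                        local w = 12
                        scalar Window = `w'    // copy the local's content into a scalar
                        showwindow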



                        • #13
                          Clyde Schechter Thank you for letting me know that I was not missing out on a secret rangerun command to get around that issue. Creating and using variables to make rangerun work is definitely something I have been doing, especially to define what the rolling window is, etc. However, at times I am executing hundreds of files with varying data lengths (i.e. _N). That is where the use of scalars in a rangerun program has the most value for me.



                          • #14
                            Maybe StataCorp can comment on the observed differences between -forvalues- and -while- in [#6] vs [#7], which are not reporting timings from one run only, following good advice in [3].

                            In [1]:
                            I had headache explaining to students what is this thing called "local" when I was teaching Stata to others. Mostly not because local is a complicated thing, but rather because StataCorp invented it, and it is alien when you hear of it first time. On the other hand everybody knows what is a scalar, no need for much explaining here.
                            A fact: Stata did not invent locals, macros, or scoping.

                            As described in this thread, scalars may have the same name as variables (which take precedence), and using tempname does not solve the name collision problem [#10], while the pseudofunction scalar() does, despite the advice in the manual preferring tempname over scalar().

                            A bit confusing then is the current description of namespace in [P] matrix — Introduction to matrix commands "Namespace":

                            The term "namespace" refers to how names are interpreted. For instance, the variables in your
                            dataset occupy one namespace—other things, such as value labels, macros, and scalars, can have the
                            same name and not cause confusion.
                            (probably: Stata will not be confused) Obviously, name collision can occur between scalars and variables, also when using tempname, and cause serious problems. So, to avoid name conflicts it seems a good naming convention is better [#9] than trusting tempname alone, and if tempname is used for naming scalars it should be supplemented with scalar().

                            Example of an explicit naming scheme using a global/public scalar (using scalar() might be "too much"):
                            Code:
                            assert float(scalar(num_scalar_some_good_name)) !=scalar(num_scalar_some_good_name)
                            If local/private scalars are defined via tempname, however, scalar() should be used ([10]).
                            Code:
                            assert float(scalar(`num_scalar_some_good_name')) !=scalar(`num_scalar_some_good_name')



                            • #15
                              Although Daniel Klein and Dirk Enzmann are technically right in what they are saying, this is overthinking the matter according to StataCorp.

                              I complained to StataCorp that what they are saying in the manual for -tempvar- and -tempname- is a little bit absolutely wrong. You can see the full correspondence below, but in short -tempvar- and -tempname- do not check for anything when they assign the names, unlike what is claimed in the manual, and they always start deterministically from __000000 -- Daniel Klein and Dirk Enzmann told me about this sometime in March and then I wrote to Stata Technical Support to hear their view on the matter.

                              The answer by Stata Technical Support was simple: a) We will explain this better in the manual in the future b) we told you in the manual not to start the names of your Stata objects with an underscore (_) or two underscores (__).

                              I am not any more pedantic than StataCorp, and hence I have no ambition to perfect my use of scalars and make it safe to an extent that even StataCorp does not care about.

                              Hence a fully reasonable policy here could be:

                              1. If we use scalars in our own do-files, adopt a convention like I do, say variables always start with a lowercase letter, scalars and matrices always start with a capital letter.

                              2. If we write ado code to be used by other people, we can just do what StataCorp does: give our scalars names like _i, _j, __i, __j, etc., and if a conflict arises because the user has given something the same name, just blame it on the user with the explanation "In the manual it is written that you should not start the names of your objects with underscores." By the way, the list of Stata reserved system variables is really short; there are fewer than 10. And the names that Stata generates via -tempvar- and -tempname- always look the same: two underscores (__) followed by 6 digits.

                              Here is the full correspondence:
                              Gueorgui Kolev Wed, Mar 24, 2021 at 9:19 AM
                              To: Stata Technical Support
                              Good morning,


                              In a recent thread on Statalist it became apparent that the facilities tempvar and tempname do not behave at all as advertised in the manual:
                              https://www.statalist.org/forums/for...able-as-a-stub


                              Stata 15, programming manual, p. 301 reads "The tempvar sumsq command creates a local macro called sumsq and stores in it a name that is different from any name currently in the data."

                              As it turns out Stata does not check whether the chosen name already exists in the data, and always deterministically starts from __000000. Here are two examples.


                              . clear

                              .
                              . set obs 3
                              number of observations (_N) was 0, now 3

                              .
                              . gen __000000 = 1000000

                              .
                              . tempname myscalar

                              .
                              . scalar `myscalar' = 7

                              .
                              . dis `myscalar'
                              1000000

                              .
                              . summ `myvar'

                                  Variable |        Obs        Mean    Std. Dev.       Min        Max
                              -------------+---------------------------------------------------------
                                  __000000 |          3     1000000           0    1000000    1000000


                              What we see in this example is that tempname did not check for anything in the existing data, it simply assigned __000000 as a name, despite that there was already such a variable in the data. And then this created a problem, because Stata chose the variable interpretation over the scalar, as I explained in this Stata Tip: Kolev, Gueorgui I. "Stata tip 31: Scalar or variable? The problem of ambiguous names." The Stata Journal 6, no. 2 (2006): 279-280.

                              Here is the same example, but with a tempvar now:

                              . clear

                              .
                              . set obs 3
                              number of observations (_N) was 0, now 3

                              .
                              . gen __000000 = 1000000

                              .
                              . tempvar myvar

                              .
                              . gen `myvar' = 3
                              variable __000000 already defined
                              r(110);


                              Again the tempvar facility did not check for anything, it simply assigned the name __000000, and as it happened variable with a name like this already existed, so we got an error.

                              At the end it all comes to that tempvar and tempname facilities simply rely on the hope that nobody would choose variables and scalars in his/her code having names such as __000000, __000001, etc. Contrary to what the manual claims, tempvar and tempname facilities do not check for anything in the existing working space.

                              Best regards,

                              Gueorgui

                              Stata Technical Support Thu, Mar 25, 2021 at 2:49 PM
                              To: Gueorgui Kolev
                              I will pass your comments on the documentation and tempvar behavior to our
                              development group for consideration.

                              In general, we advise users not to create their own variables beginning with
                              underscores (_ or __) or to save variables with those names in their data.
                              Some of these values are reserved for Stata's own use as tempvar names,
                              system variables, etc. Of course, not all are 'reserved' due to the nature
                              of the language- but in some cases they can conflict.

                              We will look into making this advice more clear in our documentation in the
                              future.

                              Sincerely,
                              etc.
