
  • #16
    Thank you, Joro Kolev.

    Actually, I was referring to an example that uses your final code to solve a problem like the one you presented in #3.

    I think this would be great for Stata users to know, because many of us have probably already come across the problem of Stata hitting its limits when trying to define a local for a variable that contains too many different values.

    As you said, the example from Ali in #2 is not the issue you solved.



    • #17
      Nick and Otavio Conceicao, the application that I have in mind goes like this. I have the following data:

      Code:
      . sysuse auto, clear
      (1978 Automobile Data)
      
      . recode rep (1 = 12) (2 = 20) (3=28) (4=29) (5=47) (. =60)
      (rep78: 74 changes made)
      
      . tab rep
      
           Repair |
      Record 1978 |      Freq.     Percent        Cum.
      ------------+-----------------------------------
               12 |          2        2.70        2.70
               20 |          8       10.81       13.51
               28 |         30       40.54       54.05
               29 |         18       24.32       78.38
               47 |         11       14.86       93.24
               60 |          5        6.76      100.00
      ------------+-----------------------------------
            Total |         74      100.00
      Now imagine that I want to loop through the levels of rep, which are unevenly spaced. The current way to do this in Stata would be

      Current way
      Code:
      . levelsof rep, local(reps)
      12 20 28 29 47 60
      
      . foreach l of local reps {
        2. dis `l'
        3. }
      12
      20
      28
      29
      47
      60
      The "new way" which I am proposing goes as follows:
      "New" way
      Code:
      . levelstovar varlevrep = rep
      
      . count if !missing(varlevrep)
        6
      
      . forvalues i = 1/`r(N)' {
        2. dis varlevrep[`i']
        3. }
      12
      20
      28
      29
      47
      60
      So here the Old way and the New way give the same results.

      However, now imagine that I have to loop through the levels of a variable which has hundreds of thousands or millions of levels. Then the Old way runs into Stata's limits, and runs into them very quickly if the user has Intercooled Stata, as we discovered for the old version of the user-contributed function -egen, xtile()- from -egenmore- on this thread:
      https://www.statalist.org/forums/for...d-to-not-occur
      (This issue no longer exists: the author of -egen, xtile()-, Ulrich Kohler, rewrote the function so that it no longer uses -levels-, and it now works for any number of levels.)

      The moral of the story is that, in general, one cannot rely on the -levelsof- Old way to loop through unevenly spaced levels of a variable in programmes that are either
      1) to be used by other people, when you do not know what kind of Stata the other person has, or on what data the other person will try your command; or
      2) used by yourself, when you know in advance that you need them for a variable which has too many levels to be accommodated by -levelsof-.

      So this is the problem that I am trying to solve here.


      Originally posted by Nick Cox
      I already answered #9 in #7 to some extent.



      I can't see that your desiderata are in general consistent.

      More crucially, I don't yet see how such a variable would be used in a way that helps more than any existing approach.



      • #18
        As in #7, I generally recommend using egen's group() function, as at https://www.stata.com/support/faqs/d...-with-foreach/ --- that is, whenever something like statsby, collapse or rangestat (SSC) doesn't loop over levels automatically.
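
        A minimal sketch of that recipe, using the auto data from #17. The variable name g, the local J, and the summarize call inside the loop are only placeholders for illustration:

        ```stata
        sysuse auto, clear
        * map the unevenly spaced levels of rep78 to 1, 2, ..., J
        * (missing values of rep78 get a missing group number by default)
        egen g = group(rep78)
        * the largest group number is the number of distinct nonmissing levels
        summarize g, meanonly
        local J = r(max)
        forvalues i = 1/`J' {
            * placeholder for the real work done per level
            quietly summarize price if g == `i'
            display "group `i': mean price = " r(mean)
        }
        ```

        The loop bound is stored in a local first, because the summarize call inside the loop overwrites r(max).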



        • #19
          I know, Nick. For almost 20 years I have been following your recommendation of mapping the unequally spaced levels firstly to equally spaced levels through -egen, group()- and then looping through the equally spaced with -forvalues- (Method 1 in your reference below).

          This Method 1 approach has the advantage that it results in loops that are easy to write and easy to read, and it is pretty hard to mess things up using this approach.

          However, and I learnt this relatively recently, in the last year or so, there is a catch. The catch is that the loops resulting from Method 1 are very slow.

          So this is the interesting new point that came up in your post here, and which I have been studying in the last year or so: for example, in the experiments that I have done, -statsby- is something like 100 times faster than Method 1. There is a way to write loops which is even faster than -statsby-, and this is what motivates the topic of this thread.
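
          For reference, a minimal sketch of letting -statsby- do the looping over levels. The result name mean_price is only illustrative, and the snippet shows the syntax only, not the timing comparison or the faster loop mentioned above:

          ```stata
          sysuse auto, clear
          * one row of results per level of rep78;
          * the clear option replaces the data in memory with those results
          statsby mean_price = r(mean), by(rep78) clear: summarize price
          list
          ```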



          Originally posted by Nick Cox
          As in #7, I generally recommend using egen's group() function, as at https://www.stata.com/support/faqs/d...-with-foreach/ --- that is, whenever something like statsby, collapse or rangestat (SSC) doesn't loop over levels automatically.



          • #20
            Thank you very much for presenting the application you have in mind, Joro Kolev!

            Best,

            Otavio



            • #21
              Dear Joro Kolev,

              I was wondering whether it is possible to adapt your 'levelstovar' command to replicate the same results with a string variable (e.g., variable 'make' in the auto dataset) instead of a numeric variable.

              It would be a valuable contribution!



              • #22
                Thank you for the suggestion, Otavio. This has crossed my mind, and at some point I will make -levelstovar- accommodate string variables too.

                For the time being -- I do not know exactly what application you have in mind -- note that string variables and nicely labelled numerical variables are almost perfect substitutes for all practical purposes.

                So the following might do the trick for you (I am attaching the current version of -levelstovar- to this message):

                Code:
                . sysuse auto, clear
                (1978 Automobile Data)
                
                . egen nummake = group(make), label
                
                . levelstovar mymake = nummake
                
                . label values mymake nummake
                
                . list make nummake mymake in 1/7
                
                     +-----------------------------------------------+
                     | make                  nummake          mymake |
                     |-----------------------------------------------|
                  1. | AMC Concord       AMC Concord     AMC Concord |
                  2. | AMC Pacer           AMC Pacer       AMC Pacer |
                  3. | AMC Spirit         AMC Spirit      AMC Spirit |
                  4. | Buick Century   Buick Century       Audi 5000 |
                  5. | Buick Electra   Buick Electra        Audi Fox |
                     |-----------------------------------------------|
                  6. | Buick LeSabre   Buick LeSabre        BMW 320i |
                  7. | Buick Opel         Buick Opel   Buick Century |
                     +-----------------------------------------------+
                What I did was to convert the string variable make into a nicely labelled numeric variable nummake, then use -levelstovar- on the numeric nummake, and finally reapply the value label to the result.


                Originally posted by Otavio Conceicao
                Dear Joro Kolev ,

                I was wondering whether it is possible to adapt your 'levelstovar' command to replicate the same results with a string variable (e.g., variable 'make' in the auto dataset) instead of a numeric variable.

                It would be a valuable contribution!
                Attached Files



                • #23
                  Thank you very much, Joro Kolev!!

                  That is great!



                  • #24
                    If we don't have to have the values returned in a variable, but nevertheless want the capacity for a long list of returned values, and want those values accessible to programming, what about instead making them available as a sequentially named list of r-class scalars or locals, say r(v1), r(v2), ..., r(vK), where K = r(r)? Here's a simple illustration of what I mean, using the Mata code that FernandoRios provided to do the heavy lifting. I tried a little experimenting, and this seems to be a fast approach. I presume that instead of returning r-class results, one could give Mata a stub name for a series of locals, and have it put the values directly into locals named stub1, stub2, etc., but I didn't try that. I don't believe (?) there are any short limits on the number of r-class results one can return or locals one can create and use, so I'd think that this kind of approach would work for the current purposes.
                    Code:
                    cap prog drop lv2
                    program define lv2 , rclass
                        syntax varname
                        mata: y = st_data(.,"`varlist'")
                        mata: y = sort(y,1)
                        mata: info = panelsetup(y, 1)
                        mata: y=y[info[,1],]
                        mata: st_local("nval", strofreal(rows(y)))
                       // Clumsy <grin> return of values from Mata.
                        tempname temp
                        forval i = 1/`nval' {
                           mata: st_numscalar("`temp'", y[`i'])
                           return scalar v`i' = `temp'
                        }
                        return scalar r = `nval'    
                    end
                    // Demo
                    clear
                    set seed 76545
                    set obs 1000
                    gen x = ceil(runiform() * 10000)
                    lv2 x
                    di r(r) " different values found"
                    forval i = 1/`r(r)' {
                       display r(v`i')
                    }
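
                    A rough sketch of the untried stub-local variant, for illustration only. It assumes the code is run from a do-file, so that st_local() creates the locals in the do-file's own scope; inside a program, st_local() would set that program's locals rather than the caller's. The stub name lev, the local nlev, and the Mata variable vals are hypothetical, and uniqrows() is swapped in for the sort/panelsetup steps above. Whether this is faster than returning r-class scalars is not tested here.

                    ```stata
                    clear
                    set seed 76545
                    set obs 1000
                    gen x = ceil(runiform() * 10000)
                    * sorted distinct values of x, held in a Mata column vector
                    mata: vals = uniqrows(st_data(., "x"))
                    * number of distinct values, passed back as local nlev
                    mata: st_local("nlev", strofreal(rows(vals)))
                    * one local per value: lev1, lev2, ...
                    mata: for (i = 1; i <= rows(vals); i++) st_local("lev" + strofreal(i), strofreal(vals[i], "%18.0g"))
                    di "`nlev' different values found"
                    forval i = 1/`nlev' {
                       display `lev`i''
                    }
                    ```

                    The wide "%18.0g" format is used so that large values are not rounded when they are converted to strings.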
                    Last edited by Mike Lacy; 07 Jun 2021, 18:47.
