I am trying to make something like -levelsof- that saves the levels of a variable not in a local, but in another variable.

Otavio Conceicao

Join Date: Feb 2017

Posts: 65
#16

18 Apr 2021, 08:57

Thank you, Joro Kolev.

Actually, I was referring to an example using your final code to solve a problem such as the one you presented in #3.

I think this would be useful for Stata users to know, because many of us have probably already come across the problem of Stata hitting its limits when trying to define a local for a variable that contains too many distinct values.

As you said, the example from Ali in #2 is not the issue you solved.
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#17

18 Apr 2021, 14:54

Nick and Otavio Conceicao, the application that I have in mind goes like this. I have the following data:

Code:

. sysuse auto, clear
(1978 Automobile Data)

. recode rep (1 = 12) (2 = 20) (3=28) (4=29) (5=47) (. =60)
(rep78: 74 changes made)

. tab rep

     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
         12 |          2        2.70        2.70
         20 |          8       10.81       13.51
         28 |         30       40.54       54.05
         29 |         18       24.32       78.38
         47 |         11       14.86       93.24
         60 |          5        6.76      100.00
------------+-----------------------------------
      Total |         74      100.00

Now imagine that I want to loop through the levels of rep, which are unevenly spaced. The current way to do this in Stata would be:

Current way

Code:

. levelsof rep, local(reps)
12 20 28 29 47 60

. foreach l of local reps {
  2. dis `l'
  3. }
12
20
28
29
47
60

The "new way" which I am proposing goes as follows:
"New" way

Code:

. levelstovar varlevrep = rep

. count if !missing(varlevrep)
  6

. forvalues i = 1/`r(N)' {
  2. dis varlevrep[`i']
  3. }
12
20
28
29
47
60

So here the Old way and the New way give the same results.

However, now imagine that I have to loop through the levels of a variable which has hundreds of thousands or millions of levels. Then the Old way runs into Stata's limits, and it runs into them very quickly if the user has Intercooled Stata, as we discovered for the old version of the user-contributed -egen, xtile()- function from -egenmore- on this thread:
https://www.statalist.org/forums/for...d-to-not-occur
(This issue no longer exists: Ulrich Kohler, the author of -egen, xtile()-, rewrote the function so that it no longer uses -levels-, and it now works for any number of levels.)

The moral of the story is that, in general, one cannot rely on the -levelsof- Old way to loop through unevenly spaced levels of a variable in programmes that are either
1) to be used by other people, when you do not know what kind of Stata the other person has or on what data the other person will try your command, or
2) to be used by yourself, when you know in advance that you need them for a variable which has too many levels to be accommodated by -levelsof-.

So this is the problem that I am trying to solve here.
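
For readers who want to try the idea without the -levelstovar- ado to hand, storing the levels in a variable can be sketched with a few lines of Mata. This is not the code of -levelstovar-; it is a minimal sketch under stated assumptions: the variable name lev is made up for illustration, missing values are dropped, and the original (unrecoded) auto data are used.

```stata
sysuse auto, clear

* lev is a hypothetical name; it will hold the distinct values of rep78
generate double lev = .

mata:
v = uniqrows(st_data(., "rep78"))               // sorted distinct values
v = select(v, v :< .)                           // drop missing values
if (rows(v)) st_store((1::rows(v)), "lev", v)   // fill observations 1..k
end

* loop over the stored levels by observation number
count if !missing(lev)
forvalues i = 1/`r(N)' {
    display lev[`i']
}
```

Because the levels sit in observations 1 to k of an ordinary variable, the loop never builds a macro containing all the values, which is the point of the approach discussed in this thread.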

Originally posted by Nick Cox:

I already answered #9 in #7 to some extent.

I can't see that your desiderata are in general consistent.

More crucially, I don't yet see how such a variable would be used in a way that helps more than any existing approach.
Nick Cox

Join Date: Mar 2014

Posts: 35724
#18

18 Apr 2021, 15:39

As in #7, I generally recommend using egen's group() function whenever something like statsby, collapse or rangestat (SSC) doesn't loop over levels automatically; see https://www.stata.com/support/faqs/d...-with-foreach/
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#19

18 Apr 2021, 16:18

I know, Nick. For almost 20 years I have been following your recommendation of first mapping the unequally spaced levels to equally spaced levels through -egen, group()- and then looping through the equally spaced levels with -forvalues- (Method 1 in the FAQ you linked in #18).

This Method 1 approach has the advantage that it results in easy to write loops, easy to read loops, and it is pretty hard to mess things up using this approach.
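
For reference, the Method 1 pattern can be sketched as follows. This is a minimal illustration on the auto data, not code from the FAQ itself; the variable name g and the choice of summarizing mpg are assumptions made only for the example.

```stata
sysuse auto, clear

* g is a hypothetical name: it takes values 1, 2, ... for the
* distinct nonmissing values of rep78
egen long g = group(rep78)
summarize g, meanonly

* r(max) is the number of groups; the range is evaluated once, up front
forvalues i = 1/`r(max)' {
    quietly summarize mpg if g == `i'
    display "group `i': mean mpg = " %6.2f r(mean)
}
```

One thing to keep in mind with this pattern is that each pass evaluates the condition g == `i' over every observation, so the total work grows with the number of groups times the number of observations.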

However, there is a catch, which I learnt only in the last year or so: the loops resulting from Method 1 are very slow.

So this is the interesting new point that came up in your post, and which I have been studying over the last year or so. For example, in the experiments that I have done, -statsby- is something like 100 times faster than Method 1. There is a way to write loops which is even faster than -statsby-, and this way is what motivates the topic of this thread.
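
As a point of comparison, here is what handing the per-level work to -statsby- might look like. It is only a sketch on the auto data: the statistic names mean and n and the use of mpg are illustrative assumptions, and no timing claim is implied by the example itself.

```stata
sysuse auto, clear

* one pass per level of rep78, collected into a new dataset;
* the clear option replaces the data in memory with the results
statsby mean=r(mean) n=r(N), by(rep78) clear: summarize mpg

list
```

Here the levels are never written into a macro or looped over by hand; -statsby- works out the groups itself and leaves one observation per level.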

Otavio Conceicao

Join Date: Feb 2017

Posts: 65
#20

19 Apr 2021, 09:09

Thank you very much for presenting the application you have in mind, Joro Kolev!

Best,

Otavio
Otavio Conceicao

Join Date: Feb 2017

Posts: 65
#21

05 Jun 2021, 17:33

Dear Joro Kolev ,

I was wondering whether it is possible to adapt your 'levelstovar' command to replicate the same results with a string variable (e.g., variable 'make' in the auto dataset) instead of a numeric variable.

It would be a valuable contribution!
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#22

05 Jun 2021, 22:26

Thank you for the suggestion, Otavio. This has crossed my mind, and at some point I will make -levelstovar- accommodate string variables too.

For the time being (I do not know exactly what application you have in mind), note that string variables and nicely labelled numeric variables are almost perfect substitutes for all practical purposes.

So the following might do the trick for you (I am attaching the current version of -levelstovar- to this message):

Code:

. sysuse auto, clear
(1978 Automobile Data)

. egen nummake = group(make), label

. levelstovar mymake = nummake

. label values mymake nummake

. list make nummake mymake in 1/7

     +-----------------------------------------------+
     | make            nummake         mymake        |
     |-----------------------------------------------|
  1. | AMC Concord     AMC Concord     AMC Concord   |
  2. | AMC Pacer       AMC Pacer       AMC Pacer     |
  3. | AMC Spirit      AMC Spirit      AMC Spirit    |
  4. | Buick Century   Buick Century   Audi 5000     |
  5. | Buick Electra   Buick Electra   Audi Fox      |
     |-----------------------------------------------|
  6. | Buick LeSabre   Buick LeSabre   BMW 320i      |
  7. | Buick Opel      Buick Opel      Buick Century |
     +-----------------------------------------------+

What I did was to map the string variable make to a nicely labelled numeric variable nummake, then use -levelstovar- on nummake, and finally reapply the value label to the result.

Attached Files

levelstovar.ado (1.4 KB, 2 views)
Otavio Conceicao

Join Date: Feb 2017

Posts: 65
#23

07 Jun 2021, 08:16

Thank you very much, Joro Kolev !!

That is great!
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#24

07 Jun 2021, 18:44

If we don't need the values returned in a variable, but nevertheless want the capacity for a long list of returned values that are accessible to programming, what about instead making them available as a sequentially named list of r-class scalars or locals, say r(v1), r(v2), ..., up to r(vK) with K = r(r)? Here's a simple illustration of what I mean, using the Mata code that FernandoRios provided to do the heavy lifting. I experimented a little, and this seems to be a fast approach. I presume that instead of returning r-class results, one could give Mata a stub name for a series of locals and have it put the values directly into locals named stub1, stub2, etc., but I didn't try that. I don't believe (?) there are any tight limits on the number of r-class results one can return or locals one can create and use, so I'd think that this kind of approach would work for the current purposes.

Code:

cap prog drop lv2
program define lv2 , rclass
    syntax varname
    mata: y = st_data(.,"`varlist'")
    mata: y = sort(y,1)
    mata: info = panelsetup(y, 1)
    mata: y=y[info[,1],]
    mata: st_local("nval", strofreal(rows(y)))
    // Clumsy <grin> return of values from Mata.
    tempname temp
    forval i = 1/`nval' {
        mata: st_numscalar("`temp'", y[`i'])
        return scalar v`i' = `temp'
    }
    return scalar r = `nval'
end
// Demo
clear
set seed 76545
set obs 1000
gen x = ceil(runiform() * 10000)
lv2 x
di r(r) " different values found"
forval i = 1/`r(r)' {
    display r(v`i')
}
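
The untried variant mentioned above, in which Mata writes the values straight into numbered locals, might look something like the following when run as a do-file. It is a sketch only: the stub lev and the counter nlev are hypothetical names, uniqrows() is swapped in for the sort() and panelsetup() steps, and nothing here has been checked against very large numbers of levels.

```stata
clear
set seed 76545
set obs 1000
gen x = ceil(runiform() * 10000)

* lev and nlev are hypothetical names for the stub and the counter
mata:
y = uniqrows(st_data(., "x"))           // sorted distinct values of x
for (i = 1; i <= rows(y); i++) {
    // creates locals lev1, lev2, ... in the calling do-file
    st_local("lev" + strofreal(i), strofreal(y[i], "%18.0g"))
}
st_local("nlev", strofreal(rows(y)))
end

display "`nlev' different values found"
forvalues i = 1/`nlev' {
    display "`lev`i''"
}
```

The values are stored as text in the locals, so the %18.0g format is there to avoid losing digits for large integers; for non-integer data a different format may be needed.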

Last edited by Mike Lacy; 07 Jun 2021, 18:47.