
  • How to split a local into shorter locals?

    Good morning,

    I would like to know how to split a local that is too long (for some purpose) into a couple of shorter locals. I guess there is some extended macro function for this, but the help for the extended macro functions is a bit opaque; I read it for more than half an hour and still could not figure out how to do this.

    The motivation for the question is from the task in this thread here:

    https://www.statalist.org/forums/for...other-variable

    I want to check whether any value of var1 appears anywhere in var2. So my solution is:

    Code:
     . levelsof var2, local(second)  separate(,)
    
    28,41,42,547  
    
    . gen coincide = inlist(var1,`second')
    This works for the small data example because there are only 4 distinct values in var2, but if there were 5000 distinct values, I would need to split my local 'second' into locals that are shorter than 254 elements, because this is the maximum number of arguments that the -inlist- function can take.

    So what are the extended macro functions relevant here, and how do I use them?

    In short, imagine that I have a local consisting of 5000 elements; how do I split it into a couple of locals, each shorter than 254 elements?




  • #2
    I am not aware of a function that does what you ask for. I would look for such a function in

    Code:
    help macro lists
    because, concerning terminology, I do not think there are "elements" in a local macro. A local macro is just one string; that string might consist of numbers separated by commas, or it might consist of characters not separated at all. I think of this as a general parsing problem, and (some of) the relevant tools are

    Code:
    help tokenize
    help gettoken
    help foreach
    help while
    help mata tokenget()
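
    To sketch how gettoken could be applied here (this is illustrative only and not from the original posts; the local `second' mimics the levelsof output from #1): gettoken peels one token at a time off a comma-separated local, and note that the comma itself comes back as a token that has to be skipped.

    Code:
    // sketch: peel tokens off a comma-separated local with gettoken
    local second "28,41,42,547"
    while `"`second'"' != "" {
        gettoken piece second : second, parse(",")
        if `"`piece'"' != "," {
            display `"`piece'"'   // one value per iteration: 28, 41, 42, 547
        }
    }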
    Best
    Daniel



    • #3
      Thank you, Daniel.

      I did figure out myself that what I want to do must be in the sections on extended macro functions concerning either -lists- or -parsing-. The help is opaque, though: I skimmed through these sections and could not understand anything, because there are no examples, just some very abstract talk about what these things are supposed to do.

      I will check out -tokenize- and -gettoken-, thanks for this tip. (I doubt that there will be anything in -foreach- and -while-, I have read these helps plenty of times. )



      • #4
        The references to foreach and while were rather implicit: your solution will almost certainly involve some sort of loop.

        Best
        Daniel



        • #5
          There is syntax for splitting a macro into pieces. Its main use for me is splitting text for showing in chunks on graphs.

          Otherwise, my answer is no answer, or perhaps an answer you partly expect. Once you find yourself wanting to do this, you should suspect that local macros are not the tool of choice anyway. The thread in question to me shouts merge and -- more generally -- a strategy based on using variables.



          • #6
            Originally posted by Nick Cox View Post
            There is syntax for splitting a macro into pieces. Its main use for me is splitting text for showing in chunks on graphs.

            Otherwise, my answer is no answer, or perhaps an answer you partly expect. Once you find yourself wanting to do this, you should suspect that local macros are not the tool of choice anyway. The thread in question to me shouts merge and -- more generally -- a strategy based on using variables.
            I did not get you, Nick... Where do I read about this "syntax for splitting a macro into pieces" that you are mentioning? In which help file should I look?


            (Otherwise yes, the thread in question shouts -merge- and looping through observations to many people. )



            • #7
              -tokenize- will split up a string and can be used to put a big macro into a numbered list of small macros. One oddity I encountered is that -tokenize- treats the parsing character itself as a token or "element", so I created a bit of code to exclude them. For future reference, perhaps someone can explain why -tokenize- treats the parsing character as a token.
              Code:
              // Make a big macro to work with.
              sysuse auto
              local bigmac = ""
              forval i = 1/`=_N' {
                 local bigmac = "`bigmac'" + make[`i'] + ","
              }
              mac dir // visualize
              //
              tokenize "`bigmac'", parse(",") // `1', `2', ... will hold the elements of the big macro
              local done = 0
              local i = 0
              local count = 0
              while !`done' {  
                 local ++i
                  local next = "``i''"
                  local done = ("`next'" == "")
                 // Include elements that are not the parsing character itself.
                   if ("`next'" != ",") & (!`done') {
                     local ++count
                     local smallmac`count' = "`next'"
                  }  
              }
              //
              di "`count' elements made into small macros"
              forval i = 1/`count' {
                di "`smallmac`i''"
              }



              • #8
                An alternative using extended macro functions:

                Code:
                #delimit ;
                local longmac "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25" ;
                local stop=0 ;
                local i=1 ;
                while `stop'!=1 { ;
                    local part`i' : piece `i' 200 of "`longmac'", nobreak ;
                    if `=length("`part`i''")' == 0 { ;
                        local stop=1 ;
                        local lasti=`i'-1 ;
                        } ;
                    else local ++i ;
                    } ;
                
                forval i=1/`lasti' {;
                    *remove comma if it is at the end of a line;
                    if substr("`part`i''", -1, .)=="," local part`i' = substr("`part`i''", 1, `=`=length("`part`i''")'-1' ) ;
                    *display inlist;
                    noi di "inlist(varname, `part`i'')";
                    };
                Stata/MP 14.1 (64-bit x86-64)
                Revision 19 May 2016
                Win 8.1



                • #9
                  Re #6: Carole J. Wilson is showing this syntax in her code in #8. It is documented under

                  Code:
                  help extended fcn
                  which can be reached directly or from

                  Code:
                  help macro
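
                  For readers following along, here is a minimal illustration of the :piece syntax (the string and chunk width are made up for demonstration). By default :piece does not break in the middle of a word, which is what makes it handy for chunking text:

                  Code:
                  // :piece extracts successive chunks of at most # characters,
                  // by default without breaking inside a word
                  local s "alpha beta gamma delta"
                  local p1 : piece 1 12 of "`s'"
                  local p2 : piece 2 12 of "`s'"
                  display "`p1'"    // alpha beta
                  display "`p2'"    // gamma delta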



                  • #10
                    On the chance that we're dealing with an example of the XY problem here, I want to return to post #1, where Joro suggests that his objective is to use the macro segments in the inlist() function.

                    If the eventual objective of using inlist() is to determine if a value appears in a list of values, consider instead the following code using macro list as Daniel referenced in post #2.
                    Code:
                    . set obs 1000
                    number of observations (_N) was 0, now 1,000
                    
                    . generate var2 = _n+999
                    
                    . quietly levelsof var2, local(second)
                    
                    . // is 1234 in the list?
                    . local want 1234
                    
                    . local have : list second & want
                    
                    . display "want `want' have `have'"
                    want 1234 have 1234
                    
                    . // is 2345 in the list?
                    . local want 2345
                    
                    . local have : list second & want
                    
                    . display "want `want' have `have'"
                    want 2345 have
                    Last edited by William Lisowski; 22 Jan 2019, 10:20.



                    • #11
                      William, my question was how can I extract a sub macro from a bigger macro, or in other words, how can I split a big macro into small macros.

                      I was asking whether there is an extended macro function such as (imaginary syntax follows, there is no such thing):

                      local submacroname : submacro 1/244 bigmacroname , parse("pchars")

                      which imaginary function would take the bigmacroname and extract from it the submacro consisting of the first 244 tokens of bigmacroname.

                      The answer that Nick, Mike and Carole gave me was "no, there is no such thing, and you need to write a loop to do this piece by piece" or "token by token". Mike and Carole provided code, which looks complicated to me and I need to study it more (probably they are treating the problem at greater generality).

                      For the time being I wrote my own loop for doing this, and along the way verified the observation by Mike Lacy that -tokenize- behaves in unexpected ways. When the parsing character is space, tokenize disregards it, however when the parsing character is comma, tokenize puts the commas in the positional locals. E.g., the sequence a b after tokenize sends a to `1' and b to `2', however the sequence a,b after tokenize sends a to `1' and , (the comma) to `2' and b to `3'.
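
                      The behavior described in the previous paragraph can be seen directly (a minimal demonstration, not from the original post):

                      Code:
                      tokenize "a b"
                      display "`1'|`2'"        // a|b
                      tokenize "a,b", parse(",")
                      display "`1'|`2'|`3'"    // a|,|b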


                      Code:
                      . sysuse auto, clear
                      (1978 Automobile Data)
                      
                      . levelsof price, local(P) separate(,)
                      3291,3299,3667,3748,3798,3799,3829,3895,3955,3984,3995,4010,4060,4082,4099,4172,4181,4187,4195,4296,4389,4424,4425
                      > ,4453,4482,4499,4504,4516,4589,4647,4697,4723,4733,4749,4816,4890,4934,5079,5104,5172,5189,5222,5379,5397,5705,5
                      > 719,5788,5798,5799,5886,5899,6165,6229,6295,6303,6342,6486,6850,7140,7827,8129,8814,9690,9735,10371,10372,11385,
                      > 11497,11995,12990,13466,13594,14500,15906
                      
                      . tokenize "`P'", parse(",")
                      
                      . forvalues i=1/7 {
                        2. local cumul = "`cumul'" + "``i''"
                        3. }
                      
                      . dis inlist(3291,`cumul')
                      1
                      
                      . dis inlist(3798,`cumul')
                      0



                      • #12
                        Mike Lacy and anyone interested

                        The definition of tokens in Stata (with nothing else said) is that they are delimited by white space, except that double quotes and compound double quotes bind into tokens (meaning that white space inside such tokens is treated as literal characters, not delimiters). That's a little clumsy as a definition but gives some much needed flexibility. Hence if we apply tokenize to simple input we get results that are often convenient. In what follows, only the output of macro list relevant to my examples is preserved:

                        Code:
                        . tokenize "I love Stata"
                        
                        . mac li
                        _3:             Stata
                        _2:             love
                        _1:             I
                        
                        . tokenize `" "I love Stata" "I have used Excel in my past" "'
                        
                        . mac li
                        _2:             I have used Excel in my past
                        _1:             I love Stata
                        
                        . tokenize "1, 2, 3", parse(,)
                        
                        . mac li
                        _5:             3
                        _4:             ,
                        _3:             2
                        _2:             ,
                        _1:             1
                        I agree that it's a little odd at first sight that when you are allowed to extend the set of parsing characters, those characters end up as local macros too. The rationale is that they might have meaning to you! The responsibility is put on the programmer to ignore them if (and only if) they should indeed be ignored.

                        As an experiment consider the result of

                        Code:
                        tokenize "Some text, more text; yet more", p(, ;)
                        Here we're closer to parsing ordinary language, where the different punctuation signs don't have equal import. So, often you would have to add your own code based on what you want to ignore and/or what you want to do. Stata can't, and doesn't want to, decide that for you.





                        • #13
                          Joro, I recognized that in post #1 you indeed asked

                          how to split a local that is too long (for some purpose) into a couple of shorter locals
                          and you went on to explain

                          I would need to split my local 'second' into locals that are shorter than 254 elements, because this is the maximum arguments that the -inlist- function can take.
                          That is true if you're restricted to using inlist(), but inlist() is not the only tool at your disposal, and you have not shown anything that suggests inlist() is required for your purposes.

                          The objective of my post #10 was to point out that the macro list extended macro function allows the user to look up a value in a macro with more than 254 elements.

                          There are, we learn, people who follow Statalist to improve their knowledge of Stata. I was one of them once, and for that matter, still am, and it was through a reply on Statalist that I learned of the macro list extended macro functions.

                          Consequently, I thought it important that the alternative to using inlist to look up a value in a list be made explicit, and in general, to highlight the capabilities of the macro list extended macro function, which extend far beyond looking up items in a list. In particular, the tools for creating the union or intersection of two lists have been invaluable to me.

                          So, to anyone reading this, if you haven't done so already, do take a look at the output of help macro list. You may not need it now, but it is a good tool to know about, and it's sort of buried two clicks deep under help macro.
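
                          As a brief sketch of the union and intersection operators mentioned above (the lists A and B here are made-up examples):

                          Code:
                          local A "1 2 3 4"
                          local B "3 4 5 6"
                          local U : list A | B   // union
                          local I : list A & B   // intersection
                          display "`U'"          // 1 2 3 4 5 6
                          display "`I'"          // 3 4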



                          • #14
                            While Nick's thoughts on the behavior of tokenize are convincing, I would still argue that its behavior is inconsistent in that white space is treated differently from other parsing characters. In fact, white space is never treated as a parsing character (as defined in Mata; see below). White space is either ignored completely, in which case it does not show up in separate locals, or it is treated as a literal character. To do the latter, you must specify at least one parsing character.

                            Mata's tokenget() distinguishes white-space characters (which might be any characters, not just literal white space) from parsing characters; the former are ignored while the latter are used as delimiters and show up as separate tokens. tokenize lacks this differentiation and mixes both in the parse() option, which defines parsing characters when fed characters other than white space, and treats white space as a literal character in that case.

                            I tend to think of

                            Code:
                            tokenize "some string" , parse( ,)
                            in Stata as being equivalent to

                            Code:
                            t = tokeninit(" ", ",")
                            ...
                            in Mata, when it should be

                            Code:
                            t = tokeninit("", (" ", ","))
                            ...
                            that is, parse on spaces and commas; tokenize cannot do this.

                            Also,

                            Code:
                            tokenize "some string" , parse(,) // note omitted white space
                            is the equivalent to

                            Code:
                            t = tokeninit("", ",")
                            ...
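
                            To make the fragments above concrete, here is a self-contained sketch of the tokenget() loop they allude to (the input string is invented for illustration; the comma, being a parsing character, is returned as a token in its own right):

                            Code:
                            mata:
                            t = tokeninit("", ",")   // no white-space chars; comma parses
                            tokenset(t, "1,2,3")
                            while ((tok = tokenget(t)) != "") {
                                if (tok != ",") printf("%s\n", tok)
                            }
                            end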
                            Best
                            Daniel
                            Last edited by daniel klein; 22 Jan 2019, 14:26. Reason: reference to Nick Cox disappeared when editing the post



                            • #15
                              StataCorp learn from their experience. They also don't change commands unless it is needed. So, my guess is that tokenget() is more subtle partly because its design benefits from more programming experience and partly because that is the way that Stata is changing long-term. If you need and want lower-level tools, they are increasingly likely to be in Mata.

                              I've got a dim recollection of suggesting to StataCorp several years ago that tokenize be extended to allow specification of a stub, so that its results weren't inevitably in local macros 1, 2, 3, ... but in say part1, part2, part3, ... -- and also to allow specifying that non-space parsing characters be discarded and not put in a local macro by themselves. The first little idea was written up in tknz (SSC).
