
  • How to split a local into shorter locals?

    Good morning,

    I would like to know how to split a local that is too long (for some purpose) into a couple of shorter locals. I guess there is some extended macro function for this, but the help for the extended macro functions is a bit opaque; I read it for more than half an hour and still could not figure out how to do this.

    The motivation for the question is from the task in this thread here:

    https://www.statalist.org/forums/for...other-variable

    I want to check whether any value of var1 appears anywhere in var2. So my solution is:

    Code:
     . levelsof var2, local(second)  separate(,)
    
    28,41,42,547  
    
    . gen coincide = inlist(var1,`second')
    This works for the small data example because there are only 4 distinct values in var2, but if there were 5000 distinct values, I would need to split my local 'second' into locals that are shorter than 254 elements, because this is the maximum number of arguments that the -inlist- function can take.

    So what are the extended macro functions relevant here, and how do I use them?

    In short, imagine that I have a local consisting of 5000 elements; how do I split it into a couple of locals, each shorter than 254 elements?




  • #2
    I am not aware of a function that does what you ask for. I would look for such a function in

    Code:
    help macro lists
    because, concerning terminology, I do not think there are "elements" in a local macro. A local macro is just one string; that string might consist of numbers separated by commas, or it might consist of characters not separated at all. I think of this as a general parsing problem, and (some of) the relevant tools are

    Code:
    help tokenize
    help gettoken
    help foreach
    help while
    help mata tokenget()
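
    To sketch how gettoken could be applied here (this is illustrative only and not from the original posts; the local `second' mimics the levelsof output from #1): gettoken peels one token at a time off a comma-separated local, and note that the comma itself comes back as a token that has to be skipped.

    Code:
    // sketch: peel tokens off a comma-separated local with gettoken
    local second "28,41,42,547"
    while `"`second'"' != "" {
        gettoken piece second : second, parse(",")
        if `"`piece'"' != "," {
            display `"`piece'"'   // one value per iteration: 28, 41, 42, 547
        }
    }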
    Best
    Daniel



    • #3
      Thank you, Daniel.

      I did figure out myself that what I want to do must be in the sections on extended macro functions concerning either -lists- or -parsing-. The help is opaque, though: I skimmed through these sections and could not understand anything, because there are no examples, just some very abstract talk about what these things are supposed to do.

      I will check out -tokenize- and -gettoken-, thanks for this tip. (I doubt that there will be anything in -foreach- and -while-, I have read these helps plenty of times. )



      • #4
        The references to foreach and while were rather implicit: your solution will almost certainly involve some sort of loop.

        Best
        Daniel



        • #5
          There is syntax for splitting a macro into pieces. Its main use for me is splitting text for showing in chunks on graphs.

          Otherwise, my answer is no answer, or perhaps an answer you partly expect. Once you find yourself wanting to do this, you should suspect that local macros are not the tool of choice anyway. The thread in question to me shouts merge and -- more generally -- a strategy based on using variables.



          • #6
            Originally posted by Nick Cox View Post
            There is syntax for splitting a macro into pieces. Its main use for me is splitting text for showing in chunks on graphs.

            Otherwise, my answer is no answer, or perhaps an answer you partly expect. Once you find yourself wanting to do this, you should suspect that local macros are not the tool of choice anyway. The thread in question to me shouts merge and -- more generally -- a strategy based on using variables.
            I did not get you, Nick... Where do I read about this "syntax for splitting a macro into pieces" that you are mentioning? In which help file should I look?


            (Otherwise yes, the thread in question shouts -merge- and looping through observations to many people. )



            • #7
              -tokenize- will split up a string and can be used to put a big macro into a numbered list of small macros. One oddity I encountered is that -tokenize- treats the parsing character itself as a token or "element", so I created a bit of code to exclude them. For future reference, perhaps someone can explain why -tokenize- treats the parsing character as a token.
              Code:
              // Make a big macro to work with.
              sysuse auto
              local bigmac = ""
              forval i = 1/`=_N' {
                 local bigmac = "`bigmac'" + make[`i'] + ","
              }
              mac dir // visualize
              //
              tokenize "`bigmac'", parse(",") // `1', `2', ... will hold the elements of the big macro
              local done = 0
              local i = 0
              local count = 0
              while !`done' {  
                 local ++i
                  local next = "``i''"
                  local done = ("`next'" == "")
                 // Include elements that are not the parsing character itself.
                   if ("`next'" != ",") & (!`done') {
                     local ++count
                     local smallmac`count' = "`next'"
                  }  
              }
              //
              di "`count' elements made into small macros"
              forval i = 1/`count' {
                di "`smallmac`i''"
              }



              • #8
                An alternative using extended macro functions:

                Code:
                #delimit ;
                local longmac "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25" ;
                local stop=0 ;
                local i=1 ;
                while `stop'!=1 { ;
                    local part`i' : piece `i' 200 of "`longmac'", nobreak ;
                    if `=length("`part`i''")' == 0 { ;
                        local stop=1 ;
                        local lasti=`i'-1 ;
                        } ;
                    else local ++i ;
                    } ;
                
                forval i=1/`lasti' {;
                    *remove comma if it is at the end of a line;
                    if substr("`part`i''", -1, .)=="," local part`i' = substr("`part`i''", 1, `=`=length("`part`i''")'-1' ) ;
                    *display inlist;
                    noi di "inlist(varname, `part`i'')";
                    };
                Stata/MP 14.1 (64-bit x86-64)
                Revision 19 May 2016
                Win 8.1



                • #9
                  Re #6: Carole J. Wilson is showing this syntax in her code in #8. It is documented under

                  Code:
                  help extended fcn
                  which can be reached directly or from

                  Code:
                  help macro
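
                  For readers following along, here is a minimal illustration of the :piece syntax (the string and chunk width are made up for demonstration). By default :piece does not break in the middle of a word, which is what makes it handy for chunking text:

                  Code:
                  // :piece extracts successive chunks of at most # characters,
                  // by default without breaking inside a word
                  local s "alpha beta gamma delta"
                  local p1 : piece 1 12 of "`s'"
                  local p2 : piece 2 12 of "`s'"
                  display "`p1'"    // alpha beta
                  display "`p2'"    // gamma delta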



                  • #10
                    On the chance that we're dealing with an example of the XY problem here, I want to return to post #1, where Joro suggests that his objective is to use the macro segments in the inlist() function.

                    If the eventual objective of using inlist() is to determine if a value appears in a list of values, consider instead the following code using macro list as Daniel referenced in post #2.
                    Code:
                    . set obs 1000
                    number of observations (_N) was 0, now 1,000
                    
                    . generate var2 = _n+999
                    
                    . quietly levelsof var2, local(second)
                    
                    . // is 1234 in the list?
                    . local want 1234
                    
                    . local have : list second & want
                    
                    . display "want `want' have `have'"
                    want 1234 have 1234
                    
                    . // is 2345 in the list?
                    . local want 2345
                    
                    . local have : list second & want
                    
                    . display "want `want' have `have'"
                    want 2345 have
                    Last edited by William Lisowski; 22 Jan 2019, 10:20.



                    • #11
                      William, my question was how can I extract a sub macro from a bigger macro, or in other words, how can I split a big macro into small macros.

                      I was asking whether there is an extended macro function such as (imaginary syntax follows, there is no such thing):

                      local submacroname : submacro 1/244 bigmacroname , parse("pchars")

                      which imaginary function would take the bigmacroname and extract from it the submacro consisting of the first 244 tokens of bigmacroname.

                      The answer that Nick, Mike and Carole gave me was "no, there is no such thing, and you need to write a loop to do this piece by piece" or "token by token". Mike and Carole provided code, which looks complicated to me and I need to study it more (probably they are treating the problem at greater generality).

                      For the time being I wrote my own loop for doing this, and along the way verified the observation by Mike Lacy that -tokenize- behaves in unexpected ways. When the parsing character is space, tokenize disregards it, however when the parsing character is comma, tokenize puts the commas in the positional locals. E.g., the sequence a b after tokenize sends a to `1' and b to `2', however the sequence a,b after tokenize sends a to `1' and , (the comma) to `2' and b to `3'.
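
                      The behavior described in the previous paragraph can be seen directly (a minimal demonstration, not from the original post):

                      Code:
                      tokenize "a b"
                      display "`1'|`2'"        // a|b
                      tokenize "a,b", parse(",")
                      display "`1'|`2'|`3'"    // a|,|b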


                      Code:
                      . sysuse auto, clear
                      (1978 Automobile Data)
                      
                      . levelsof price, local(P) separate(,)
                      3291,3299,3667,3748,3798,3799,3829,3895,3955,3984,3995,4010,4060,4082,4099,4172,4181,4187,4195,4296,4389,4424,4425
                      > ,4453,4482,4499,4504,4516,4589,4647,4697,4723,4733,4749,4816,4890,4934,5079,5104,5172,5189,5222,5379,5397,5705,5
                      > 719,5788,5798,5799,5886,5899,6165,6229,6295,6303,6342,6486,6850,7140,7827,8129,8814,9690,9735,10371,10372,11385,
                      > 11497,11995,12990,13466,13594,14500,15906
                      
                      . tokenize "`P'", parse(",")
                      
                      . forvalues i=1/7 {
                        2. local cumul = "`cumul'" + "``i''"
                        3. }
                      
                      . dis inlist(3291,`cumul')
                      1
                      
                      . dis inlist(3798,`cumul')
                      0



                      • #12
                        Mike Lacy and anyone interested

                        The definition of tokens in Stata (with nothing else said) is that they are delimited by white space, except that double quotes and compound double quotes bind into tokens (meaning that white space inside such tokens is treated as literal characters, not delimiters). That's a little clumsy as a definition but gives some much needed flexibility. Hence if we apply tokenize to simple input we get results that are often convenient. In what follows, only the output of macro list relevant to my examples is preserved:

                        Code:
                        . tokenize "I love Stata"
                        
                        . mac li
                        _3:             Stata
                        _2:             love
                        _1:             I
                        
                        . tokenize `" "I love Stata" "I have used Excel in my past" "'
                        
                        . mac li
                        _2:             I have used Excel in my past
                        _1:             I love Stata
                        
                        . tokenize "1, 2, 3", parse(,)
                        
                        . mac li
                        _5:             3
                        _4:             ,
                        _3:             2
                        _2:             ,
                        _1:             1
                        I agree that it's a little odd at first sight that when you are allowed to extend the set of parsing characters, those characters end up as local macros too. The rationale is that they might have meaning to you! The responsibility is put on the programmer to ignore them if (and only if) they should indeed be ignored.

                        As an experiment consider the result of

                        Code:
                        tokenize "Some text, more text; yet more", p(, ;)
                        Here we're closer to parsing ordinary language, where the different punctuation signs don't have equal import. So, often you would have to add your own code based on what you want to ignore and/or what you want to do. Stata can't, and doesn't want to, decide that for you.





                        • #13
                          Joro, I recognized that in post #1 you indeed asked

                          how to split a local that is too long (for some purpose) into a couple of shorter locals
                          and you went on to explain

                          I would need to split my local 'second' into locals that are shorter than 254 elements, because this is the maximum arguments that the -inlist- function can take.
                          That is true if you're restricted to using inlist(), but inlist() is not the only tool at your disposal, and you have not shown anything that suggests inlist() is required for your purposes.

                          The objective of my post #10 was to point out that the macro list extended macro function allows the user to look up a value in a macro with more than 254 elements.

                          There are, we learn, people who follow Statalist to improve their knowledge of Stata. I was one of them once, and for that matter, still am, and it was through a reply on Statalist that I learned of the macro list extended macro functions.

                          Consequently, I thought it important that the alternative to using inlist to look up a value in a list be made explicit, and in general, to highlight the capabilities of the macro list extended macro function, which extend far beyond looking up items in a list. In particular, the tools for creating the union or intersection of two lists have been invaluable to me.

                          So, to anyone reading this, if you haven't done so already, do take a look at the output of help macro list. You may not need it now, but it is a good tool to know about, and it's sort of buried two clicks deep under help macro.
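
                          As a brief sketch of the union and intersection operators mentioned above (the lists A and B here are made-up examples):

                          Code:
                          local A "1 2 3 4"
                          local B "3 4 5 6"
                          local U : list A | B   // union
                          local I : list A & B   // intersection
                          display "`U'"          // 1 2 3 4 5 6
                          display "`I'"          // 3 4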



                          • #14
                            While Nick's thoughts on the behavior of tokenize are convincing, I would still argue that its behavior is inconsistent in that white space is treated differently from other parsing characters. In fact, white space is never treated as a parsing character (as defined in Mata; see below). White space is either ignored completely, in which case it does not show up in separate locals, or it is treated as a literal character. To do the latter, you must specify at least one parsing character.

                            Mata's tokenget() distinguishes white-space characters (which might be any characters, not just literal white space) from parsing characters; the former are ignored while the latter are used as delimiters and show up as separate tokens. tokenize lacks this differentiation and mixes both in the parse() option, which defines parsing characters when fed characters other than white space, and treats white space as a literal character in that case.

                            I tend to think of

                            Code:
                            tokenize "some string" , parse( ,)
                            in Stata as being equivalent to

                            Code:
                            t = tokeninit(" ", ",")
                            ...
                            in Mata, when it should be

                            Code:
                            t = tokeninit("", (" ", ","))
                            ...
                            that is, parse on spaces and commas; tokenize cannot do this.

                            Also,

                            Code:
                            tokenize "some string" , parse(,) // note omitted white space
                            is the equivalent to

                            Code:
                            t = tokeninit("", ",")
                            ...
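
                            To make the fragments above concrete, here is a self-contained sketch of the tokenget() loop they allude to (the input string is invented for illustration; the comma, being a parsing character, is returned as a token in its own right):

                            Code:
                            mata:
                            t = tokeninit("", ",")   // no white-space chars; comma parses
                            tokenset(t, "1,2,3")
                            while ((tok = tokenget(t)) != "") {
                                if (tok != ",") printf("%s\n", tok)
                            }
                            end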
                            Best
                            Daniel
                            Last edited by daniel klein; 22 Jan 2019, 14:26. Reason: reference to Nick Cox disappeared when editing the post



                            • #15
                              StataCorp learn from their experience. They also don't change commands unless it is needed. So, my guess is that tokenget() is more subtle partly because its design benefits from more programming experience and partly because that is the way that Stata is changing long-term. If you need and want lower-level tools, they are increasingly likely to be in Mata.

                              I've got a dim recollection of suggesting to StataCorp several years ago that tokenize be extended to allow specification of a stub, so that its results weren't inevitably in local macros 1, 2, 3, ... but in say part1, part2, part3, ... -- and also to allow specifying that non-space parsing characters be discarded and not put in a local macro by themselves. The first little idea was written up in tknz (SSC).
