  • Better parsing results with string vector?

    Part of a program I'm writing needs a parsing function to separate code-based data requests (delimited by ","). While I've been able to write a function that produces locals for each request, I understand there is a simpler way using either tokenget() or tokengetall(). With that approach, though, the advantage of getting a string vector back is outweighed by my not knowing how and why to call those functions from the containing program. I've read the tokenget() section of the Mata manual and haven't been able to reproduce anything that gives the (tokens[1], tokens[2], ...) result.

    In the code below, the data strings I need are simply the locals displayed manually at the bottom. How can I go about using the token-based functions in future programs, and how do I call their results outside the function? Any other immediate thoughts?
    Code:
    local abc "SMS1,SMS2"
    
    mata:
    function blstokenize(string scalar txt, real scalar snum) {
        real scalar z1, z2, ct
        ct = 0
        st_strscalar("txt", txt)
        
        while (ct < snum) {
            ct = ct + 1
            z1 = strpos(st_strscalar("txt"), ",")    // position of the next delimiter
            if (z1 > 0) {
                // chunk up to and including the comma
                st_strscalar("y1", substr(st_strscalar("txt"), 1, z1))
                z2 = z1 - 1
                // the request itself, without the trailing comma
                st_strscalar("series", substr(st_strscalar("y1"), 1, z2))
            }
            else {
                // last (or only) request: no comma left
                st_strscalar("y1", st_strscalar("txt"))
                st_strscalar("series", st_strscalar("y1"))
            }
            // hand the request back to Stata as local series1, series2, ...
            st_local("series" + strofreal(ct), st_strscalar("series"))
            // strip the processed chunk (note: y1 is treated as a regular expression)
            st_strscalar("txt", ustrregexra(st_strscalar("txt"), st_strscalar("y1"), ""))
        }
    }
    
    blstokenize(st_local("abc"), 2)
    printf("%s\n", st_local("series1"))
    printf("%s\n", st_local("series2"))
    
    end

  • #2
    I, too, found the token functions annoying.
    So I wrote my own regex-based function, which is part of the lmatrixtools package.

    Code:
    . local abc "SMS1,SMS2,SMS3,SMS4"
    
    . mata mata clear
    
    . mata:
    ------------------------------------------------- mata (type end to exit) -----------------------------------------------
    :     string rowvector nhb_muf_tokensplit(string scalar txt, string scalar delimiter)
    >     {
    >         string vector  row
    >         string scalar filter
    >         row = J(1,0,"")
    >         filter = sprintf("(.*)%s(.*)", delimiter)
    >         while (regexm(txt, filter)) {
    >             txt = regexs(1)
    >             row = regexs(2), row
    >         }
    >         row = txt, row
    >         return(row)
    >     }
    
    : 
    : snum = 2
    
    : nhb_muf_tokensplit(st_local("abc"), ",")[1..snum]
              1      2
        +---------------+
      1 |  SMS1   SMS2  |
        +---------------+
    
    : end
    --------------------------------------------------------------------------------------------------------------------------
    However, I think this code is more efficient
    Code:
    . local abc "SMS1,SMS2,SMS3,SMS4"
    
    . mata mata clear
    
    . mata:
    ------------------------------------------------- mata (type end to exit) -----------------------------------------------
    : 
    : function blstokenize(string scalar txt, real scalar snum) {
    >         string colvector scv
    >         return(select(scv=(tokens(txt, ",")'), scv :!= ",")'[1..snum])
    > }
    
    : 
    : blstokenize(st_local("abc"),2)
              1      2
        +---------------+
      1 |  SMS1   SMS2  |
        +---------------+
    
    : end
    --------------------------------------------------------------------------------------------------------------------------
    Kind regards

    nhb



    • #3
      Originally posted by Niels Henrik Bruun View Post
      I too, found the token function annoying.
      So I wrote my own function using regex, which is part of the package lmatrixtools.

      [...]

      The problem with tools based on regex*() or parsing one character (i.e., byte) at a time, as suggested in #1, is that they cannot readily handle binding by quotes, parentheses, or the like. Mata's tokenget() machinery has all of this implemented, and I have hardly had any problems with it.
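
      For illustration, a minimal sketch of what the tokenget() route could look like for the comma-delimited case in #1 (the settings and names here are only examples; the full machinery is documented in [M-5] tokenget()):

      Code:
      local abc "SMS1,SMS2"
      
      mata:
          // set up a parser: space as whitespace, "," as a parsing character,
          // and double quotes as binding quotes
          t = tokeninit(" ", ",", `"""')
          tokenset(t, st_local("abc"))
          toks = tokengetall(t)               // string rowvector of all tokens
          toks = select(toks, toks :!= ",")   // drop the commas themselves
          toks
      end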

      The regex*() approach might also require special care when the delimiter itself needs to be escaped, e.g., when it is ".", "?", or another character that is special in regular expressions.
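
      With the function in #2, for instance, a literal "." would need to be passed in escaped form; something along these lines:

      Code:
      // splitting on a literal "." requires escaping it, because the filter
      // "(.*).(.*)" would otherwise treat the dot as "any character"
      nhb_muf_tokensplit("SMS1.SMS2", "\.")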



      Originally posted by Niels Henrik Bruun View Post
      However, I think this code is more efficient
      Code:
      . local abc "SMS1,SMS2,SMS3,SMS4"
      
      . mata mata clear
      
      . mata:
      ------------------------------------------------- mata (type end to exit) -----------------------------------------------
      :
      : function blstokenize(string scalar txt, real scalar snum) {
      >         string colvector scv
      >         return(select(scv=(tokens(txt, ",")'), scv :!= ",")'[1..snum])
      > }
      
      :
      : blstokenize(st_local("abc"),2)
                1      2
          +---------------+
        1 |  SMS1   SMS2  |
          +---------------+
      
      : end
      --------------------------------------------------------------------------------------------------------------------------
      The code will fail if there are fewer than snum columns.
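
      A guard along these lines would avoid that (blstokenize2 is just an illustrative name):

      Code:
      function blstokenize2(string scalar txt, real scalar snum) {
          string rowvector srv
          srv = tokens(txt, ",")
          srv = select(srv, srv :!= ",")
          // never request more tokens than were actually found
          return(srv[1..min((snum, cols(srv)))])
      }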


      I do not fully understand what #1 is looking for in terms of the interplay between Stata and Mata.

      EDIT/Added:

      From the information in #1, a one-liner might suffice:

      Code:
      tokens(subinstr(st_local("abc"), ",", " ", .))[1..2]
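
      And if the goal, as in #1, is to end up with locals series1, series2, ..., the result could be handed back to Stata with st_local(); roughly:

      Code:
      mata:
          toks = tokens(subinstr(st_local("abc"), ",", " ", .))
          // store each token in a local series1, series2, ...
          for (i = 1; i <= 2; i++) st_local("series" + strofreal(i), toks[i])
      end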
      Last edited by daniel klein; 22 Jun 2023, 01:41.



      • #4
        This is a nice chunk of Mata, and your answers, I suppose, get me to where I need to be. Most basically, my preference for the macro solution stems from most of the program running in Stata, which makes the post title a mite uninformed. So, there is nothing in Stata that cannot be accomplished using Mata, correct?



        • #5
          Originally posted by Eric Makela View Post
          So, there is nothing in Stata that cannot be accomplished using Mata, correct?
          Yes. At least in the sense that Mata (a) is a broader programming language and (b) can call stata() if necessary or convenient.
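
           For example, Mata code can hand any command string straight back to Stata:

           Code:
           mata:
               // run an ordinary Stata command from within Mata
               stata(`"display "hello from Stata""')
           end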

          Likewise, what you need (which is still not completely clear to me) could probably be done in Stata. Consider

          Code:
          . local abc "SMS1,SMS2"
          
          .
          . local snum 2
          
          .
           . forvalues i = 1/`snum' {
             2.     gettoken series`i' abc : abc , parse(",")
             3.     gettoken comma     abc : abc , parse(",")
             4. }
          
          .
          . display "series1: `series1'"
          series1: SMS1
          
          . display "series2: `series2'"
          series2: SMS2
          
          .
          end of do-file
          or

          Code:
          . 
          . local commas_removed : subinstr local abc "," " " , all
          
          . tokenize "`commas_removed'"
          
          . 
          . display "`1'"
          SMS1
          
          . display "`2'"
          SMS2
          
          . 
          end of do-file
          Last edited by daniel klein; 23 Jun 2023, 05:51.

