  • Better parsing results with string vector?

    Part of a program I'm writing needs a parsing function to separate code-based data requests (delimited by ","). While I've been able to write a function that produces locals for each request, I understand there is a simpler way using either tokenget() or tokengetall(). With that approach, though, the advantage of getting a string vector back is outweighed by my not knowing how and why to call those functions from the containing program. I've read the tokenget() section of the Mata manual and haven't been able to reproduce anything that gives the (tokens[1], tokens[2], ...) result.

    In the code below, the data strings I need are simply the locals displayed manually at the bottom. How can I go about using the token-based functions in future programs, and how do I call their results outside the function? Any other immediate thoughts?
    Code:
    local abc "SMS1,SMS2"
    
    mata:
    function blstokenize(string scalar txt, real scalar snum) {
        real scalar z1, z2, ct
        ct = 0
        st_strscalar("txt", txt)
        
        while (ct < snum) {
            ct = ct + 1
            z1 = strpos(st_strscalar("txt"), ",")    // position of the next delimiter
            if (z1 > 0) {
                // chunk up to and including the comma
                st_strscalar("y1", substr(st_strscalar("txt"), 1, z1))
                z2 = z1 - 1
                // the request itself, without the trailing comma
                st_strscalar("series", substr(st_strscalar("y1"), 1, z2))
            }
            else {
                // last (or only) request: no comma left
                st_strscalar("y1", st_strscalar("txt"))
                st_strscalar("series", st_strscalar("y1"))
            }
            // hand the request back to Stata as local series1, series2, ...
            st_local("series" + strofreal(ct), st_strscalar("series"))
            // strip the processed chunk (note: y1 is treated as a regular expression)
            st_strscalar("txt", ustrregexra(st_strscalar("txt"), st_strscalar("y1"), ""))
        }
    }
    
    blstokenize(st_local("abc"), 2)
    printf("%s\n", st_local("series1"))
    printf("%s\n", st_local("series2"))
    
    end

  • #2
    I, too, found the token functions annoying.
    So I wrote my own regex-based function, which is part of the lmatrixtools package.

    Code:
    . local abc "SMS1,SMS2,SMS3,SMS4"
    
    . mata mata clear
    
    . mata:
    ------------------------------------------------- mata (type end to exit) -----------------------------------------------
    :     string rowvector nhb_muf_tokensplit(string scalar txt, string scalar delimiter)
    >     {
    >         string vector  row
    >         string scalar filter
    >         row = J(1,0,"")
    >         filter = sprintf("(.*)%s(.*)", delimiter)
    >         while (regexm(txt, filter)) {
    >             txt = regexs(1)
    >             row = regexs(2), row
    >         }
    >         row = txt, row
    >         return(row)
    >     }
    
    : 
    : snum = 2
    
    : nhb_muf_tokensplit(st_local("abc"), ",")[1..snum]
              1      2
        +---------------+
      1 |  SMS1   SMS2  |
        +---------------+
    
    : end
    --------------------------------------------------------------------------------------------------------------------------
    However, I think this code is more efficient
    Code:
    . local abc "SMS1,SMS2,SMS3,SMS4"
    
    . mata mata clear
    
    . mata:
    ------------------------------------------------- mata (type end to exit) -----------------------------------------------
    : 
    : function blstokenize(string scalar txt, real scalar snum) {
    >         string colvector scv
    >         return(select(scv=(tokens(txt, ",")'), scv :!= ",")'[1..snum])
    > }
    
    : 
    : blstokenize(st_local("abc"),2)
              1      2
        +---------------+
      1 |  SMS1   SMS2  |
        +---------------+
    
    : end
    --------------------------------------------------------------------------------------------------------------------------
    Kind regards

    nhb



    • #3
      Originally posted by Niels Henrik Bruun View Post
      I too, found the token function annoying.
      So I wrote my own function using regex, which is part of the package lmatrixtools.

      [...]

      The problem with tools based on regex*() or parsing one character (i.e., byte) at a time, as suggested in #1, is that they cannot readily handle binding by quotes, parentheses, or the like. Mata's tokenget() machinery has all of this implemented, and I have hardly had any problems with it.
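
      For illustration, a minimal sketch of what the tokenget() route could look like for the comma-delimited case in #1 (the settings and names here are only examples; the full machinery is documented in [M-5] tokenget()):

      Code:
      local abc "SMS1,SMS2"
      
      mata:
          // set up a parser: space as whitespace, "," as a parsing character,
          // and double quotes as binding quotes
          t = tokeninit(" ", ",", `"""')
          tokenset(t, st_local("abc"))
          toks = tokengetall(t)               // string rowvector of all tokens
          toks = select(toks, toks :!= ",")   // drop the commas themselves
          toks
      end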

      The regex*() approach might also require special care when the delimiter itself needs to be escaped, e.g., when it is ".", "?", or another character that is special in regular expressions.
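
      With the function in #2, for instance, a literal "." would need to be passed in escaped form; something along these lines:

      Code:
      // splitting on a literal "." requires escaping it, because the filter
      // "(.*).(.*)" would otherwise treat the dot as "any character"
      nhb_muf_tokensplit("SMS1.SMS2", "\.")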



      Originally posted by Niels Henrik Bruun View Post
      However, I think this code is more efficient
      Code:
      . local abc "SMS1,SMS2,SMS3,SMS4"
      
      . mata mata clear
      
      . mata:
      ------------------------------------------------- mata (type end to exit) -----------------------------------------------
      :
      : function blstokenize(string scalar txt, real scalar snum) {
      >         string colvector scv
      >         return(select(scv=(tokens(txt, ",")'), scv :!= ",")'[1..snum])
      > }
      
      :
      : blstokenize(st_local("abc"),2)
                1      2
          +---------------+
        1 |  SMS1   SMS2  |
          +---------------+
      
      : end
      --------------------------------------------------------------------------------------------------------------------------
      The code will fail if there are fewer than snum columns.
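
      A guard along these lines would avoid that (blstokenize2 is just an illustrative name):

      Code:
      function blstokenize2(string scalar txt, real scalar snum) {
          string rowvector srv
          srv = tokens(txt, ",")
          srv = select(srv, srv :!= ",")
          // never request more tokens than were actually found
          return(srv[1..min((snum, cols(srv)))])
      }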


      I do not fully understand what #1 is looking for in terms of the interplay between Stata and Mata.

      EDIT/Added:

      From the information in #1, a one-liner might suffice:

      Code:
      tokens(subinstr(st_local("abc"), ",", " ", .))[1..2]
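
      And if the goal, as in #1, is to end up with locals series1, series2, ..., the result could be handed back to Stata with st_local(); roughly:

      Code:
      mata:
          toks = tokens(subinstr(st_local("abc"), ",", " ", .))
          // store each token in a local series1, series2, ...
          for (i = 1; i <= 2; i++) st_local("series" + strofreal(i), toks[i])
      end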
      Last edited by daniel klein; 22 Jun 2023, 01:41.



      • #4
        This is a nice chunk of Mata, and your answers, I suppose, get me to where I need to be. Most basically, my preference for the macro solution stems from most of the program running in Stata, which makes the post title a mite uninformed. So, there is nothing in Stata that cannot be accomplished using Mata, correct?



        • #5
          Originally posted by Eric Makela View Post
          So, there is nothing in Stata that cannot be accomplished using Mata, correct?
          Yes. At least in the sense that Mata (a) is a broader programming language and (b) can call stata() if necessary or convenient.
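
           For example, Mata code can hand any command string straight back to Stata:

           Code:
           mata:
               // run an ordinary Stata command from within Mata
               stata(`"display "hello from Stata""')
           end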

          Likewise, what you need (which is still not completely clear to me) could probably be done in Stata. Consider

          Code:
          . local abc "SMS1,SMS2"
          
          .
          . local snum 2
          
          .
           . forvalues i = 1/`snum' {
             2.     gettoken series`i' abc : abc , parse(",")
             3.     gettoken comma     abc : abc , parse(",")
             4. }
          
          .
          . display "series1: `series1'"
          series1: SMS1
          
          . display "series2: `series2'"
          series2: SMS2
          
          .
          end of do-file
          or

          Code:
          . 
          . local commas_removed : subinstr local abc "," " " , all
          
          . tokenize "`commas_removed'"
          
          . 
          . display "`1'"
          SMS1
          
          . display "`2'"
          SMS2
          
          . 
          end of do-file
          Last edited by daniel klein; 23 Jun 2023, 05:51.

