
  • Tokenize and trimming

    I need to feed different strings to strpos in a loop to create variables. I found an old question on Statalist about the tokenize command's odd behaviour with the parse character, and I could work around it using the idea given there. In my humble opinion it is still a serious bug, but you may disagree (more heresy, but I would highly recommend the functionality of SAS's scan function to the Stata developers). However, as far as I can see, tokenize trims the leading spaces off the tokens, and that is precisely why I would like to use a parse character other than the blank space. May I ask for your help on how to keep all the blanks in the tokens?

    Code:
    local regexp = "abc|a.b.| ab | ba| cd |.cd "
    tokenize `regexp', parse("|")
    while "`*'" ~= "" {
     disp "*`1'*"
     macro shift
    }
    Thanks indeed!

    Kazi

  • #2
    Welcome to Statalist!

    Perhaps if you described more fully the ultimate problem you are trying to solve - creating variables using strpos - someone here could suggest an approach that capitalizes on Stata's capabilities and avoids the problem you are experiencing with tokenize.



    • #3
      In Mata this is easy

      Code:
      version 12.1
      
      mata :
      
      string rowvector mytokenize(string scalar s)
      {
          transmorphic scalar t
          
          t = tokeninit("", "|")
          tokenset(t, s)
          return(tokengetall(t))
      }
      
      end
      
      mata : mytokenize("abc|a.b.| ab | ba| cd |.cd ")
      and I am sure you can almost as easily replicate the referenced SAS function (although I have never used SAS nor looked into the function more deeply).
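      For instance, here is a minimal sketch of a scan()-like function; the name myscan and my reading of SAS's semantics are assumptions. It returns the n-th token of s, treating the characters in delims as delimiters:

      Code:
      version 12.1
      
      mata :
      
      // hypothetical scan()-like helper: return the n-th token of s,
      // splitting on the characters in delims
      string scalar myscan(string scalar s, real scalar n, string scalar delims)
      {
          transmorphic scalar t
          string rowvector toks
      
          t = tokeninit(delims)
          tokenset(t, s)
          toks = tokengetall(t)
          return(n >= 1 & n <= cols(toks) ? toks[n] : "")
      }
      
      end
      
      mata : myscan("abc|a.b.| ab | ba| cd |.cd ", 3, "|")
      Because delims is passed as tokeninit()'s whitespace characters, the delimiters are dropped while any blanks inside the tokens are preserved.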

      I can see that you probably do not wish to program these details. Tell us more about the problem and someone might invest a little more time for a tailored solution.

      Best
      Daniel



      • #4
        Dear William and Daniel!

        Thank you for your kind and prompt answers! The main goal is to extract numerical identifiers from string variables; the identifiers appear right after certain pieces of text. So I search with strpos for the location of the relevant part, create a substring starting from that position with substr, and look for the digits with regexm. To do this in a loop, I thought it was a good idea to tokenize a string containing the "keywords" and pass each token to strpos as an argument. But the leading blank is trimmed, so I get false matches...

        For example:

        Code:
        gen a = "some random text and the identifier comes ba 1234"
        gen b = strpos(a, " ba")
        gen c = substr(a, b, .)
        gen d = regexs(1) if regexm(c, "([0-9]+)")
        Kazi



        • #5
          Not entirely clear to me.

          How do you know what the relevant part is that you then search for with strpos? Why do you need the substrings, anyway? The example above boils down to

          Code:
          gen a = "some random text and the identifier comes ba 1234"
          generate d = regexs(1) if regexm(a, "([0-9]+)")
          This assumes the identifier is the first and only numeric part in the string variable.

          Maybe you could show an example illustrating the problem more clearly?

          Anyway, you should have a look at the split command and also at moss (SSC).
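          For instance, a quick sketch with moss (assuming it is installed via ssc install moss, and reusing the variable a from the example above); it extracts every run of digits, not just the first one:

          Code:
          * each digit run goes into _match1, _match2, ...,
          * with the number of matches in _count
          moss a, match("([0-9]+)") regex
          list _count _match*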

          Best
          Daniel



          • #6
            Imagine identifiers called POB and ZIP:
            raw string                                                                                            POB     ZIP
            "Company A is located in 1234 London, POB 98765, number of employees 13"                             98765   .
            "It was said that Factory EPOB was fined for 100 dollars, its identifiers are: POB 6543, ZIP 7890"   6543    7890
            "I saw random letters on Statalist: ZPOBZIP, but I wanted to give 3 examples instead"                .       .
            By requiring a leading blank, you can avoid matching a word that merely ends with POB or ZIP by chance (such as EPOB in the second example). With split you can split string variables, but, if I'm not mistaken, you can't slice a string into parts to feed them into local macros.
            Thanks again for your help!

            Kazi



            • #7
              How about extending Daniel's example code to
              Code:
              generate ZIP=regexs(1) if regexm(a," ZIP ([0-9]+)")
              generate POB=regexs(1) if regexm(a," POB ([0-9]+)")
              Or is this too simple? I agree with Daniel: Why bother with the position? All you need is to extract a part of a string following a fixed (regular) expression.

              Regards
              Bela



              • #8
                But this is still just

                Code:
                clear
                inp str244 s
                "Company A is located in 1234 London, POB 98765, number of employees 13"
                "It was said that Factory EPOB was fined for 100 dollars, its identifiers are: POB 6543, ZIP 7890"
                "I saw random letters on Statalist: ZPOBZIP, but I wanted to give 3 examples instead"
                end
                
                list
                
                generate POB = regexs(1) if regexm(s, " POB ([0-9]+)")
                generate ZIP = regexs(1) if regexm(s, " ZIP ([0-9]+)")
                
                list
                Best
                Daniel


                By the way, see dataex (SSC) for the preferred way to show data examples here on Statalist.



                • #9
                  Daniel (Bela) was quicker.



                  • #10
                    Thanks, you're perfectly right that the substr part is redundant (I was keeping it for a visual check). The problem is that I have 6 different keywords to search for. One ugly way to solve it is to copy-paste these lines 6 times. The other, more elegant way, I thought, is to save a string containing all the keywords in a local macro, chop it up using tokenize, and feed the tokens to regexm. But if the leading blank is trimmed from the individual tokens, I also get matches where the keyword is merely the end of some random word.



                    • #11
                      Code:
                      local keywords POB ZIP FOO BAR
                      foreach kw of local keywords {
                          generate `kw' = regexs(1) if regexm(a, " `kw' ([0-9]+)")
                      }
                      Best
                      Daniel



                      • #12
                        Or, making every single regex explicit, something like this:
                        Code:
                        local searchstrings `"pob=" POB ([0-9]+)"|zip=" ZIP ([0-9]+)"|abc=" a([0-9]+)" "'
                        
                        while (!missing(`"`searchstrings'"')) {
                            gettoken entry searchstrings : searchstrings , parse("|") quotes
                            if (`"`entry'"'=="|") continue
                            display as text `"working on search element: {it:`entry'}"'
                            gettoken varname regex : entry , parse("=") quotes
                            local regex=substr(`"`regex'"',2,.)
                            display `"varname to be generated: {it:`varname'}"'
                            display `"regex to be used: |{it:`regex'}|"'
                            generate `varname'=regexs(1) if regexm(stringvar,`regex')
                        }
                        Or in other words: maybe your original plan simply did not work out because of quoting? Also, I would go with -gettoken- instead of -tokenize-, but that may be a matter of personal preference.

                        Regards
                        Bela



                        • #13
                          I suppose Daniel Klein didn't get my point about the problem of the leading and trailing blanks, but based on his solution and a Statalist thread I managed to find my own!

                          Code:
                          local kw `" " POB"  "ZIP "  "  FOO " " BAR ""'
                          foreach k of local kw {
                           disp "*`k'*"
                          }
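                          For completeness, a hypothetical sketch of how these quoted tokens could feed regexm (the generated names id1, id2, ... are purely illustrative):

                          Code:
                          local kw `" " POB " " ZIP ""'
                          local i = 1
                          foreach k of local kw {
                              * `k' keeps its surrounding blanks thanks to the quoting
                              generate id`i' = regexs(1) if regexm(a, "`k'([0-9]+)")
                              local ++i
                          }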
                          I will check gettoken as well, thanks for the hint!

                          Thank you guys!

                          Just as a disclaimer, I maintain that it's a bug in tokenize that it treats the parsing characters as separate tokens and this trimming thing is also rather odd...



                          • #14
                            Just as a disclaimer, I maintain that it's a bug in tokenize that it treats the parsing characters as separate tokens and this trimming thing is also rather odd...
                            From the pedantry corner: I understand the term "bug" to mean that a program does not perform in accordance with its description. But -tokenize- does work exactly as described in the user's manual. So I think the way to describe Kazi's objections is as a possible design defect, not a bug.

                            More substantively, -tokenize- is a very old command. It goes back to at least version 4 and, for all I know, earlier. In those early days, the -syntax- command did not yet exist (or if it did, I was unaware of it), so programmers often had to do a lot of work with -tokenize- to parse the command lines for programs they wrote. For that matter, we didn't have -foreach- back then, and loops were often done by -tokenize-ing a list and then using a -while- structure in conjunction with a counter (sketched below).

                            In those contexts, the design decision to strip blanks but retain other parsing characters was actually quite helpful nearly all the time; putting the blanks into separate tokens, or retaining them as part of the tokens, would have been extremely inconvenient. Today -tokenize- is less used for these older purposes, and perhaps a case can be made for a new command that works more along the lines Kazi would like. Even so, I would oppose changing the behavior of -tokenize- itself, to preserve compatibility for older programs that rely on its current behavior.
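                            A minimal sketch of that older idiom (the list is purely illustrative):

                            Code:
                            * tokenize a list, then walk the numbered macros with a
                            * counter, as was common before -foreach- existed
                            tokenize "apple banana cherry"
                            local i = 1
                            while "``i''" != "" {
                                display "``i''"
                                local i = `i' + 1
                            }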



                            • #15
                              Of course this can be called "rather odd", but Stata always strips off leading whitespace when saving text to a local macro, unless this text is quoted.
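                              A minimal illustration of that stripping behaviour (the local names are arbitrary):

                              Code:
                              local u      hello       // leading blanks are stripped
                              local q `"   hello"'     // compound quotes preserve them
                              display "*`u'*"          // shows *hello*
                              display `"*`q'*"'        // shows *   hello*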

                              If you change your original code to use quoting around each element, I think it does what you originally wanted it to:
                              Code:
                              local regexp `""abc"|"a.b."|" ab "|" ba"|" cd "|".cd ""'
                              tokenize `"`regexp'"', parse("|")
                              while `"`*'"' ~= `""' {
                               disp `"*`1'*"'
                               macro shift
                              }
                              The fact that the pipes in your example are parsed as tokens is documented. The PDF documentation manual states in [P] tokenize:
                              These examples illustrate that the quotes surrounding the string are optional; the space parsing
                              character is not saved in the numbered macros; nonspace parsing characters are saved in the numbered
                              macros together with the tokens being parsed; and more than one parsing character may be specified.
                              So, again, this is documented in the manual. I would not call this a bug. Anyway, with correct quoting in your original example, you could simply parse on spaces and be fine:
                              Code:
                              local regexp `""abc" "a.b." " ab " " ba" " cd " ".cd ""'
                              tokenize `"`regexp'"'
                              while `"`*'"' ~= `""' {
                                  disp `"*`1'*"'
                                  macro shift
                              }
                              Anyway, Daniel Klein's and your -foreach- way of iterating through the elements (with correct quoting!) is likely the most efficient way to achieve what you're after.

                              Regards
                              Bela

