built-in variables

William Lisowski

Join Date: Dec 2014

Posts: 10150
#16

16 Mar 2015, 20:59

Post #3 in this topic includes a regular expression from Clyde Schechter that uses at least one feature not found in the FAQ: an initial caret within a bracketed group, which indicates the complement of the set of characters given.

Code:

[^0-9]
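As a minimal sketch of what the leading caret does inside brackets, the following uses Python's re module rather than Stata's regexm(). That substitution is an assumption made purely for illustration, and the helper names and sample strings are hypothetical. Python's engine is not a POSIX implementation, but it treats a leading caret in a bracketed set the same way.

```python
import re

# Illustration in Python's re module, NOT Stata's regexm().
# [^0-9] matches any single character that is not a digit.
non_digit = re.compile(r"[^0-9]")


def has_non_digit(text):
    """Return True if text contains at least one non-digit character."""
    return non_digit.search(text) is not None


def strip_non_digits(text):
    """Remove every non-digit character from text."""
    return non_digit.sub("", text)


print(has_non_digit("12345"))              # False
print(has_non_digit("123a45"))             # True
print(strip_non_digits("(555) 010-9999"))  # 5550109999
```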

At the end of the topic, Clyde wrote in response to my question

When I need to refresh my memory I just Google regular expressions POSIX ...

With that in mind, I reread help regexm and found

Regular expression syntax is based on Henry Spencer's NFA algorithm, and this is nearly identical to the POSIX.2 standard.

which further work with Google confirms is far beyond the limited description given in the FAQ. Before that, given the material in the FAQs turned up by search regular expression, I'd mistakenly assumed that the POSIX.2 standard was an earlier form of the current POSIX regular expression standard. That was the only way I could read the FAQ as consistent with help regexm. I believe I played a bit with regexm after this discovery and confirmed that all the familiar friends I'd been using since my days with awk in the 1970s were indeed available in Stata.
Robert Picard

Join Date: Mar 2014

Posts: 1536
#17

16 Mar 2015, 22:38

I'm aware of the feature you just discovered, as I have used it many times on this forum. I've also noticed a need to explain how the patterns I use work, so I now put some extra effort towards that end. Here's one example and yet another. The FAQ does describe character classes:

Square brackets denote a set of allowable characters/expressions to use in matching, such as [a-zA-Z0-9] for all alphanumeric characters.

but you are correct that it does not point out that adding a caret ("^") as the first element changes the meaning to anything except the characters in the set. The FAQ also does not explain that you can use "[13579]" to match odd digits, or what to do if you need to include a literal caret or hyphen in a character set.
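A short sketch of those two omissions, using Python's re module as a stand-in. This is an assumption for illustration only: Stata's regexm() is not used here, and the names below are hypothetical. The set "[13579]" matches a single odd digit, so anchoring it at the end of a string of digits is one way to test for an odd number. For literal characters, the placement shown (a caret anywhere but first, a hyphen last) is the convention POSIX bracket expressions use, and Python's engine accepts it too.

```python
import re

# Illustration in Python's re module, NOT Stata's regexm().
# [13579]$ matches when the last character of the string is an odd digit.
ends_in_odd_digit = re.compile(r"[13579]$")

# A caret that is not first and a hyphen that is last are both literal.
caret_or_hyphen = re.compile(r"[+^-]")


def is_odd_integer(digits):
    """Return True if a string of digits ends in an odd digit."""
    return ends_in_odd_digit.search(digits) is not None


print(is_odd_integer("1247"))               # True
print(is_odd_integer("1248"))               # False
print(caret_or_hyphen.findall("2^10-1+3"))  # ['^', '-', '+']
```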

So the FAQ could be improved. I just don't understand your repeated calls to take down the FAQ. I repeat, there are no inaccuracies in the FAQ and it is a good starting point, particularly if you are new to regular expressions. It does cover the basics and more importantly makes it clear that not everything you will find on the internets will be supported.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#18

17 Mar 2015, 07:07

Stata documentation: not inaccurate.

I think Stata Corporation aspires to far more than that. I've recently worked my way through [GSM] and read my way through [U] and the other suggested reading listed at the end of [GSM]. They are well worth the effort to read.

The FAQ is apparently Stata Corporation's only attempt to define the regular expression syntax Stata supports, and it is incomplete, as noted above, and misleading because it neither acknowledges its incompleteness nor suggests how one might learn what the complete syntax is. While help regexm implies that the POSIX.2 standard is close to a basis for Stata's syntax, the FAQ suggests that the POSIX standard exceeds the scope of Stata's syntax. Neither the FAQ nor the help includes a References section, so common in other Stata documentation, to point the reader to appropriate documentation.
Nick Cox

Join Date: Mar 2014

Posts: 35697
#19

17 Mar 2015, 07:10

Orthogonal detail: the company name is "StataCorp". The Stata company is not a corporation.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#20

17 Mar 2015, 07:26

Thanks for the tip, Nick. I had thought that StataCorp was typical online shorthand when I'd noticed it in the past; I now see that the results of about show StataCorp LP as the copyright holder. Someday I shall have to incorporate as LisowskiLP Corporation to keep the universe symmetric.

I take comfort in not having typed STATA Corporation.
Nick Cox

Join Date: Mar 2014

Posts: 35697
#21

17 Mar 2015, 11:48

You just could not type that. We have ways of making your fingers go hot if you ever started.
Robert Picard

Join Date: Mar 2014

Posts: 1536
#22

17 Mar 2015, 12:31

William, I think you misinterpret the following sentence from the help file for regexm()

Regular expression syntax is based on Henry Spencer's NFA algorithm, and this is nearly identical to the POSIX.2 standard.

It says that Stata's implementation is based on a fellow Canadian's algorithm for regular expressions (Spencer's code is available here) and that Spencer's code is mostly POSIX.2 compliant (a more subtle point is that Spencer developed his version of regex independently from AT&T and made his code available to all free of royalties). It does not follow that Stata's implementation is nearly identical to POSIX.

So there is no contradiction between the help file and the FAQ, and the FAQ makes it clear that some POSIX syntax is not part of Stata's implementation. I think it bears repeating: if you want to know what's supported in Stata's implementation, start by reading the FAQ. You may certainly scour the internets for tutorials and other resources that will help you understand what can be done with the supported syntax, but you will be disappointed if you think that you will find some POSIX syntax that Stata implements but does not document. You have found one exception, but I know of no other.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#23

17 Mar 2015, 19:30

You are right, I misinterpreted the quoted passage from help regexm through the lens of seeking help understanding Stata's regular expression syntax. In conjunction with the FAQ, it provides an anecdote about the genesis of Stata's implementation of its "core" regular expression syntax. Without the FAQ, it provides no help, since searching out Henry Spencer's code and the POSIX standards takes you well beyond Stata's implementation.

What I've learned is that help regexm provides no actual help in understanding (the limitations of) Stata's regular expression implementation, beyond a few examples, and lacks a Reference entry pointing to the FAQ, which is apparently intended to be an authoritative discussion, but which itself is incomplete, and also lacks a Reference section. My assumption that Stata had moved beyond the minimal implementation described in the FAQ, based on the reference to POSIX in the help file and the undocumented feature discussed above, was overly optimistic.

But certainly, what documentation there is, is not inaccurate.

Last edited by William Lisowski; 17 Mar 2015, 19:50.
Navid Asgari

Join Date: Jul 2025

Posts: 30
#24

20 Mar 2015, 14:18

Thanks all for useful tips