st: RE: regular expressions in Stata

Scott wrote:
Does anyone know how regular expressions are implemented in Stata?
......
Off the list, Kevin Turner, StataCorp, emailed me the following, which
answers Scott's question:
Getting more to the technical details, the areas where our RE parser is
not POSIX compliant are:
1) No support for what is called a 'bound', which is the curly-brace
   syntax {#} that denotes a count of items to be matched.
2) No support for character classes within bracket expressions.
   [:alnum:], [:digit:], and [:alpha:] are all examples. This is also
   very similar to Perl's use of \w, \W, \s, etc. to denote character
   classes. I don't believe Perl's syntax is POSIX, however. I would
   have to double-check that.
3) Any obscure syntax rules that relate to brackets, but as I read the
   spec, these are usually the result of character classes.
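
As a rough aid, the first two items in the list above can be checked for mechanically before a pattern is handed to Stata. The sketch below is written in Python purely for illustration; the helper name unsupported_constructs is hypothetical, and the scan is deliberately naive (it does not account for escaped braces or brackets).

```python
import re

# Hypothetical helper, not part of Stata: flags the two constructs the
# post lists as unsupported. The scan is naive and ignores escaping.

# A bound such as {3}, {2,} or {2,5}
_BOUND = re.compile(r"\{[0-9]+(,[0-9]*)?\}")
# A POSIX character class name such as [:alnum:] or [:digit:]
_CLASS = re.compile(r"\[:[a-z]+:\]")


def unsupported_constructs(pattern):
    """Return a list naming the unsupported constructs found in pattern."""
    found = []
    if _BOUND.search(pattern):
        found.append("bound")
    if _CLASS.search(pattern):
        found.append("character class")
    return found


print(unsupported_constructs("[0-9]{3}"))       # uses a bound
print(unsupported_constructs("[[:digit:]]+"))   # uses a character class
print(unsupported_constructs("[0-9][0-9]*"))    # uses neither
```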
Stata's RE parser (which is derived from Spencer's) has all of the
basic RE syntax items:
1) Atoms for matching zero or more, one or more, or zero or one: * + ?
2) Subexpressions denoted by parentheses. Btw, subexpression 0 will
   always return the entire string matched by the RE.
3) Branches, which are denoted with pipes: |
4) Atoms for beginning of line and end of line: ^$
5) Atom for matching any character, which is represented as a period.
6) Support for 'escaping' any reserved character with a backslash.
   For example, denoting a literal dollar sign could be done with \$
7) Support for bracket expressions, which are used to list a collection
   of valid characters to match. [0-9a-z] is an example. [abc] is
   another.
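
To make the seven items above concrete, here is a small sketch that exercises each of them. It uses Python's re module only as a stand-in, on the assumption that these basic constructs behave the same way there for simple cases; it is not Stata code, and the helper first_match is a made-up name for this illustration.

```python
import re


def first_match(pattern, text):
    """Return the text matched by the whole pattern, or None if no match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


results = {
    "star": first_match(r"ab*c", "xacx"),            # 1) zero or more b
    "plus": first_match(r"ab+c", "xabbcx"),          # 1) one or more b
    "optional": first_match(r"colou?r", "color"),    # 1) zero or one u
    "branch": first_match(r"cat|dog", "hotdog"),     # 3) either branch
    "anchors_hit": first_match(r"^abc$", "abc"),     # 4) whole line is abc
    "anchors_miss": first_match(r"^abc$", "xabc"),   # 4) extra leading x
    "dot": first_match(r"a.c", "a-c"),               # 5) any one character
    "escape": first_match(r"\$[0-9]+", "cost $45"),  # 6) literal dollar sign
    "bracket": first_match(r"[0-9a-z]+", "ABC123xyz"),  # 7) listed characters
}

# 2) Subexpressions: 0 is the entire match, 1 and 2 are the parenthesized parts.
m = re.search(r"([a-z]+)-([0-9]+)", "id: abc-123.")
sub0, sub1, sub2 = m.group(0), m.group(1), m.group(2)

for name, value in results.items():
    print(name, repr(value))
print(repr(sub0), repr(sub1), repr(sub2))
```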
So, to sum it up, the few areas where we are not POSIX compliant are
really in what I would term the 'shortcut syntax' of the POSIX
specification. In other words, you may not have a counting syntax with
curly braces, but you can write out the long form of the RE to match
the number you wish. Also, you might not have a shortcut class for all
alphanumeric characters with [:alnum:], but you can certainly write the
long form, which is [0-9a-zA-Z].
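
The long-form rewriting described above can be sketched as a pair of small helpers. This is only an illustration in Python, not anything shipped with Stata: the names expand_bound, expand_classes, and POSIX_CLASSES are hypothetical, the class table covers just the three classes mentioned in this post, and the long forms assume plain ASCII letters and digits.

```python
def expand_bound(atom, low, high=None):
    """Write atom{low,high} in long form using only repetition and '?'.

    atom must be a single atom: one character, a bracket expression, or a
    parenthesized subexpression. With high omitted, the result matches
    exactly `low` copies.
    """
    if high is None:
        high = low
    if low < 0 or high < low:
        raise ValueError("need 0 <= low <= high")
    # `low` required copies, then (high - low) optional copies.
    return atom * low + (atom + "?") * (high - low)


# Assumed ASCII long forms for the classes named in the post.
POSIX_CLASSES = {
    "[:digit:]": "0-9",
    "[:alpha:]": "a-zA-Z",
    "[:alnum:]": "0-9a-zA-Z",
}


def expand_classes(pattern):
    """Replace each known [:class:] name with its long-form range list."""
    for name, long_form in POSIX_CLASSES.items():
        pattern = pattern.replace(name, long_form)
    return pattern


print(expand_bound("[0-9]", 3))         # long form of [0-9]{3}
print(expand_bound("[0-9]", 2, 4))      # long form of [0-9]{2,4}
print(expand_classes("[[:alnum:]]+"))   # long form of [[:alnum:]]+
```

The resulting strings use only the constructs listed earlier in the post (bracket expressions, ?, and plain repetition), so they stay within the syntax the parser is said to support.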
..Frank