A new regular expression handler in Allegro CL 6.2 was made available
with a patch released around June 7, 2004. Use sys:update-allegro to download the
patch. The new module is named :regexp2 (you have the patch if
there is a regexp2.fasl in the
code/ subdirectory of the Allegro directory).

The new regexp2 module has roughly the same API as the older
regexp module (which came with the original Allegro CL 6.2
release), except the functions have slightly different names, to avoid
name conflicts. The functions are named by symbols in the
excl package, and are named compile-re, match-re,
split-re, and replace-re. (The older
module has functions compile-regexp, match-regexp, split-regexp, and replace-regexp.) Both modules can be loaded
and used at the same time in a running Lisp.

The new regexp2 module provides a fast, mostly-Perl-compatible
regular expression matcher. It handles Unicode character set
(UCS-2). The module was made available by a patch on approximately
June 7, 2004. Download this patch (along with other patches) in the
usual fashion with sys:update-allegro. Symbols in the regexp2
module are in the excl package.

You load the regexp2 module with the form

(require :regexp2)

All the functionality of the regexp2 module is described in this
document. There are no separate documentation pages for operators
etc. in the module at this time. Note that there are documentation
pages for operators etc. in the older regexp module (described in
section Section 3.0 The older regexp API below in this
document).

Like Perl, there are four types of 'mode switches' that affect the
meaning of regular expressions. The switch can be specified by the
keyword arguments passed to regexp APIs, or by embedding 'flags' in
the regular expression itself. The following table lists the valid
mode switches.

Treats the input string as multiple lines. If turned on, "^" and "$"
matches the start and end of any line instead of the beginning and end
of the whole string.

s

:single-line

Treats the input string as a single line. If turned on, "." matches
any character, even a newline. Normally "." matches any character
except a newline.

x

:ignore-whitespace

Extend syntax. Whitespace in the regular expression is ignored, and
comments can be inserted, to increase legibility of the regular
expression.

Within a regular expression, a mode switch can be turned on/off
locally by (?:...) construct. For example,
(?i:foo) makes foo match case
insensitively. (?-i:foo) makes
foo match case sensitively. You can combine
multiple flags, like (?im-sx:foo). A construct
like (?i) changes the mode switch until the end of
the current grouping.

A word-constituent character. Currently, alphanumeric characters, #\_,
and any character whose code is larger than 128 is considered as a
word constituent character. The treatment of large character set may
be improved in the later versions.

\W

A non-word-constituent character

Character classes

Construct

Matches

[abc]

a, b, or c

[^abc]

any character except a, b, or c

[a-zA-Z]

character range (a through z or A through Z)

\n, \nn, \mnn, \x, \xh, \xhh, \x{h..} \\, \a, \e, \f, \t, \n, \r, \cx

Octal and hexadecimal character notation, and the above single
character escape, are valid in character class as well.

other

\d, \D, \s, \S, \w, and \W can also be used within character class
notation, e.g. [abc\d] == [abc0-9].

Assertions

Construct

Matches

^

Beginning of a string. (if 'm' flag is on, matches at the beginning
of a string or just after a linefeed).

$

End of a string. (if 'm' flag is on, matches at the end of a string
or just before a linefeed).

Grouping and special constructs

Groups X, and captures submatch, assigning name as the name of the
submatch. Name must consist of a sequence of word-constituent
characters.

(?:X)

Groups X, but does no captures.

(?imsx-imsx:X)

Groups X, with turning on/off the given flags.

(?imsx-imsx)

Successfully matches nothing, and changes the given flags within the
current group.

(?=X)

Zero-width positive look-ahead

(?!X)

Zero-width negative look-ahead

(?<=X)

Zero-width positive look-behind

(?<!X)

Zero-width negative look-behind

(?>X)

Matches X, and never backtrack. (also known as "standalone" group)

(?(n)Y)

If submatch N has a value, match Y.

(?(n)Y|Z)

If submatch X has a value, match Y; otherwise match Z.

(?(?=X)Y|Z)
(?(?!X)Y|Z)
(?(?<=X)Y|Z)
(?(?<!X)Y|Z)

Try the look-ahead or look-behind assertion of X, and if succeeds,
match Y; otherwise match Z. '|Z' part can be omitted.

Back reference

Construct

Matches

\n

Matches the n-th submatch (n>=1). If n-th submatch doesn't have a
value, it doesn't match anything (match just fails). If n is more than
one digit, it becomes a back reference only if there are that many
capturing groups in the regular expressions; otherwise, it is
interpreted as an octal character notation if possible.

\k<name>

Matches the submatch with name. If there are more than one submatches
that has the name, this tries to match the one that has the biggest
number first, and if it fails, the smaller numbered submatches are
tried.

Capturing submatches (X) and
(?<name>X) are numbered in the order of its
opening parenthesis from left to right. Named submatch is counted the
same as unnamed submatch, and can be back-referenced by both name and
number.

When the input string matches X, the portion of the input string is saved.
It can be referenced within a regular expression, by the back reference
construct, or can be returned to the user program as a submatch.

If the capturing construct matches more than once, it saves the last
match.

Most functions that accept a regexp string such as match-re and compile-re also accept a regexp tree. A regexp
tree is an s-expression with the syntax described below. The syntax
was defined by CL-PPCRE and is intended to be compatible with it. This
documentation was taken from the CL-PPCRE documentation, available as
http://www.weitz.de/cl-ppcre/.

Programmers are usually most familiar with regexp string syntax, and
it suffices for many normal regexp applications. However, string
syntax does not scale very well -- complex regexps are hard to write
and even harder to parse. The frequent need for backslach escapes is
a further complication. In such cases, programmers may find tree
syntax easiler to code in Lisp source code editors with their autmatic
indentation and parentheses matching.

Further, in any application that generates regular expressions on the
fly will undoubtedly find it easier to generate s-expr tress than
trying to perform and extra level of encoding into string syntax, only
to force the regexp system immediately to parse the string back.

Tree syntax is as follows:

Every string and character is a parse tree and is treated literally as
a part of the regular expression, i.e. parentheses, brackets,
asterisks and such aren't special.

The symbol :void is equivalent to the empty string.

The symbol :everything is equivalent to Perl's dot,
i.e it matches everything (except maybe a newline character depending
on the mode).

The symbols :word-boundary and
:non-word-boundary are equivalent to Perl's "\b"
and "\B".

The symbols :case-insensitive-p,
:case-sensitive-p,
:multi-line-mode-p,
:not-multi-line-mode-p,
:single-line-mode-p, and
:not-single-line-mode-p are equivalent to Perl's
embedded modifiers "(?i)", "(?-i)", "(?m)", "(?-m)", "(?s)", and
"(?-s)". As usual, changes applied to modes are kept local to the
innermost enclosing grouping or clustering construct.

(:flags {<modifier>}*) where <modifier>
is one of the modifier symbols from above is used to group modifier
symbols. The modifiers are applied from left to right. (This construct
is obviously redundant. It is only there because it's used by the
parser.)

(:sequence {<parse-tree>}*) means a sequence
of parse trees, i.e. the parse trees must match one after
another. Example: (:sequence #\f #\o #\o) is
equivalent to the parse tree "foo".

(:group {<parse-tree>}*) is like
:sequence but changes applied to modifier flags
(see above) are kept local to the parse trees enclosed by this
construct. Think of it as the S-expression variant of Perl's
"(?:<pattern>)" construct.

(:alternation {<parse-tree>}*) means an
alternation of parse trees, i.e. one of the parse trees must
match. Example: (:alternation #\b #\a #\z) is
equivalent to the Perl regex string "b|a|z".

(:branch <test> <parse-tree>) is for
conditional regular expressions with the syntax

(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)

<test> is either a number which stands for a register or a parse
tree which is a look-ahead or look-behind assertion. If the parse-tree
following the condition is an alteration, it must enclose exactly one
or two parse trees where the second one (if present) will be treated
as the "no-pattern" - in all other cases <parse-tree> will be
treated as the "yes-pattern", executed if the condition matched. If
the condition refers to a register, the condition succeeds if that
register already succeeded in matching somehing.

(:positive-lookahead|:negative-lookahead|:positive-lookbehind|:negative-lookbehind
<parse-tree>) -- the meaning should be pretty obvious.

(:greedy-repetition|:non-greedy-repetition
<min> <max> <parse-tree>) where <min> is a
non-negative integer and <max> is either a non-negative integer
not smaller than <min> or nil will
result in a regular expression which tries to match <parse-tree>
at least <min> times and at most <max> times (or as often
as possible if <max> is nil). So, e.g.,
(:non-greedy-repetition 0 1 "ab") is equivalent to
the Perl regex string "(?:ab)??".

(:register <parse-tree>) is a capturing
register group. As usual, registers are counted from left to right
beginning with 1.

(:back-reference <number>) where
<number> is a positive integer is a back-reference to a register
group. (:char-class|:inverted-char-class
{<item>}*) where <item> is either a character, a
character range, or a symbol for a special character class (see above)
will be translated into a (one character wide) character class. A
character range looks like (:range <char1>
<char2>) where <char1> and <char2> are
characters such that (char<= <char1>
<char2>) is true. Example:
(:inverted-char-class #\a (:range #\d #\g)
:digit-class) is equivalent to the Perl regex string
"[^aD-G\d]".

There is a small region of ambiguity between string syntax and tree
syntax. Although a single string is a valid parse tree, it will be
interpreted as a Perl regexp strings when given to compile-re and
friends. To circumvent this you can use the equivalent parse tree
(:GROUP <string>). See also quote-re.

If you want to find out how parse trees are related to Perl regex
strings you should play around with parse-re - a function which converts a Perl
regexp strings to a parse tree.

Traditionally, '{' and '}' characters that do not consist of a valid
repetition syntax are taken literally. That is, a regular expression
"x{1,3,4}" matches the string "x{1,3,4}", and a regular expression
"a{" matches the string "a{". Currently, these regular expressions
raise a syntax error in our regexp library.

Theoretically ACL's regexp library uses the same mechanism that Perl
and CL-PPCRE are using: Nondeterministic finite automaton (NFA). When
there are multiple possibilies of matching, it "remembers" the current
state and tries each possibility one at a time. If a trial fails, it
backs up the last saved state and tries the next possibility; that is
called a "backtrack". It is possible to compose a very short regular
expression that does a huge number of backtracks; if you have nested
repetitions and alternations, the number of required backtracks grows
exponentially.

The regexp optimizer tries to reduce the number of backtracks, but it is
not always possible. Here are some tips to improve the performance
of the matcher.

If applicable, use of standalone group (?>X) in
the inner loop. It effectively turns the nested repetition into
unnested repetition from the viewpoint of NFA. Lots of optimizations
can be done for unnested repetition, but not much can be done for
a nested one.

If the input string is very long, it is a good idea to split it up by
non-backtracking regexp first, then apply more compilicated regexp to
the small chunks. Non-backtracking NFA runs in constant space and
linear time to the input length. Single-nested backtracking NFA runs
in linear space and quadratic time to the input length.

Note that an unnested greedy repetition followed by a character or a
character set that are exclusive to the beginning of the repetition
becomes a non-backtracking regexp; for example, the regular expression
(\w+)\s+ runs without backtracking.

A regular expression is a pattern that matches one or more
strings. Technically, the patterns we support go beyond regular
expressions, however we'll continue to call them regular
expressions. Allegro CL provides regular expression (regexp)
functionality as described in this document.

In a regular expression string, characters are either normal
or special. A normal character stands for itself, a special
character has a meaning described below.

There are two contexts in a regular expression: characters within a
[...] expression and those elsewhere.

Special characters outside a [...] are: * . [ ] \ + ^
$ (asterisk, period, left bracket, right bracket, backslash,
plus sign, circumflex accent, and dollar sign).
These characters can be made normal characters by preceding them
with a backslash. Some of these characters are only special in
certain places in the string (see below for the details).

Special characters within a [...] are: ^ ].
These characters can only be made non-special by placing them in
a certain position in the [...] expression (more details
below). Note that backslash is not a special character in this
context.

Where x and y are regular expressions,
matches the concatenation of the
regular expressions.

m*

A single character regular expression followed by *
matches zero or more occurrences
of m. If there is a choice, it always matches
the longest sequence of m's.

m+

A single character expression followed by + matches one
or more occurrences of m.
If there is a choice, it always matches the longest
sequence of m's.

.

A period matches any character (except newline--see notes
on the match-regexp
function).

^

If this is the first character of the regular
expression string it forces the match to
start at the beginning of the to-be-matched string.
If this character appears after the
beginning of the string it stands for itself.

$

If this is the last character of the regular expression
string it forces the match to
end at the end of the to-be-matched string. If this
character appears before the end of
the regular expression string, it stands for itself.

[..]

This matches exactly one character from the set of
characters denoted by the pattern
inside the brackets. This pattern has a different form
than elsewhere in the regular
expression: [abcs-y3-8] matches
either a, b
or c,
or s through y,
or 3 through 8. You can
invert the set using the caret as the first character.
[^a-z] matches any
character not in the range a through
z. In order to include the
right bracket in the set it has to be listed first
(or after the caret): []ab]
matches a, b
or the right bracket. [^]ab] matches
any character except a,
b or the right bracket. In order to
match a hyphen it has to be first or last:
[b-] matches b or a
hyphen. In order to match a caret it has to be
somewhere other than the first character. [b^]
matches b or a caret.

\(x\)

This grouping syntax matches whatever x
matches, and at the same time
remembers what x matches. There can be up to
9 groups defined in a regular
expression string. Each group is given a number from 1
to 9 based on the order in which
they appear in the pattern string.

x\|y

This tries to match x, and if that fails
it tries to match y. To
control what constitutes x and y
you can use the \( \)
grouping. For example, abc\|def
means abc or def
whereas a\(bc\|de\)f means
a followed by bc
or de
followed by f.

\n

If n is 1 through 9 then this stands for
the string matched by group 1
through 9. If there is no string assigned to group
n then this is match failure.
There is no group 0 so the form \0 is illegal
and an error is signaled when the regular
expression is parsed.

\w

Matches a word character. It is equivalent
to [a-zA-Z0-9].

\W

Matches anything but a word character. It is
equivalent to [^a-zA-Z0-9].

\b

Matches a blank character (one of space, form feed,
tab and newline).

\B

matches anything but a blank character (one of space,
form feed, tab and newline).

\a

For any character a not mentioned above stands for
a itself. It is unwise to
put in extra backslashes since while \x may stand for just x
today, in the future it may
have a different meaning.

When typing a regular expression in Lisp source code keep in mind that
in order to represent a backslash in a string constant you need two
backslashes. The Lisp reader reads "foo\+" as
"foo+", when what you probably wanted was "foo\\+"
(where you are putting a backslash in front of the + to remove its
special meaning so you could match the string foo+.)

The + and * characters must follow a single character regular
expression. They cannot follow a group expression, even if that group
matches just one character. In other words \(a\)* is not
legal. [a]* is legal since the [..]
expression always matches one character.

Compatibility with other regular expression parsers

Various operating systems and other utilities provide regular expression
parsers. Here are some notes on some of them:

The UNIX version (on Solaris) supports x\{m,n\}
meaning between m and n
occurrences of x. It also supports
\< and \> as matching word
beginnings and word endings, where a word is a C identifier. The UNIX
version does not support the + special character
(possibly since you can get it with \{1,\}).

The Linux version supports x? meaning 0 or 1
occurrence of x. We can get that with \(x\|\).

match-regexp

Function

Package: excl

The regexp argument is a regular
expression object (the result of regexp-compile) or
it is a string (in which case it will be compiled into a regular
expression object). The match-string is a string to
match against the regular expression. The function will attempt to
match the regular expression against the
match-string starting at the first character of the
match-string, and if that fails it will look inside
the match-string for a match (unless the regular
expression begins with a caret).

The keyword arguments are:

:newlines-special

If true then a newline will not match the . [i.e. the dot]
regular expression. This is useful to prevent multiline matches.

:case-fold

If true then the match-string
is effectively mapped to
lower case before doing the match. Thus lower case
characters in the regular expression
match either case and upper case characters match only
upper case characters.

:return

A failed match returns
nil. If the
value of return is :string then
a
successful match returns multiple values. The first
value is t.
The second value is the substring of the
match-string that
matched the regular expression. The third value
(if any) is the substring that matched
group 1. The fourth value is the substring that
matched group 2. And so on. If you use the
\| form, then some groups may have no associated
match in which case nil
will be returned as that value. In highly nested \| forms, a
group may return a match string when in the final match
that group had no match.

If the
value of return is
:index then it is just like
:string except that instead of the
strings being returned, a
cons is returned giving the start and
end indices in the original match-string
of the match. The end index is one greater than the
last character in the substring.

If the value of return is
nil then the one value
t is returned when the match succeeds.

:start

The first character in the match-string to match
against.

:end

One past the last character in the match-string to
match against.

shortest

This makes match-regexp
return the shortest rather than the longest
match. One motivation for this is parsing html. Suppose you
want to search for the next
item in italics, which in html looks like
<i>foo</i>
. Suppose your string is
"<i>foo</i> and <i>bar</i>".
The following example shows the difference:

Compilation note: there is a compiler macro
defined for match-regexp that will handle in a
special way match-regexp calls where the first
argument is a constant string. That is, this form
(match-regexp "foo" x) will compile to
code that will arrange to call compile-regexp on
the string when the code is fasled in. Since the cost of
compile-regexp is high, this saves a lot of
time.

Copyright (c) 1998-2009, Franz Inc. Oakland, CA., USA. All rights reserved.Documentation for Allegro CL version 8.1. This page was not revised from the 8.0 page.Created 2009.7.29.