Introduction

What is a regular expression?

A regular expression is a notation for describing a pattern that strings may match. Regular expressions make it possible to locate and modify the strings within textual data that match a particular pattern, and they are widely used in utility programs and programming languages that manipulate text.

Example applications

Various software applications use regular expressions to locate, select or modify particular sections of text. For example, a regular expression could be used to:

replace the word "snake" with the word "serpent" throughout an entire piece of text

locate pieces of text containing the words "fox" and "sheep" on the same line
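As a rough sketch of these two tasks, the snippet below uses Python's standard re module. The sample text and the variable names are invented for illustration, and other tools would spell the same ideas slightly differently.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = (
    "The snake watched the fox.\n"
    "A fox chased a sheep over the hill.\n"
    "The sheep ignored the snake.\n"
)

# Replace the whole word "snake" with "serpent" everywhere.
# \b marks a word boundary, so a longer word such as "snakes" is left alone.
replaced = re.sub(r"\bsnake\b", "serpent", text)

# Keep only the lines that contain both "fox" and "sheep", in either order.
both = [
    line
    for line in text.splitlines()
    if re.search(r"\bfox\b", line) and re.search(r"\bsheep\b", line)
]

print(replaced)
print(both)
```

Testing each word separately keeps the second task independent of the order in which the two words appear on the line.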

Regular expression components

Regular expressions are made up of three types of components:

anchors used to specify the position of the pattern in relation to a line of text.

character sets used to match one or more characters in a single position.

modifiers used to specify how many times a character set is repeated.

Syntax varies across application programs

The syntax of regular expressions varies across application programs. For example, the shell uses a limited form of regular expression, called shell regular expressions, for filename substitution, whereas AWK uses a superset of the extended regular expression syntax.

Supporting Software

Regular expressions are supported by various software tools, including command line tools, plain text editors and programming languages. Most of these tools are available for various computing platforms, including Linux, Windows and Mac OS X. The tools use slightly different syntax styles. Let's look at some notable ones.

The tools that use regular expressions are listed below:

Command line tools: grep, egrep, sed

Plain text editors: ed, vi, Emacs

Programming languages: .NET, Awk, Java, JavaScript, Perl, PHP, Python, Ruby, Tcl

A regular expression can be considered to be a little computer program that finds or isolates a subset of a larger set of text. In the same way that an ordinary computer program needs a computer to execute it, a regular expression needs a software application to interpret it — to give it meaning.

For example, a regular expression can be used to tell an editor to find the next occurrence of the word "Chapter" followed by several spaces and digits. Or you can use a regular expression to tell the UNIX grep command to show only those lines of a file that contain the word "Wiki" followed by either the word "Books" or the word-fragment "pedia". We will discuss the exact syntax of such regular expressions in the next chapter.
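Without going into the syntax yet, the two searches just described might look as follows in Python's re module. This is only a sketch: the input lines are made up, and an editor or grep would write the patterns in its own flavour.

```python
import re

# "Chapter", then one or more spaces, then one or more digits.
chapter = re.compile(r"Chapter +[0-9]+")

# "Wiki" followed by either "Books" or "pedia".
wiki = re.compile(r"Wiki(Books|pedia)")

# Hypothetical input lines, invented for this illustration.
lines = [
    "See Chapter   12 for details.",
    "WikiBooks hosts free textbooks.",
    "Wikipedia is an encyclopedia.",
    "A wiki is a kind of website.",
]

# First occurrence of the chapter pattern in the first line.
first_chapter = chapter.search(lines[0])

# Lines that contain "WikiBooks" or "Wikipedia", as grep would print them.
wiki_lines = [line for line in lines if wiki.search(line)]

print(first_chapter.group())
print(wiki_lines)
```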

Syntaxes

There are several variants of regular expressions. These variants differ not only in their concrete syntax but also in their capabilities. Individual tools that support regular expressions also have their own peculiarities.

Greedy expressions

Quantifiers such as * and + match as much as they can: they are greedy. For some uses, this greediness is unwanted. For example, let us assume you want to find the first string enclosed in quotation marks in the following text:

These words include "cat", "mat", and "pat".

The pattern ".*" matches everything from the first quotation mark to the last, that is, "cat", "mat", and "pat", instead of the desired "cat".

To fix this, some flavours of regular expressions provide non-greedy operators such as *?, +?, and }?. In PHP, adding a "U" at the end of the regexp makes the quantifier non-greedy, as in /".*"/U. In flavours that support neither of the two options, you can specify what is not to be matched, as in ("[^"]*") to fix the discussed example. However, when dealing with bracketed expressions, (\[\[[^\]]*\]\]) fails to match on A B C [[D E] F G]].
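The difference can be tried out with Python's re module, which is one of the flavours that provide the non-greedy *? operator. The snippet is a minimal sketch that reuses the sample sentence from above; the variable names are illustrative only.

```python
import re

text = 'These words include "cat", "mat", and "pat".'

# Greedy: .* runs to the last quotation mark on the line.
greedy = re.search(r'".*"', text).group()

# Non-greedy: .*? stops at the first closing quotation mark.
lazy = re.search(r'".*?"', text).group()

# Negated character set: works even without non-greedy operators.
negated = re.search(r'"[^"]*"', text).group()

# The bracketed case from the text: [^\]]* cannot cross the single "]"
# after "E", so the pattern finds nothing at all.
bracketed = re.search(r"(\[\[[^\]]*\]\])", "A B C [[D E] F G]]")

print(greedy)
print(lazy)
print(negated)
print(bracketed)
```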

Comparison table

A comparison table showing which regular expression features and flavors each tool or programming language supports is available from regular-expressions.info.

Simple Regular Expressions

The Simple Regular Expression syntax is widely used on Unix-based systems for the purposes of backwards compatibility. Most regular-expression-aware Unix utilities, such as grep and sed, use it by default while providing support for extended regular expressions with command line arguments (see below). This syntax is deprecated on POSIX-compliant systems and should not be used by new utilities.

[ ]

Boxes enable a single character to be matched against a character list or character range.

[^ ]

A complement box enables a single character not within a character list or character range to be matched.

^

A caret anchor matches the start of the line (or any line, when applied in multiline mode).

$

A dollar anchor matches the end of the line (or any line, when applied in multiline mode).

( )

Parentheses are used to define a marked subexpression. The matched section of text can be recalled later.

\n

Where n is a digit from 1 to 9; matches what the nth marked subexpression matched. This irregular construct has not been adopted in the extended regular expression syntax.

*

A single character expression followed by "*" matches zero or more copies of the expression. For example, "ab*c" matches "ac", "abc", "abbbc" etc. "[xyz]*" matches "", "x", "y", "zx", "zyx", and so on.

\n*, where n is a digit from 1 to 9, matches zero or more iterations of what the nth marked subexpression matched. For example, "(a.)c\1*" matches "abcab" and "abcabab" but not "abcac".

An expression enclosed in "\(" and "\)" followed by "*" is deemed to be invalid. In some cases (e.g. /usr/bin/xpg4/grep of SunOS 5.8), it matches zero or more iterations of the string that the enclosed expression matches. In other cases (e.g. /usr/bin/grep of SunOS 5.8), it matches what the enclosed expression matches, followed by a literal "*".
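The back-reference example above can be checked with Python's re module. Note the assumption: Python does not implement the Simple Regular Expression flavour, but it happens to accept the pattern (a.)c\1* with the same meaning, so it serves as a convenient stand-in. fullmatch is used because the text speaks of matching the whole string.

```python
import re

# (a.) captures "a" plus any one character; \1* then allows zero or more
# further copies of exactly the text that was captured.
pattern = re.compile(r"(a.)c\1*")

candidates = ["abcab", "abcabab", "abcac"]
results = {s: pattern.fullmatch(s) is not None for s in candidates}

print(results)
```

The string "abcac" is rejected because the group captured "ab", and "ac" is not a copy of it.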

Examples


"^[hc]at"

Matches hat and cat but only at the beginning of a line.

"[hc]at$"

Matches hat and cat but only at the end of a line.

Use in Tools

Tools and languages that utilize this regular expression syntax include:

Basic Regular Expressions

Basic Regular Expressions: Note that particular implementations of regular expressions interpret the backslash symbol differently in front of some of the metacharacters. For example, egrep and Perl interpret unbackslashed parentheses and vertical bars as metacharacters, reserving the backslashed versions to mean the literal characters themselves. Old versions of grep did not support the pipe alternation operator.

Operators

Operator

Effect

.

The dot operator matches any single character.

[ ]

Boxes enable a single character to be matched against a character list or character range.

[^ ]

A complement box enables a single character not within a character list or character range to be matched.

*

An asterisk matches zero or more occurrences of the preceding element.

^

The caret anchor matches the beginning of the line.

$

The dollar anchor matches the end of the line.

Examples:

Example

Match

".at"

any three-character string like hat, cat or bat

"[hc]at"

hat and cat

"[^b]at"

all the matched strings from the regex ".at" except bat

"^[hc]at"

hat and cat but only at the beginning of a line

"[hc]at$"

hat and cat but only at the end of a line

Many ranges of characters depend on the chosen locale setting (e.g., in some settings letters are organized as abc..yzABC..YZ, while in others as aAbBcC..yYzZ).

The POSIX Basic Regular Expressions syntax provided extensions for consistency between utility programs such as grep, sed and awk. These extensions are not supported by some traditional implementations of Unix tools.

Use in Tools

Perl-Compatible Regular Expressions

Perl has a richer and more predictable syntax than even the POSIX Extended Regular Expressions syntax. One example of its predictability is that \ always quotes a non-alphanumeric character. One example of something that can be specified in Perl but not in POSIX is whether part of the match should be greedy. For instance, in the pattern /a.*b/, the .* matches as much as it can, while in the pattern /a.*?b/, the .*? matches as little as it can. Given the string "a bad dab", the first pattern matches the whole string, and the second matches only "a b".

For these reasons, many other utilities and applications have adopted syntaxes that look a lot like Perl's. For example, Java, Ruby, Python, PHP, exim, BBEdit, and even Microsoft's .NET Framework all use regular expression syntax similar to that used in Perl. Not all "Perl-compatible" regular expression implementations are identical, and many implement only a subset of Perl's features.

Examples

Conventions used in the examples: The character 'm' is not always required to specify a Perl match operation. For example, m/[^abc]/ could also be rendered as /[^abc]/. The 'm' is only necessary if the user wishes to specify a match operation without using a forward slash as the regex delimiter. Sometimes it is useful to specify an alternate regex delimiter in order to avoid "delimiter collision". See 'perldoc perlre' for more details.

Links

POSIX Basic Regular Expressions

The POSIX Basic Regular Expression (BRE) syntax provided extensions to achieve consistency between utility programs such as grep, sed and awk. These extensions are not supported by some traditional implementations of Unix tools.

History

Traditional Unix regular expression syntax followed common conventions that often differed from tool to tool. The POSIX Basic Regular Expressions syntax was developed by the IEEE, together with an extended variant called Extended Regular Expression syntax. These standards were designed mostly to provide backward compatibility with the traditional Simple Regular Expressions syntax, providing a common standard which has since been adopted as the default syntax of many Unix regular expression tools.

Syntax

In POSIX Basic Regular Expression syntax, most characters are treated as literals — they match only themselves (e.g., a matches "a"). The exceptions, listed below, are called metacharacters or metasequences.

Metacharacter

Description

.

Matches any single character (many applications exclude newlines, and exactly which characters are considered newlines is flavor, character encoding, and platform specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches "abc", etc., but [a.c] matches only "a", ".", or "c".

[ ]

A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c", and [a-z] specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: [abcx-z] matches "a", "b", "c", "x", "y", or "z", as does [a-cx-z].

The - character is treated as a literal character if it is the last or the first character within the brackets: [abc-], [-abc]. The ] character can be included in a bracket expression if it is the first character: []abc]. The bracket expression may also contain character classes, equivalence classes, and collating characters.

[^ ]

Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c", and [^a-z] matches any single character that is not a lowercase letter from "a" to "z". These forms can be mixed: [^abcx-z] matches any character other than "a", "b", "c", "x", "y", or "z".

The - character is treated as a literal character if it is the last character or the first character after ^: [^abc-], [^-abc]. The ] character is treated as a literal character if it is the first character after ^: [^]abc]. The expression may also contain character classes, equivalence classes, and collating characters.

^

Matches the starting position within the string, if it is the first character of the regular expression.

$

Matches the ending position of the string, if it is the last character of the regular expression.

\{m,n\}

Matches the preceding element at least m and not more than n times. For example, a\{3,5\} matches only "aaa", "aaaa", and "aaaaa". This is not found in a few older instances of regular expressions.

BRE: \( \)   ERE: ( )

Defines a subexpression. It is treated as a single element. For example, ab* matches "a", "ab", "abb" and so on, while \(ab\)* matches "", "ab", "abab", "ababab", and so on. The string matched within the parentheses can be recalled later (see the next entry, \n). A subexpression is also called a marked subexpression, a block or a capturing group.

BRE only: \n

Matches what the nth marked subexpression matched, where n is a digit from 1 to 9. This construct is theoretically irregular (an expression with this construct does not obey the mathematical definition of regular expression), and was not adopted in the POSIX ERE syntax.

Examples:

.at matches any three-character string ending with "at", including "hat", "cat", and "bat".

[hc]at matches "hat" and "cat".

[^b]at matches all strings matched by .at except "bat".

^[hc]at matches "hat" and "cat", but only at the beginning of the string or line.

[hc]at$ matches "hat" and "cat", but only at the end of the string or line.

\[.\] matches any single character surrounded by "[" and "]" since the brackets are escaped, for example: "[a]" and "[b]".
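These examples can be tried out in Python's re module. This is only an approximation of POSIX BRE: the simple patterns behave the same way, but Python writes the interval a\{3,5\} as a{3,5}. The word list and the helper function found_in are invented for this sketch.

```python
import re

words = ["hat", "cat", "bat", "chat", "hatch"]

def found_in(pattern, candidates):
    """Return the candidates that contain a match for the pattern."""
    return [w for w in candidates if re.search(pattern, w)]

any_at = found_in(r".at", words)        # any character, then "at"
h_or_c = found_in(r"[hc]at", words)     # "hat" or "cat" anywhere
not_b = found_in(r"[^b]at", words)      # "at" preceded by anything but "b"
at_start = found_in(r"^[hc]at", words)  # anchored at the beginning
at_end = found_in(r"[hc]at$", words)    # anchored at the end

# Escaped brackets are literals, so this finds one character inside "[" "]".
bracketed = re.findall(r"\[.\]", "x[a]y[b]z[cd]")

# Interval: only runs of three to five "a" characters match in full.
lengths = [n for n in range(1, 8) if re.fullmatch(r"a{3,5}", "a" * n)]

for name, value in [("any_at", any_at), ("h_or_c", h_or_c), ("not_b", not_b),
                    ("at_start", at_start), ("at_end", at_end),
                    ("bracketed", bracketed), ("lengths", lengths)]:
    print(name, value)
```

Because re.search looks for the pattern anywhere in the string, the unanchored patterns also find "hat" inside "chat" and "hatch"; only the anchored versions tell those words apart.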

Character classes

The POSIX standard defines some classes or categories of characters as shown below. These classes are used within brackets.

Collating symbols

Collating symbols, like character classes, are used in brackets and have the form [.ch.]. Here ch is a digraph. Collating symbols are defined by the locale.

Equivalence classes

Equivalence classes, like character classes and collating symbols, are used in brackets and have the form [=a=]. They stand for any character that is equivalent to the given character.

According to the standard[1], for example, if 'a', 'à', and 'â' belong to the same equivalence class, then "[[=a=]b]", "[[=à=]b]", and "[[=â=]b]" are each equivalent to "[aàâb]".

Equivalence classes, like collating symbols, are defined by the locale.

External links

Use in Tools

POSIX-Extended Regular Expressions

The more modern "extended" regular expressions can often be used with modern Unix utilities by including the command line flag "-E".

POSIX extended regular expressions are similar in syntax to the traditional Unix regular expressions, with some exceptions. The following metacharacters are added:

Metacharacter

Description

.

Matches any single character (many applications exclude newlines, and exactly which characters are considered newlines is flavor, character encoding, and platform specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches "abc", etc., but [a.c] matches only "a", ".", or "c".

[ ]

A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: [abcx-z] matches "a", "b", "c", "x", "y", or "z", as does [a-cx-z].

The - character is treated as a literal character if it is the last or the first (after the ^) character within the brackets: [abc-], [-abc]. Note that backslash escapes are not allowed. The ] character can be included in a bracket expression if it is the first (after the ^) character: []abc].

[^ ]

Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter from "a" to "z". As above, literal characters and ranges can be mixed.

^

Matches the starting position within the string. In line-based tools, it matches the starting position of any line.

$

Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.

BRE: \( \)   ERE: ( )

Defines a marked subexpression. The string matched within the parentheses can be recalled later (see the next entry, \n). A marked subexpression is also called a block or capturing group.

\n

Matches what the nth marked subexpression matched, where n is a digit from 1 to 9. This construct is theoretically irregular and was not adopted in the POSIX ERE syntax. Some tools allow referencing more than nine capturing groups.

Links

Non-POSIX Basic Regular Expressions

Non-POSIX Basic Regular Expression syntax: An additional non-POSIX class understood by some tools is [:word:], which is usually defined as [:alnum:] plus underscore. This class reflects the fact that in many programming languages these characters may be used in identifiers.

Operators

Operator

Effect

.

The dot operator matches any single character.

[ ]

Boxes enable a single character to be matched against a character list or character range.

[^ ]

A complement box enables a single character not within a character list or character range to be matched.

*

An asterisk matches zero or more occurrences of the preceding element.

^

The caret anchor matches the beginning of the line.

$

The dollar anchor matches the end of the line.

The editor vim further distinguishes word and word-head classes (using the notation \w and \h) since in many programming languages the characters that can begin an identifier are not the same as those that can occur in other positions.
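The word and word-head distinction can be imitated in Python's re module. This is a sketch under stated assumptions: Python has no word-head class, so it is spelled out here as [A-Za-z_], and the re.ASCII flag restricts \w to letters, digits and underscore. The sample source line is made up.

```python
import re

# An identifier: one word-head character, then any number of word characters.
identifier = re.compile(r"[A-Za-z_]\w*", flags=re.ASCII)

# Hypothetical line of program text, invented for this illustration.
source = "total_count = item2 + 3*x"

identifiers = identifier.findall(source)

print(identifiers)
```

The digit 3 is skipped because a digit is a word character but not a word-head character.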

(For an ASCII chart color-coded to show the POSIX classes, see ASCII.)

Use in Tools

Tools and languages that utilize this regular expression syntax include:


Links

Shell Regular Expressions

The Unix shell recognises a limited form of regular expressions used with filename substitution:

Operators

Operator

Effect

?

The hook operator matches any single character.

[ ]

Boxes enable a single character to be matched against a character list or character range.

[! ]

A complement box enables a single character not within a character list or character range to be matched.

*

An asterisk specifies zero or more characters to match.

?(pattern-list)

Matches zero or one occurrence of the given patterns.

*(pattern-list)

Matches zero or more occurrences of the given patterns.

+(pattern-list)

Matches one or more occurrences of the given patterns.

@(pattern-list)

Matches exactly one of the given patterns.

!(pattern-list)

Matches anything except one of the given patterns.

Note that the latter five constructs can only be used in bash and only if the extglob option has been enabled using the bash-builtin "shopt".

Some operators behave differently in the shell: the asterisk and hook operators do not need to follow a previous character, and they do not behave as they do in traditional regular expressions.

Unsupported constructs: Within the shell, a complement box is formed using the pling symbol (!). The shell does not support the use of a careted box for character list exclusion; a caret symbol within a box is simply treated as one of the characters in the character list.
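The basic shell operators can be explored outside a shell with Python's fnmatch module, which implements shell-style wildcards. This is only a sketch: fnmatch is assumed here to be a reasonable stand-in for the shell's own matching, it does not support the extglob pattern lists, and the file names are invented.

```python
import fnmatch

# Hypothetical file names, invented for this illustration.
names = ["a.txt", "b.txt", "ab.txt", "a.log", "notes.md"]

# ? stands for exactly one character.
single = fnmatch.filter(names, "?.txt")

# * stands for zero or more characters and needs no preceding element.
any_txt = fnmatch.filter(names, "*.txt")

# A box matches one character from the list.
boxed = fnmatch.filter(names, "[ab].*")

# A complement box is written with "!", not "^".
not_a = fnmatch.filter(names, "[!a]*")

print(single)
print(any_txt)
print(boxed)
print(not_a)
```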

Use in Tools

Tools and languages that utilize this regular expression syntax include:

Bourne compatible shells

Links

Implementation

Implementations and running times

There are at least three different algorithms that decide whether (and how) a given string matches a regular expression. They are based on different representations of the regular expression as a finite automaton and on the amount of functionality present in the matcher.

An NFA based matcher without back-references and look ahead/behind. An input of size O(n) can be tested against a regular expression of size O(m) in time O(nm), and additional O(m) extra space by simulating an NFA using Thompson's algorithm. If c sub-match capture groups are to be recorded, then the running time increases to O(nm log c), but the space requirement remains O(m).

An NFA based matcher with back-references and look ahead/behind. Such a matcher needs to be implemented using backtracking. An input of size O(n) can be tested against a regular expression of size O(m) in time O(2^(mn)) using backtracking. Some effort is needed to ensure that the backtracking based matcher doesn't enter an infinite loop, testing the same path over and over again.

A DFA based matcher. DFA based matchers can't support back-references, sub-match captures, or look ahead/behind. This is the oldest and fastest kind of matcher and relies on a result in formal language theory that allows every nondeterministic finite state machine (NFA) to be transformed into a deterministic finite state machine (DFA). The algorithm performs or simulates this transformation and then runs the resulting DFA on the input string, one symbol at a time. The latter process (DFA matching) takes time that is proportional to the length of the input string. More precisely, a regular expression of size m on an input alphabet of size S can be converted into a DFA in time O(2^m S), and subsequently an input string of size n can be tested against a DFA of any size in time O(n).
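The first approach in the list, simulating an NFA by tracking the set of states it could be in, can be sketched in a few lines of Python. The NFA below is built by hand for the expression (a|aa)*b rather than generated from it, so this shows only the simulation step and is not a full implementation of Thompson's algorithm. The state names and the function nfa_match are invented for this sketch.

```python
# Hand-built NFA for (a|aa)*b. State names are invented for this sketch:
#   "loop" - at the start of an iteration of (a|aa)
#   "mid"  - after the first "a" of the alternative "aa"
#   "done" - after the final "b" (the accepting state)
TRANSITIONS = {
    ("loop", "a"): {"loop", "mid"},
    ("mid", "a"): {"loop"},
    ("loop", "b"): {"done"},
}

def nfa_match(text):
    """Return True if the whole text matches (a|aa)*b."""
    states = {"loop"}
    for ch in text:
        # Advance every live state at once. The set can never hold more
        # than the three states above, however long the input is.
        states = {nxt for s in states for nxt in TRANSITIONS.get((s, ch), ())}
        if not states:
            return False
    return "done" in states

print(nfa_match("aaab"), nfa_match("a" * 50))
```

Each input character costs work bounded by the number of states, which is where a running time proportional to both the input length and the expression size comes from.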

The DFA based algorithm is fast to match input against a regular expression, but can be used only for matching and not for recalling grouped subexpressions. There is a variant that can recall grouped subexpressions, but its running time slows down to O(n^2 m)[citation needed].

The running time of the backtracking based algorithm can be exponential, which simple implementations exhibit when matching against expressions like "(a|aa)*b" that contain both alternation and unbounded quantification and force the algorithm to consider an exponential number of subcases. More complex implementations identify and speed up various common cases where they would otherwise run slowly.
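The blow-up can be observed with a deliberately naive backtracking matcher written by hand for (a|aa)*b. This is a sketch for counting attempts, not how any particular engine is implemented, and the function name backtrack_calls is invented. On an input of only "a" characters the match must fail, and the matcher tries every way of splitting the input into "a" and "aa" pieces before giving up.

```python
def backtrack_calls(text):
    """Match (a|aa)*b against the whole text by naive backtracking.

    Returns (matched, calls), where calls counts the recursive attempts.
    """
    calls = 0

    def attempt(i):
        nonlocal calls
        calls += 1
        # Leave the loop: the rest of the text must be exactly "b".
        if text[i:] == "b":
            return True
        # Alternative 1: consume "a", then try another iteration.
        if text.startswith("a", i) and attempt(i + 1):
            return True
        # Alternative 2: consume "aa", then try another iteration.
        if text.startswith("aa", i) and attempt(i + 2):
            return True
        return False

    return attempt(0), calls

for n in (5, 10, 20):
    print(n, backtrack_calls("a" * n))
```

The number of attempts grows like the Fibonacci numbers, so doubling the input length from 10 to 20 multiplies the work by more than a hundred, while a successful match such as "aab" needs only a handful of attempts.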

Even though backtracking implementations guarantee nothing better than exponential time in the worst case, they allow much greater flexibility and provide more expressive power. For instance, any implementation that allows the use of backreferences, or implements the various improvements that Perl introduced, must use a backtracking implementation.

Some implementations try to provide the best of both algorithms by first running a fast DFA match to see if the string matches the regular expression at all, and only in that case perform a potentially slower backtracking match.