Rather than repeating the description of REs for each utility supporting REs, the standard developers preferred a common,
comprehensive description of regular expressions in one place. The most common behavior is described here, and exceptions or
extensions to this are documented for the respective utilities, as appropriate.

The BRE corresponds to the ed or historical grep type, and the ERE corresponds to the historical egrep type (now grep -E).

The text is based on the ed description and substantially modified, primarily to aid
developers and others in the understanding of the capabilities and limitations of REs. Much of this was influenced by
internationalization requirements.

It should be noted that the definitions in this section do not cover the tr utility;
the tr syntax does not employ REs.

The specification of REs is particularly important to internationalization because pattern matching is a basic operation in
business and other applications. The syntax and rules of REs are intended to be as intuitive as possible to make them
easy to understand and use. The historical rules and behavior do not provide that capability to non-English language users, and do
not provide the necessary support for commonly used characters and language constructs. It was necessary to provide extensions to
the historical RE syntax and rules to accommodate other languages.

As they are limited to bracket expressions, the rationale for these modifications is in XBD RE Bracket Expression.

It is possible to determine what strings correspond to subexpressions by recursively applying the leftmost longest rule to each
subexpression, but only with the proviso that the overall match is leftmost longest. For example, matching
"\(ac*\)c*d[ac]*\1" against acdacaaa matches acdacaaa (with \1=a); simply matching the longest match
for "\(ac*\)" would yield \1=ac, but the overall match would be smaller (acdac). Conceptually, the
implementation must examine every possible match and among those that yield the leftmost longest total matches, pick the one that
does the longest match for the leftmost subexpression, and so on. Note that this means that matching by subexpressions is
context-dependent: a subexpression within a larger RE may match a different string from the one it would match as an independent
RE, and two instances of the same subexpression within the same larger RE may match different lengths even in similar sequences of
characters. For example, in the ERE "(a.*b)(a.*b)", the two identical subexpressions would match four and six characters,
respectively, of accbaccccb.
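
The two examples above can be reproduced with a small sketch. It assumes Python's re module, which is a backtracking
(first-match) matcher rather than a POSIX one, and it writes "(...)" where the BRE uses "\(...\)". Because the leftmost-longest
overall match for the first example happens to be the entire string, fullmatch() is used as a stand-in to force that overall
match; this shortcut is an assumption that holds for these inputs only, not a general POSIX matcher.

```python
import re

# Python spelling of the BRE \(ac*\)c*d[ac]*\1 (assumption: Python `re`
# syntax stands in for POSIX syntax; its matcher is first-match, not POSIX).
pattern = re.compile(r"(ac*)c*d[ac]*\1")
text = "acdacaaa"

# First-match behavior: the group greedily takes "ac", which limits the
# overall match to "acdac".
first = pattern.search(text)
first_match = (first.group(0), first.group(1))

# Forcing the overall match to span the whole string (the leftmost-longest
# result for this input) leaves only one possible value for the group: "a".
longest = pattern.fullmatch(text)
longest_match = (longest.group(0), longest.group(1))

# The same subexpression written twice matches different lengths in context.
pair = re.fullmatch(r"(a.*b)(a.*b)", "accbaccccb")
pair_lengths = (len(pair.group(1)), len(pair.group(2)))

print(first_match)    # ('acdac', 'ac')
print(longest_match)  # ('acdacaaa', 'a')
print(pair_lengths)   # (4, 6)
```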

The definition of single character has been expanded to include also collating elements consisting of two or more characters;
this expansion is applicable only when a bracket expression is included in the BRE or ERE. An example of such a collating element
may be the Dutch ij, which collates as a 'y'. In some encodings, a ligature "i with j" exists as a character and
would represent a single-character collating element. In another encoding, no such ligature exists, and the two-character sequence
ij is defined as a multi-character collating element. Outside brackets, the ij is treated as a two-character RE and
matches the same characters in a string. Historically, a bracket expression only matched a single character. The
ISO POSIX-2:1993 standard required bracket expressions like "[^[:lower:]]" to match multi-character collating
elements such as "ij". However, this requirement led to behavior that many users did not expect and that could not
feasibly be mimicked in user code, and it was rarely if ever implemented correctly. The current standard leaves it unspecified
whether a bracket expression matches a multi-character collating element, allowing both historical and ISO POSIX-2:1993
standard implementations to conform.

Also, in the current standard, it is unspecified whether character class expressions like "[:lower:]" can include
multi-character collating elements like "ij"; hence "[[:lower:]]" can match "ij", and
"[^[:lower:]]" can fail to match "ij". Common practice is for a character class expression to match a collating
element if it matches the collating element's first character.

The definition of which sequence is matched when several are possible is based on the leftmost-longest rule historically used by
deterministic recognizers. This rule is easier to define and describe, and arguably more useful, than the first-match rule
historically used by non-deterministic recognizers. It is thought that dependencies on the choice of rule are rare; carefully
contrived examples are needed to demonstrate the difference.

A formal expression of the leftmost-longest rule is:

The search is performed as if all possible suffixes of the string were tested for a prefix matching the pattern; the
longest suffix containing a matching prefix is chosen, and the longest possible matching prefix of the chosen suffix is identified
as the matching sequence.
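
The rule can be read directly as a brute-force search. The following is a minimal sketch under stated assumptions: the function
name leftmost_longest is hypothetical, patterns are written in Python re syntax, and Python's own matcher is used only to answer
the question "does this exact span match?". It is meant to illustrate the rule, not to suggest how an implementation should be
built.

```python
import re

def leftmost_longest(pattern, text):
    """Return (start, end) of the leftmost-longest match, or None.

    Suffixes are tried from the longest (start = 0) onward; for each
    suffix, prefixes are tried from the longest to the shortest.
    """
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):                # longest suffix first
        for end in range(len(text), start - 1, -1):   # longest prefix first
            if compiled.fullmatch(text, start, end):
                return start, end
    return None

text = "xab"
span = leftmost_longest(r"a|ab", text)
posix_style = text[span[0]:span[1]]               # longest match at that start
first_match = re.search(r"a|ab", text).group(0)   # first alternative that works

print(posix_style)  # ab
print(first_match)  # a
```

The alternation "a|ab" is one of the contrived cases in which the two rules disagree: both matches start at the same character,
but the first-match rule stops at the first alternative that succeeds.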

Historically, most RE implementations only match lines, not strings. However, that is more an effect of the usage than of an
inherent feature of REs themselves. Consequently, POSIX.1-2017 does not regard <newline> characters as special; they are
ordinary characters, and both a <period> and a non-matching list can match them. Those utilities (like grep) that do not allow
<newline> characters to match are responsible for eliminating any <newline> from strings before matching against the RE. The
regcomp() function, however, can provide support for such processing without violating the rules of this section.
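
The division of labor described above, in which the utility rather than the RE removes each <newline>, can be sketched as
follows. The helper name grep_lines is hypothetical, and Python's re module with its DOTALL flag is assumed as a stand-in for a
matcher in which <period> matches any character.

```python
import re

def grep_lines(pattern, text):
    """Return the lines of `text` in which `pattern` finds a match."""
    compiled = re.compile(pattern, re.DOTALL)  # let '.' match any character
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()          # drop the empty piece after a final <newline>
    return [line for line in lines if compiled.search(line)]

sample = "foo\nbar\nfoobar\n"

# Matching the whole string: '.' is free to match the <newline>.
whole_string = re.findall(r"o.b", sample, re.DOTALL)

# Matching line by line: the <newline> is gone before the RE is applied.
per_line = grep_lines(r"o.b", sample)

print(whole_string)  # ['o\nb', 'oob']
print(per_line)      # ['foobar']
```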

Some implementations of egrep have had very limited flexibility in handling complex EREs. POSIX.1-2017 does not attempt
to define the complexity of a BRE or ERE, but does place a lower limit on it: any RE must be handled, as long as it can be expressed
in 256 bytes or less. (Of course, this does not place an upper limit on the implementation.) There are historical programs using a
non-deterministic-recognizer implementation that should have no difficulty with this limit. It is possible that a good approach
would be to attempt to use the faster, but more limited, deterministic recognizer for simple expressions and to fall back on the
non-deterministic recognizer for those expressions requiring it. Non-deterministic implementations must be careful to observe the
rules on which match is chosen; the longest match, not the first match, starting at a given character is used.

The term "invalid" highlights a difference between this section and some others: POSIX.1-2017 frequently avoids mandating of
errors for syntax violations because they can be used by implementors to trigger extensions. However, the authors of the
internationalization features of REs wanted to mandate errors for certain conditions to identify usage problems or non-portable
constructs. These are identified within this rationale as appropriate. The remaining syntax violations have been left implicitly or
explicitly undefined. For example, the BRE construct "\{1,2,3\}" does not comply with the grammar. A conforming
application cannot rely on it producing an error nor matching the literal characters "\{1,2,3\}".

The term "undefined" was used in favor of "unspecified" because many of the situations are considered errors on some
implementations, and the standard developers considered that consistency throughout the section was preferable to mixing undefined
and unspecified.

BRE Ordinary Characters

BRE Special Characters

Periods in BREs

There is no additional rationale provided for this section.

RE Bracket Expression

Range expressions are, historically, an integral part of REs. However, the requirements of "natural language behavior" and
portability do conflict. In the POSIX locale, ranges must be treated according to the collating sequence and include such
characters that fall within the range based on that collating sequence, regardless of character values. In other locales, ranges
have unspecified behavior.

Some historical implementations allow range expressions where the ending range point of one range is also the starting point of
the next (for instance, "[a-m-o]"). This behavior should not be permitted, but to avoid breaking historical
implementations, it is now undefined whether it is a valid expression and how it should be interpreted.

Current practice in awk and lex is to accept escape sequences in bracket expressions as per XBD Escape Sequences and
Associated Actions, while the normal ERE behavior is to regard such a sequence as consisting of two characters. Allowing
the awk/lex behavior in EREs would change the normal behavior in an unacceptable way; it is expected that awk and lex will
decode escape sequences in EREs before passing them to regcomp() or comparable routines. Each utility describes the escape
sequences it accepts as an exception to the rules in this section; the list is not the same, for historical reasons.
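
A decoding pass of the kind expected of awk and lex might look like the sketch below. The function name decode_escapes is
hypothetical, and the set of sequences handled here (the single-character escapes of the XBD table) is an assumption; as noted
above, each utility documents its own list.

```python
# Single-character escapes from the XBD "Escape Sequences and Associated
# Actions" table (assumption: a given utility may accept a different set).
ESCAPES = {"a": "\a", "b": "\b", "f": "\f", "n": "\n",
           "r": "\r", "t": "\t", "v": "\v"}

def decode_escapes(ere):
    """Replace recognized escapes with the characters they name.

    Any other backslash pair (for example an escaped '.' or an escaped
    backslash) is copied through unchanged for the RE compiler to handle.
    """
    out = []
    i = 0
    while i < len(ere):
        if ere[i] == "\\" and i + 1 < len(ere):
            pair = ere[i:i + 2]
            out.append(ESCAPES.get(pair[1], pair))
            i += 2
        else:
            out.append(ere[i])
            i += 1
    return "".join(out)

decoded_tab = decode_escapes(r"a\tb")    # backslash-t becomes a real <tab>
kept_dot = decode_escapes(r"a\.b")       # escaped period is left alone
kept_backslash = decode_escapes(r"\\n")  # escaped backslash, then literal n
```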

As noted previously, the new syntax and rules have been added to accommodate other languages than English. The remainder of this
section describes the rationale for these modifications.

In the POSIX locale, a regular expression that starts with a range expression matches a set of strings that are contiguously
sorted, but this is not necessarily true in other locales. For example, a French locale might have the following behavior:

Such disagreements between matching and contiguous sorting are unavoidable because POSIX sorting cannot be implemented in terms
of a deterministic finite-state automaton (DFA), but range expressions by design are implementable in terms of DFAs.
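
The mismatch between range matching and sorting can be illustrated with a toy model. Everything locale-related in this sketch is
hypothetical: CHAR_ORDER is an invented per-character order for a French-like locale, and sort_key is an invented multi-level
sort key that compares base letters first and uses case only as a tiebreak. Real locale data is considerably more involved.

```python
import unicodedata

def nfc(s):
    """Normalize to composed form so each accented letter is one character."""
    return unicodedata.normalize("NFC", s)

# Hypothetical per-character collation order (assumption for illustration).
CHAR_ORDER = nfc("aAbBcCdDeEéÉ")

def in_range(ch, low="a", high="e"):
    """True if `ch` falls between `low` and `high` in CHAR_ORDER."""
    if ch not in CHAR_ORDER:
        return False
    return CHAR_ORDER.index(low) <= CHAR_ORDER.index(ch) <= CHAR_ORDER.index(high)

def sort_key(word):
    """Hypothetical multi-level key: base letters first, then case."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base.casefold(), [c.isupper() for c in word])

names = [nfc(n) for n in ["ESTIMÉ", "eurêka", "Alpha", "été", "alpha", "estimé"]]
ordered = sorted(names, key=sort_key)

# Names whose first character matches the range expression [a-e].
matched = [name for name in ordered if in_range(name[0])]
positions = [ordered.index(name) for name in matched]

print(ordered)
print(positions)  # [0, 1, 2, 5]
```

The matched names occupy positions 0, 1, 2, and 5 of the sorted list, so they do not form a contiguous run: the range test looks
at one character at a time, while the sort key considers the whole string.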

Historical implementations used native character order to interpret range expressions. The ISO POSIX-2:1993 standard
instead required collating element order (CEO): the order that collating elements were specified between the order_start and
order_end keywords in the LC_COLLATE category of the current locale. CEO had some advantages in portability over the
native character order, but it also had some disadvantages:

CEO could not feasibly be mimicked in user code, leading to inconsistencies between POSIX matchers and matchers in popular user
programs like Emacs, ksh, and Perl.

CEO was not consistent across implementations. In practice, CEO was often less portable than native character order. For
example, it was common for the CEOs of two implementation-supplied locales to disagree, even if both locales were named
"da_DK".

Because of these problems, some implementations of regular expressions continued to use native character order. Others used the
collation sequence, which is more consistent with sorting than either CEO or native order, but which departs further from the
traditional POSIX semantics because it generally requires "[a-e]" to match either 'A' or 'E' but not
both. As a result of this kind of implementation variation, programmers who wanted to write portable regular expressions could not
rely on the ISO POSIX-2:1993 standard guarantees in practice.
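
The remark that a collation sequence makes "[a-e]" match either 'A' or 'E' but not both can be checked with a short sketch. The
two interleaved orderings below are hypothetical collation sequences chosen for illustration, and range_members is a
hypothetical helper; neither is taken from any real locale definition.

```python
def range_members(low, high, order):
    """Characters from `low` to `high` inclusive in the given ordering."""
    return order[order.index(low):order.index(high) + 1]

# Native (code point) order: in ASCII all uppercase letters precede lowercase.
native = "".join(chr(c) for c in range(ord("A"), ord("z") + 1))

# Two hypothetical collation sequences that interleave the cases.
lower_first = "aAbBcCdDeE"
upper_first = "AaBbCcDdEe"

native_set = range_members("a", "e", native)
lower_first_set = range_members("a", "e", lower_first)
upper_first_set = range_members("a", "e", upper_first)

print(native_set)       # abcde
print(lower_first_set)  # aAbBcCdDe
print(upper_first_set)  # aBbCcDdEe
```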

While revising the standard, lengthy consideration was given to proposals to attack this problem by adding an API for querying
the CEO to allow user-mode matchers, but none of these proposals had implementation experience and none achieved consensus. Leaving
the standard alone was also considered, but rejected due to the problems described above.

The current standard leaves unspecified the behavior of a range expression outside the POSIX locale. This makes it clearer that
conforming applications should avoid range expressions outside the POSIX locale, and it allows implementations and compatible
user-mode matchers to interpret range expressions using native order, CEO, collation sequence, or other, more advanced techniques.
The concerns which led to this change were raised in IEEE PASC interpretation 1003.2 #43 and others, and related to ambiguities in
the specification of how multi-character collating elements should be handled in range expressions. These ambiguities had led to
multiple interpretations of the specification, in conflicting ways, which led to varying implementations. As noted above, efforts
were made to resolve the differences, but no solution has been found that would be specific enough to allow for portable software
while not invalidating existing implementations.

The standard developers recognize that collating elements are important, such elements being common in several European
languages; for example, 'ch' or 'll' in traditional Spanish; 'aa' in several Scandinavian languages.
Existing internationalized implementations have processed, and continue to process, these elements in range expressions. Efforts
are expected to continue in the future to find a way to define the behavior of these elements precisely and portably.

The ISO POSIX-2:1993 standard required "[b-a]" to be an invalid expression in the POSIX locale, but this
requirement has been relaxed in this version of the standard so that "[b-a]" can instead be treated as a valid expression
that does not match any string.

The standard specifies three possible behaviors for regular expressions such as "[:alpha:]". One behavior is the
traditional implementation, which behaves like "[:ahlp]". Another, for alignment with the tr utility, is to treat it like
"[[:alpha:]]". And finally, the standard allows rejecting the regular expression as invalid, as a means of alerting a user to the
non-portable aspect of that regular expression.
The set of regular expressions with this undefined behavior is limited solely to the expressions where the outer '[' and
']' of the bracket expression can be confused with the missing bracket pair '[' and ']' necessary to
form a collating symbol, equivalence class, or character class; thus "[_:alpha:]" or "[::]" do not trigger the
unspecified behavior.

BREs Matching Multiple Characters

The limit of nine back-references to subexpressions in the RE is based on the use of a single-digit identifier; increasing this
to multiple digits would break historical applications. This does not imply that only nine subexpressions are allowed in REs. The
following is a valid BRE with ten subexpressions:
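
As an illustration of the distinction between the number of subexpressions and the number of available back-references, the
sketch below counts the "\(" openers in a BRE that has ten groups. The helper name count_subexpressions is hypothetical, and the
scanner is deliberately simplified: it does not parse bracket expressions.

```python
def count_subexpressions(bre):
    r"""Count `\(` openers in a BRE, treating each escape as one unit.

    Simplification (assumption): bracket expressions are not parsed, so a
    `\(` appearing inside `[...]` would be miscounted.
    """
    count = 0
    i = 0
    while i < len(bre):
        if bre[i] == "\\" and i + 1 < len(bre):
            if bre[i + 1] == "(":
                count += 1
            i += 2      # skip the escaped character together with its backslash
        else:
            i += 1
    return count

# Illustrative BRE with ten groups; only the first nine could be named by a
# single-digit back-reference.
example = r"\(\(\(ab\)*c\)*d\)\(ef\)*\(gh\)\{2\}\(ij\)*\(kl\)*\(mn\)*\(op\)*\(qr\)*"
total = count_subexpressions(example)

print(total)  # 10
```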

The standard developers regarded the common historical behavior, which supported "\n*", but not
"\n\{min,max\}", "\(...\)*", or "\(...\)\{min,max\}", as a non-intentional result of a specific
implementation, and they supported both duplication and interval expressions following subexpressions and back-references.

The changes to the processing of the back-reference expression remove an unspecified or ambiguous behavior in the Shell and
Utilities volume of POSIX.1-2017, aligning it with the requirements specified for the regcomp() expression, and are the result of
PASC Interpretation 1003.2-92 #43 submitted for the ISO POSIX-2:1993 standard.

BRE Precedence

There is no additional rationale provided for this section.

BRE Expression Anchoring

Often, the <dollar-sign> is viewed as matching the ending <newline> in text files. This is not strictly true; the
<newline> is typically eliminated from the strings to be matched, and the <dollar-sign> matches the terminating null
character.

The ability of '^', '$', and '*' to be non-special in certain circumstances may be confusing to
some programmers, but this situation was changed only in a minor way from historical practice to avoid breaking many historical
scripts. Some consideration was given to making the use of the anchoring characters undefined if not escaped and not at the
beginning or end of strings. This would cause a number of historical BREs, such as "2^10", "$HOME", and
"$1.35", that relied on the characters being treated literally, to become invalid.

However, one relatively uncommon case was changed to allow an extension used on some implementations. Historically, the BREs
"^foo" and "\(^foo\)" did not match the same string, despite the general rule that subexpressions and entire BREs
match the same strings. To increase consensus, POSIX.1-2017 has allowed an extension on some implementations to treat these two
cases in the same way by declaring that anchoring may occur at the beginning or end of a subexpression. Therefore, portable
BREs that require a literal <circumflex> at the beginning or a <dollar-sign> at the end of a subexpression must escape
them. Note that a BRE such as "a\(^bc\)" will either match "a^bc" or nothing on different systems under the
rules.
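
The portable part of the BRE anchoring rule, in which '^' is an anchor only as the first character and '$' only as the last
character of the entire BRE, can be sketched as a small classifier. The function names are hypothetical, and the sketch
deliberately ignores the optional subexpression anchoring discussed above.

```python
def bre_tokens(bre):
    """Split a BRE into tokens, keeping each backslash escape as one token."""
    tokens = []
    i = 0
    while i < len(bre):
        step = 2 if bre[i] == "\\" and i + 1 < len(bre) else 1
        tokens.append(bre[i:i + step])
        i += step
    return tokens

def anchors(bre):
    """Return (has_left_anchor, has_right_anchor) for a whole BRE.

    Simplification (assumption): only the start and end of the entire BRE
    are examined; anchoring inside subexpressions is not modeled.
    """
    tokens = bre_tokens(bre)
    left = bool(tokens) and tokens[0] == "^"
    right = bool(tokens) and tokens[-1] == "$"
    return left, right

for bre in ["2^10", "$HOME", "$1.35", "^foo$", r"foo\$"]:
    print(bre, anchors(bre))
```

Under this reading, the three historical BREs quoted above contain no anchors at all, which is why they continue to match their
literal text.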

ERE anchoring has been different from BRE anchoring in all historical systems. An unescaped anchor character has never matched
its literal counterpart outside a bracket expression. Some implementations treated "foo$bar" as a valid expression that
never matched anything; others treated it as invalid. POSIX.1-2017 mandates the former, valid unmatched behavior.

Some implementations have extended the BRE syntax to add alternation. For example, the subexpression "\(foo$\|bar\)"
would match either "foo" at the end of the string or "bar" anywhere. The extension is triggered by the use of the
undefined "\|" sequence. Because the BRE is undefined for portable scripts, the extending system is free to make other
assumptions, such as that the '$' represents the end-of-line anchor in the middle of a subexpression. If it were not for the
extension, the '$' would match a literal <dollar-sign> under the rules.

As with BREs, the standard developers decided to make the interpretation of escaped ordinary characters undefined.

The <right-parenthesis> is not listed as an ERE special character because it is only special in the context of a preceding
<left-parenthesis>. If found without a preceding <left-parenthesis>, the <right-parenthesis> has no special
meaning.

The interval expression, "{m,n}", has been added to EREs. Historically, the interval expression has only been
supported in some ERE implementations. The standard developers estimated that the addition of interval expressions to EREs would
not decrease consensus and would also make BREs more of a subset of EREs than in many historical implementations.

It was suggested that, in addition to interval expressions, back-references ('\n') should also be added to EREs. This
was rejected by the standard developers as likely to decrease consensus.

In historical implementations, multiple duplication symbols are usually interpreted from left to right and treated as additive.
As an example, "a+*b" matches zero or more instances of 'a' followed by a 'b'. In POSIX.1-2017,
multiple duplication symbols are undefined; that is, they cannot be relied upon for conforming applications. One reason for this is
to provide some scope for future enhancements.

The precedence of operations differs between EREs and those in lex; in lex, for historical reasons, interval expressions have a lower precedence than
concatenation.
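
The effect of the precedence difference can be seen on a pattern such as "ab{2}". The sketch below assumes Python's re module as
a stand-in for an ERE matcher (its precedence for intervals agrees with the ERE rule) and uses explicit grouping to show what
the lex reading would amount to; it does not invoke lex itself.

```python
import re

# ERE reading: the interval binds to the single preceding expression,
# so "ab{2}" means 'a' followed by "bb".
ere_reading = re.fullmatch(r"ab{2}", "abb") is not None

# lex reading, written with explicit grouping: concatenation binds first,
# so the interval repeats "ab" as a unit.
lex_reading = re.fullmatch(r"(ab){2}", "abab") is not None

# Under the ERE reading the ungrouped pattern does not match "abab".
ere_rejects_abab = re.fullmatch(r"ab{2}", "abab") is None

print(ere_reading, lex_reading, ere_rejects_abab)  # True True True
```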

ERE Precedence

ERE Expression Anchoring

The grammars are intended to represent the range of acceptable syntaxes available to conforming applications. There are
instances in the text where undefined constructs are described; as explained previously, these allow implementation extensions.
There is no intended requirement that an implementation extension must somehow fit into the grammars shown here.

The BRE grammar does not permit L_ANCHOR or R_ANCHOR inside "\(" and "\)" (which implies that '^' and
'$' are ordinary characters). This reflects the semantic limits on the application, as noted in XBD BRE Expression Anchoring. Implementations are permitted to extend the
language to interpret '^' and '$' as anchors in these locations, and as such, conforming applications cannot use
unescaped '^' and '$' in positions inside "\(" and "\)" that might be interpreted as
anchors.

BRE/ERE Grammar Lexical Conventions

RE and Bracket Expression Grammar

The removal of the Back_open_paren Back_close_paren option from the nondupl_RE specification is the result
of PASC Interpretation 1003.2-92 #43 submitted for the ISO POSIX-2:1993 standard. Although the grammar required support for
null subexpressions, this section does not describe the meaning of, and historical practice did not support, this construct.