stringi-search-regex {stringi}

R Documentation

Regular Expressions in stringi

Description

A regular expression is a pattern that describes, possibly in a very
abstract way, a fragment of text.
With the many regex functions available in stringi,
regular expressions are a powerful tool
for string searching, substring extraction, string splitting,
and similar tasks.
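
As a brief illustration of the three tasks named above, the following sketch applies one stringi regex function to each. The input string and the patterns are made up for this example.

```r
library(stringi)

# Hypothetical input, chosen only for illustration.
x <- "apples: 3, pears: 12, plums: 7"

# Searching: does the string contain at least one digit?
has_digit <- stri_detect_regex(x, "[0-9]")

# Extraction: all runs of digits (a list with one element per input string).
numbers <- stri_extract_all_regex(x, "[0-9]+")

# Splitting: break the string at each comma and any spaces that follow it.
parts <- stri_split_regex(x, ",\\s*")
```

Here has_digit is TRUE, numbers[[1]] holds "3", "12", and "7", and parts[[1]] holds the three "name: count" fragments.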

Details

All stri_*_regex functions in stringi use
the ICU regex engine. Its settings may be adjusted (for example,
to perform a case-insensitive search) via the
stri_opts_regex function.
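
A minimal sketch of such tuning, assuming the stringi package is installed: the same pattern is searched for with the default settings and then with the case_insensitive option of stri_opts_regex. The sample vector x is hypothetical.

```r
library(stringi)

# Hypothetical sample data.
x <- c("Regex", "REGEX", "text")

# Default engine settings: matching is case-sensitive.
default_match <- stri_detect_regex(x, "regex")

# Case-insensitive search, requested through stri_opts_regex().
ci_match <- stri_detect_regex(x, "regex",
    opts_regex = stri_opts_regex(case_insensitive = TRUE))
```

With the defaults no element matches; with case_insensitive = TRUE the first two elements match and "text" does not.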

Regular expression patterns in ICU are quite similar in form and
behavior to Perl's regexes. Their implementation is loosely inspired
by JDK 1.4 java.util.regex.
ICU Regular Expressions conform to the Unicode Technical Standard #18
(see References section) and their features are summarized in
the ICU User Guide (see below). A good general introduction
to regexes is (Friedl, 2002).
Some general topics are also covered in the R manual, see regex.

ICU Regex Operators at a Glance

Here is a list of operators provided by the
ICU User Guide on regexes.

|

Alternation. A|B matches either A or B.

*

Match 0 or more times. Match as many times as possible.

+

Match 1 or more times. Match as many times as possible.

?

Match zero or one times. Prefer one.

{n}

Match exactly n times.

{n,}

Match at least n times. Match as many times as possible.

{n,m}

Match between n and m times.
Match as many times as possible, but not more than m.

*?

Match 0 or more times. Match as few times as possible.

+?

Match 1 or more times. Match as few times as possible.

??

Match zero or one times. Prefer zero.

{n}?

Match exactly n times.

{n,}?

Match at least n times, but no more than required
for an overall pattern match.

{n,m}?

Match between n and m times. Match as few times
as possible, but not less than n.

*+

Match 0 or more times. Match as many times as possible
when first encountered, do not retry with fewer even if overall match fails
(Possessive Match).

++

Match 1 or more times. Possessive match.

?+

Match zero or one times. Possessive match.

{n}+

Match exactly n times.

{n,}+

Match at least n times. Possessive match.

{n,m}+

Match between n and m times. Possessive match.

(...)

Capturing parentheses. Range of input that matched
the parenthesized sub-expression is available after the match,
see stri_match.

(?:...)

Non-capturing parentheses. Groups the included pattern,
but does not provide capturing of matching text. Somewhat more efficient
than capturing parentheses.

(?>...)

Atomic-match parentheses. First match of the
parenthesized sub-expression is the only one tried; if it does not lead to
an overall pattern match, back up the search for a match to a position
before the (?>.

(?#...)

Free-format comment (?# comment ).

(?=...)

Look-ahead assertion. True if the parenthesized
pattern matches at the current input position, but does not advance
the input position.

(?!...)

Negative look-ahead assertion. True if the
parenthesized pattern does not match at the current input position.
Does not advance the input position.

(?<=...)

Look-behind assertion. True if the parenthesized
pattern matches text preceding the current input position, with the last
character of the match being the input character just before the current
position. Does not alter the input position. The length of possible strings
matched by the look-behind pattern must not be unbounded (no *
or + operators).

(?<!...)

Negative look-behind assertion. True if the
parenthesized pattern does not match text preceding the current input
position, with the last character of the match being the input character
just before the current position. Does not alter the input position.
The length of possible strings matched by the look-behind pattern must
not be unbounded (no * or + operators).

(?<name>...)

Named capture group. The <angle brackets>
are literal - they appear in the pattern.

(?ismwx-ismwx:...)

Flag settings. Evaluate the parenthesized
expression with the specified flags enabled or -disabled,
see also stri_opts_regex.

(?ismwx-ismwx)

Flag settings. Change the flag settings.
Changes apply to the portion of the pattern following the setting.
For example, (?i) changes to a case-insensitive match,
see also stri_opts_regex.
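
The following sketch exercises a few of the operators listed above: greedy, reluctant, and possessive quantifiers, capturing and non-capturing parentheses, look-ahead and look-behind assertions, and an inline flag. All input strings and variable names are made up for illustration, and R string literals require each backslash to be doubled.

```r
library(stringi)

# Greedy, reluctant, and possessive quantifiers on similar inputs.
greedy    <- stri_extract_first_regex("<a><b>", "<.+>")   # longest possible match
reluctant <- stri_extract_first_regex("<a><b>", "<.+?>")  # shortest possible match
# a++ consumes every "a" and never gives one back, so the trailing "a" cannot match.
possessive <- stri_detect_regex("aaa", "a++a")

# Capturing vs. non-capturing parentheses: only (...) adds a column to the result.
groups <- stri_match_first_regex("key=value", "(\\w+)=(?:\\w+)")

# Look-ahead and look-behind test the context without consuming it.
ahead  <- stri_extract_first_regex("100 USD, 200 EUR", "\\d+(?= EUR)")
behind <- stri_extract_first_regex("cost $100 for 200 units", "(?<=\\$)\\d+")

# An inline flag switches to case-insensitive matching for the rest of the pattern.
inline_flag <- stri_detect_regex("HELLO", "(?i)hello")
```

Here greedy is "<a><b>" while reluctant is "<a>", and possessive is FALSE. The groups matrix has two columns (the whole match and the single captured group "key"), ahead is "200", behind is "100", and inline_flag is TRUE.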

ICU Regex Meta-characters at a Glance

Here is a list of meta-characters provided by the
ICU User Guide on regexes.

\a

Match a BELL, \u0007.

\A

Match at the beginning of the input. Differs from ^
in that \A will not match after a new line within the input.

\b

Match if the current position is a word boundary.
Boundaries occur at the transitions between word (\w) and non-word
(\W) characters, with combining marks ignored. For better word
boundaries, see ICU Boundary Analysis, e.g., stri_extract_all_words.

\B

Match if the current position is not a word boundary.

\cX

Match a control-X character.

\d

Match any character with the Unicode General Category of
Nd (Number, Decimal Digit).

\D

Match any character that is not a decimal digit.

\e

Match an ESCAPE, \u001B.

\E

Terminates a \Q ... \E quoted sequence.

\f

Match a FORM FEED, \u000C.

\G

Match if the current position is at the end of the
previous match.

\h

Match a Horizontal White Space character.
These are the characters with the Unicode General Category Space_Separator,
plus the ASCII tab, \u0009. [Since ICU 55]