Tokens in Regular Expressions

Introduction

Parentheses used in a regular expression not only group elements
of that expression together, but also designate any matches found
for that group as tokens. You can use tokens
to match other parts of the same text. One advantage of using tokens
is that they remember what they matched, so you can recall and reuse
matched text in the process of searching or replacing.

Each token in the expression is assigned a number, starting
from 1, going from left to right. To make a reference to a token later
in the expression, refer to it using a backslash followed by the token
number. For example, when referencing a token generated by the third
set of parentheses in the expression, use \3.

As a simple example, if you wanted to search for identical sequential
letters in a character array, you could capture the first letter as
a token and then search for a matching character immediately afterwards.
In the expression '(\S)\1', the (\S) phrase
creates a token whenever regexp matches any nonwhitespace
character in the character array. The second part of the expression, '\1',
looks for a second instance of the same character immediately following
the first.
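
A minimal sketch of this search follows. The sample sentence and the variable names txt, mat, and tok are assumptions chosen for illustration.

```matlab
% Sample text (an illustrative assumption).
txt = 'While I nodded, nearly napping, suddenly there came a tapping,';

% (\S) captures one nonwhitespace character as token 1;
% \1 then requires the same character to follow immediately.
[mat, tok] = regexp(txt, '(\S)\1', 'match', 'tokens');

disp(mat)        % the doubled-letter matches
disp(tok{1})     % the token captured for the first match
```

Each entry of mat is a two-character match, and the corresponding entry of tok holds the single character that was captured and then repeated.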

As another example, capture pairs of matching HTML tags (e.g., <a> and </a>)
and the text between them. The expression used for this example is:

expr = '<(\w+).*?>.*?</\1>';

The first part of the expression, '<(\w+)',
matches an opening bracket (<) followed by one
or more alphabetic, numeric, or underscore characters. The enclosing
parentheses capture the characters that follow the opening bracket
(the tag name) as a token.

The second part of the expression, '.*?>.*?',
matches the remainder of this HTML tag (characters up to the >),
and any characters that may precede the next opening bracket.

The last part, '</\1>', matches all
characters in the ending HTML tag. This tag is composed of the sequence </tag>,
where tag is whatever characters were captured
as a token.
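
The three parts described above can be tried together in a short sketch. The HTML fragment hstr is an assumption made up for illustration; only expr comes from the text.

```matlab
% Sample HTML fragment (an illustrative assumption).
hstr = '<a name="top"></a><b>Default</b><br>';
expr = '<(\w+).*?>.*?</\1>';

% 'match' returns each matched tag pair; 'tokens' returns the tag names.
[mat, tok] = regexp(hstr, expr, 'match', 'tokens');

% Print each match next to the tag name captured as token 1.
for k = 1:numel(mat)
    fprintf('%s  (tag: %s)\n', mat{k}, tok{k}{1});
end
```

The trailing <br> in this fragment has no closing tag, so the backreference \1 finds nothing to pair it with and it is not matched.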

Multiple Tokens

Here is an example of how tokens are assigned values. Suppose
that you are going to search the following text:

andy ted bob jim andrew andy ted mark

You choose to search the above text with the following search
pattern:

and(y|rew)|(t)e(d)

This pattern has three parenthetical expressions that generate
tokens. When you perform the search, the following tokens
are generated for each match:

Match     Token 1   Token 2
andy      y
ted       t         d
andrew    rew
andy      y
ted       t         d
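
To reproduce this search, a sketch along the following lines could be used. The variable names str, expr, mat, and tok are assumptions; the text and pattern are the ones given above.

```matlab
str  = 'andy ted bob jim andrew andy ted mark';
expr = 'and(y|rew)|(t)e(d)';

% Request both the matched text and the tokens for each match.
[mat, tok] = regexp(str, expr, 'match', 'tokens');

% List each match followed by the tokens captured for it.
for k = 1:numel(mat)
    fprintf('%-8s %s\n', mat{k}, strjoin(tok{k}, ' '));
end
```

The words bob, jim, and mark match neither alternative of the pattern, so they produce no entries in mat or tok.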

Only the highest level parentheses are used. For example, if
the search pattern and(y|rew) finds the text andrew,
token 1 is assigned the value rew. However, if
the search pattern (and(y|rew)) is used, token
1 is assigned the value andrew.

Unmatched Tokens

For those tokens specified in the regular expression that have
no match in the text being evaluated, regexp and regexpi return
an empty character vector ('') as the token output,
and an extent that marks the position in the string where the token
was expected.

This example runs regexp on a character vector holding the path returned
by the MATLAB® tempdir function. The regular expression expr includes
six token specifiers, one for each piece of the path. The third specifier, [a-z]+, has
no match in the character vector because this part of the path, Profiles,
begins with an uppercase letter.
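
A minimal sketch of such a call is shown here. It assumes a hard-coded path in place of the tempdir output so that the result does not depend on the machine; the path, the variable name chr, and the exact form of expr are assumptions chosen to be consistent with the tokens and extents listed in this section.

```matlab
% Hard-coded path standing in for the tempdir output (an assumption).
chr = 'C:\WINNT\Profiles\bpascal\LOCALS~1\Temp\';

% Six token specifiers. The third, ([a-z]+)?, is optional and cannot
% match 'Profiles' because of the uppercase P; .* then skips that folder.
expr = ['([A-Z]:)\\(WINNT)\\([a-z]+)?.*\\' ...
        '([a-z]+)\\([A-Z]+~\d)\\(Temp)\\'];

% tok holds the token text, ext the start and end index of each token.
[tok, ext] = regexp(chr, expr, 'tokens', 'tokenExtents');
```

Making the third specifier optional with ? is what allows the overall match to succeed even though that token matches nothing.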

When a token is not found in the text, regexp returns
an empty character vector ('') as the token and
a numeric array with the token extent. The first number of the extent
is the string index that marks where the token was expected, and the
second number of the extent is equal to one less than the first.

In the case of this example, the empty token is the third specified
in the expression, so the third token returned is empty:

tok{:}

ans =
'C:' 'WINNT' '' 'bpascal' 'LOCALS~1' 'Temp'

The third token extent returned in the variable ext has
the starting index set to 10, which is where the nonmatching term, Profiles,
begins in the path. The ending extent index is set to one less than
the starting index, or 9:

ext{:}

ans =
1 2
4 8
10 9
19 25
27 34
36 39

Tokens in Replacement Text

When using tokens in replacement text, reference them using $1, $2,
etc. instead of \1, \2, etc.
This example captures two tokens and reverses their order. The first, $1,
is 'Norma Jean' and the second, $2,
is 'Baker'. Note that regexprep returns
the modified text, not a vector of starting indices.

regexprep('Norma Jean Baker', '(\w+\s\w+)\s(\w+)', '$2, $1')

ans =
Baker, Norma Jean
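
As a further sketch, the two notations can appear in the same call: \1 inside the expression and $1 inside the replacement. The sample text and the variable name out are assumptions for illustration.

```matlab
% \1 (in the expression) finds a repeated nonwhitespace character;
% $1 (in the replacement) writes that character back once.
out = regexprep('nodded napping', '(\S)\1', '$1');
disp(out)   % each doubled letter is collapsed to a single letter
```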

Named Capture

If you use a lot of tokens in your expressions, it may be helpful
to assign them names rather than having to keep track of which token
number is assigned to which token.

To name a token, use the syntax (?<name>expr). When referencing a
named token within the expression, use the
syntax \k<name> instead of the numeric \1, \2,
etc.
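
A minimal sketch of a named backreference follows. The token name anychar, the sample sentence, and the variable names are assumptions chosen for illustration.

```matlab
% Sample text (an illustrative assumption).
txt = 'While I nodded, nearly napping, suddenly there came a tapping,';

% (?<anychar>.) captures one character under the name 'anychar';
% \k<anychar> refers back to it, just as \1 would for a numbered token.
mat = regexp(txt, '(?<anychar>.)\k<anychar>', 'match');
disp(mat)
```

The result is the same set of doubled characters that the numbered form '(.)\1' would find; only the way the token is referenced changes.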