You want to test whether a text matches a specific pattern of characters. You want to replace patterns of text with other patterns. You have text with repeating patterns and you would like to break the text up into discrete items.

Regular expressions ("regex") are a field unto themselves. If you wish to derive full benefit from this way of describing strings with patterns, you should consult a separate introduction. Priscilla Walmsley's XQuery (Chapter 18) has a clear summary of the functionality offered.

fn:matches($input, $regex, $flags) takes a string and a regular expression as input. If the regular expression matches any part of the string, the function returns true; otherwise it returns false. If you want the function to return true only when the pattern matches the entire string, enclose the pattern in anchors (^ at the beginning and $ at the end). Note that this differs from XML Schema patterns, where ^ and $ are implied.
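The effect of the anchors can be seen in a minimal sketch like the following; the sample strings are chosen only for illustration.

```xquery
(
  fn:matches("abracadabra", "bra"),    (: true: "bra" occurs somewhere in the string :)
  fn:matches("abracadabra", "^bra$"),  (: false: the whole string is not "bra" :)
  fn:matches("bra", "^bra$")           (: true: the pattern spans the entire string :)
)
```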

fn:replace($input, $regex, $string, $flags) takes a string, a regular expression, and a replacement string as input. It returns a copy of the input string in which every match of the pattern is replaced with the replacement string. You can use $1 to $99 in the replacement string to re-insert groups of characters captured with parentheses.
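As a short illustration, the following sketch shows a plain replacement and one that re-inserts a captured group; the input strings are arbitrary samples.

```xquery
(
  (: every match of "bra" is replaced :)
  fn:replace("abracadabra", "bra", "*"),       (: "a*cada*" :)

  (: (.) captures the character after each "a"; $1 re-inserts it twice :)
  fn:replace("abracadabra", "a(.)", "a$1$1")   (: "abbraccaddabbra" :)
)
```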

fn:tokenize($input, $regex, $flags) returns a sequence of strings consisting of the substrings of the input string that lie between the matches of the pattern. The sequence does not contain the matches themselves.
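A minimal sketch of tokenizing, using made-up input strings, might look like this. Note that a separator at the very start of the input produces an empty first item.

```xquery
(
  fn:tokenize("The cat sat on the mat", "\s+"),  (: "The", "cat", "sat", "on", "the", "mat" :)
  fn:tokenize("1, 15, 24, 50", ",\s*"),          (: "1", "15", "24", "50" :)
  fn:tokenize(",a,b", ",")                       (: "", "a", "b" :)
)
```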

In regular expressions, most characters represent themselves, so you are not obliged to use the special regex syntax in order to utilise these three functions. A dot (.) represents any single character except a newline. Immediately following a character or an expression such as a dot, one can add a quantifier which tells how many times the character should be repeated: "*" for "0, 1 or many times," "?" for "0 or 1 times," and "+" for "1 or many times." Adding "?" after a quantifier, as in "*?", makes it match the shortest substring that satisfies the pattern rather than the longest. NB: this only scratches the surface of the subject of regular expressions!
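The quantifiers can be tried out with a sketch such as the one below; the strings are illustrative only. The last two lines contrast the default (longest) match with the shortest match requested by the trailing "?".

```xquery
(
  fn:matches("color", "^colou?r$"),  (: true: "u" may occur 0 or 1 times :)
  fn:matches("ct", "^c.*t$"),        (: true: ".*" may match nothing at all :)
  fn:matches("ct", "^c.+t$"),        (: false: ".+" needs at least one character :)
  fn:replace("AAAA", "A+", "b"),     (: "b": one long match covers all four letters :)
  fn:replace("AAAA", "A+?", "b")     (: "bbbb": four shortest matches of one letter each :)
)
```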

The three functions all accept an optional $flags parameter to set matching modes. The following four flags are available, and they can be combined in a single string such as "im":

i makes the regex match case-insensitively.

s enables "single-line mode" or "dot-all" mode. In this mode, the dot matches every character, including newlines, so the string is treated as a single line.

m enables "multi-line mode". In this mode, the anchors "^" and "$" match at the start and end of each line in the string, in addition to the start and end of the string as a whole.

x enables "free-spacing mode". In this mode, whitespace in the regex pattern is ignored. This is mainly useful when a complicated regex has been split over several lines and the newlines are not meant to be matched.
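The four flags can be compared side by side in a sketch like this one, where each pair runs the same test without and with a flag. The inputs are invented for illustration; "&#10;" is the XQuery character reference for a newline.

```xquery
(
  fn:matches("Hello", "hello"),                  (: false :)
  fn:matches("Hello", "hello", "i"),             (: true: case is ignored :)

  fn:matches("a&#10;b", "a.b"),                  (: false: the dot does not match the newline :)
  fn:matches("a&#10;b", "a.b", "s"),             (: true: the dot matches every character :)

  fn:matches("one&#10;two", "^two$"),            (: false: anchors apply to the whole string :)
  fn:matches("one&#10;two", "^two$", "m"),       (: true: anchors also apply to each line :)

  fn:matches("helloworld", "hello world"),       (: false: the space must be matched :)
  fn:matches("helloworld", "hello world", "x")   (: true: whitespace in the pattern is ignored :)
)
```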

If you do not need a flag, you can omit the argument or pass the empty string "".

In a separator pattern such as ',\s*', "\s" represents one whitespace character and thus matches, for example, a newline before "orange" or a tab character before "yellow". Because it is quantified with "*", the pattern removes whitespace after the comma, but not before it. To remove whitespace on both sides of the comma, use the pattern '\s*,\s*'.
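A minimal sketch of the difference, assuming a hypothetical color list in which a newline ("&#10;") precedes "orange", a tab ("&#9;") precedes "yellow", and a stray space follows "orange":

```xquery
(: $colors is a made-up sample input, not taken from the text above :)
let $colors := "red,&#10;orange ,&#9;yellow"
return (
  fn:tokenize($colors, ",\s*"),     (: "red", "orange ", "yellow": the space before the comma stays :)
  fn:tokenize($colors, "\s*,\s*")   (: "red", "orange", "yellow" :)
)
```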

In a pattern such as '(\d)', "\d" represents any digit; the parentheses around "\d" capture whatever digit it matches as group 1; in the replacement string, "$1" is replaced by the matched digit.
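For instance, the following sketch wraps every digit of an arbitrary sample string in square brackets:

```xquery
(: each digit is captured by (\d) and re-inserted as $1 between brackets :)
fn:replace("a1b2c3", "(\d)", "[$1]")   (: "a[1]b[2]c[3]" :)
```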