Regular Expressions

Regular expressions are a class of objects supported by some object-oriented programming languages. They are patterns that can be used to search and manipulate strings.

The typical syntax for a regular expression places the pattern between slashes, with optional flag characters after the trailing slash. A trailing i indicates that case should be ignored when processing the pattern, and g indicates that the pattern should be processed globally instead of stopping at the first occurrence. Together, gi indicates that the pattern should be processed globally, ignoring case.
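As a small illustration of the flags, here is a sketch that assumes JavaScript, one language that uses this slash syntax. The sample text and variable names are made up for the example.

```javascript
// The pattern sits between the slashes; flags follow the closing slash.
const text = "Cat, cat, cat";

// No flags: only the first exact-case match is replaced.
const firstOnly = text.replace(/cat/, "dog");
// g: every exact-case match is replaced.
const everyMatch = text.replace(/cat/g, "dog");
// i: only the first match is replaced, but case is ignored.
const anyCase = text.replace(/cat/i, "dog");
// gi: every match is replaced, ignoring case.
const everyAnyCase = text.replace(/cat/gi, "dog");

console.log(firstOnly);    // Cat, dog, cat
console.log(everyMatch);   // Cat, dog, dog
console.log(anyCase);      // dog, cat, cat
console.log(everyAnyCase); // dog, dog, dog
```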

The supported methods will depend on the programming language but a regular expression class will probably support the following methods and possibly many more:

one that when passed a string will return true if the pattern matches the string or false if it doesn't

one that will extract the matching elements from the string and return them in an array

one that will replace matching substrings with a different string

one that will split the string wherever a match is found
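The four kinds of method listed above might look like this in JavaScript, which is assumed here purely as an example language. In JavaScript, test belongs to the regular expression object, while match, replace and split are string methods that accept a regular expression. Other languages arrange these differently.

```javascript
// Sample input (made up for this example); note the double space.
const sentence = "one two  three";

// 1. Test: true if the pattern matches somewhere in the string.
const hasDigit = /\d/.test(sentence);
const hasTwo = /two/.test(sentence);

// 2. Extract: with the g flag, match returns all matches in an array.
const words = sentence.match(/\w+/g);

// 3. Replace: swap every run of whitespace for a single hyphen.
const dashed = sentence.replace(/\s+/g, "-");

// 4. Split: break the string wherever the pattern matches.
const parts = sentence.split(/\s+/);

console.log(hasDigit, hasTwo); // false true
console.log(words);            // [ 'one', 'two', 'three' ]
console.log(dashed);           // one-two-three
console.log(parts);            // [ 'one', 'two', 'three' ]
```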

The most complex part of using regular expressions is working out how to code the patterns (the part between the slashes). A number of characters and character combinations have special meanings; all other characters are expected to match exactly with the characters in the string. Where a single character has a special meaning, that meaning can be overridden by preceding the character with a backslash. The characters and character combinations with special meanings are as follows:

^ indicates the start of the string. If this is the first character in the pattern then what follows must match against the very beginning of the string.

$ indicates the end of the string. If this is the last character in the pattern then what precedes it must match against the very end of the string.

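\b matches any word boundary. It does not match a character itself but the position between a word character (a letter, digit, or underscore) and a non-word character such as white space or punctuation, or between a word character and the start or end of the string.

\B matches any non-word boundary, that is, any position that is not a word boundary, such as the position between two letters in the middle of a word.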

\n matches a new line character.

\f matches a form feed character.

\r matches a carriage return character.

\t matches a horizontal tab.

\v matches a vertical tab.

\ooo matches the ASCII character represented by the octal number ooo.

\xhh matches the ASCII character represented by the hexadecimal number hh.

\ucccc matches the Unicode character represented by the four-digit hexadecimal number cccc.

. matches any character except a new line (or unicode equivalent).

\w matches any alphanumeric character (or underscore).

\W matches any character that is not an alphanumeric or underscore.

\d matches any digit (0 to 9).

\D matches any character that is not a digit.

\s matches any single whitespace character.

\S matches any single character that is not whitespace (so not a space, new line, form feed, carriage return, tab, and so on).

? makes the preceding character optional, e.g. ab?c will match both abc and ac.

* matches zero or more occurrences of the preceding character, e.g. ab*c will match both ac and abbbbbc.

+ matches on one or more of the preceding character.

{n} matches exactly n occurrences of the preceding character, e.g. \d{3} matches three consecutive digits.

{n,p} matches between n and p occurrences of the preceding character, e.g. \d{1,5} will match a run of one to five digits, such as 7 or 99999.

{n,} matches n or more occurrences of the preceding character.

()\n (where n is between 1 and 9) is a back-reference: \n matches the same text that was matched by the nth parenthesized group in the pattern, counting opening parentheses from the left. \1 refers to the first group and \9 to the ninth, e.g. (\w+)\s+\1 would match any word that appears twice in a row.
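Several of the entries above can be checked directly. The following is a minimal sketch, again assuming JavaScript. The checks table and its sample inputs are invented for illustration, and each row lists the result we expect from testing the pattern against the input.

```javascript
// Each row: [pattern, input, expected result of pattern.test(input)]
const checks = [
  [/^ab/, "abc", true],               // ^ anchors to the start
  [/^ab/, "cab", false],
  [/ab$/, "cab", true],               // $ anchors to the end
  [/\bcat\b/, "a cat sat", true],     // whole word only
  [/\bcat\b/, "concatenate", false],  // cat is inside a word
  [/\Bcat\B/, "concatenate", true],   // no boundary on either side
  [/^\d{3}$/, "042", true],           // exactly three digits
  [/^\d{1,5}$/, "123456", false],     // six digits is too many
  [/^ab?c$/, "ac", true],             // b is optional
  [/^ab*c$/, "abbbbbc", true],        // zero or more b
  [/^ab+c$/, "ac", false],            // at least one b required
  [/^\w+$/, "snake_case1", true],     // letters, digits, underscore
  [/^\S+$/, "no spaces", false],      // the space is whitespace
  [/\x41\u0042/, "AB", true],         // hex 41 is A, hex 0042 is B
  [/(\w+)\s+\1/, "hello hello world", true], // repeated word
  [/(\w+)\s+\1/, "hello world", false],
];

// Run every pattern against its input and report the outcome.
const results = checks.map(([pattern, input]) => pattern.test(input));
checks.forEach(([pattern, input, expected], i) => {
  console.log(`${pattern} on "${input}": ${results[i]} (expected ${expected})`);
});
```

Note that without \b around it, (\w+)\s+\1 can also match across parts of words: in "this is" the trailing "is" of "this" counts as the repeated text.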

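To apply the modifiers that affect the preceding character to a larger block of characters, surround the block with parentheses (), so a(bc)?d would match both abcd and ad. To provide alternatives, any one of which can be matched, use |, so ab|cd would match either ab or cd. The alternatives extend as far as the enclosing parentheses (or the whole pattern), so to choose between single characters inside a longer pattern you group them: a(b|c)d would match both abd and acd.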
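Grouping and alternation can be tried out with a short sketch like the one below, which assumes JavaScript. The patterns are anchored with ^ and $ so that each one has to account for the whole input. Note that | separates whole alternatives, so ab|cd means ab or cd, and parentheses are needed to limit the choice to single characters.

```javascript
// ? applies to the whole parenthesized group "bc".
const optionalGroup = /^a(bc)?d$/;
// | chooses between the complete alternatives "ab" and "cd".
const eitherPair = /^(ab|cd)$/;
// Grouping limits the choice to the single character b or c.
const middleChoice = /^a(b|c)d$/;

const groupResults = {
  abcd: optionalGroup.test("abcd"),      // true
  ad: optionalGroup.test("ad"),          // true
  abd: optionalGroup.test("abd"),        // false: "bc" must appear whole
  ab: eitherPair.test("ab"),             // true
  cd: eitherPair.test("cd"),             // true
  pairAbd: eitherPair.test("abd"),       // false
  middleAbd: middleChoice.test("abd"),   // true
  middleAcd: middleChoice.test("acd"),   // true
  middleAbcd: middleChoice.test("abcd"), // false: only one of b or c
};

console.log(groupResults);
```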

Regular expressions are a very powerful string manipulation tool, and programming languages that support them can easily perform find, replace, and other manipulations of string data with a minimum of code.

As an example of how you can use regular expressions for string manipulation, let's consider a date which we expect to consist of one or two digits followed by a separator, then one or two more digits, a second copy of the same separator and finally four more digits (this works both for day-first dates and for the US format that reverses the day and month fields). Let's work it out one piece at a time.

To test that we begin with a one or two digit number we use ^\d{1,2}, which tests that the string starts with one or two digits. Valid separators are / - or . so to test for the first separator character we use -|\/|\. (note that . needs to be preceded by \ to override its special meaning of any character except new line and restore its normal meaning of dot; / also needs to be preceded by \ as otherwise it would be taken to be the pattern terminator). The second group of one or two digits is \d{1,2} again. To check that the second separator matches the first, we surround the code for the first separator in parentheses and specify \1 as the match criterion for the second occurrence. Finally we use \d{4}$ to check that there are four digits at the end for the year.

So our final regular expression for testing dates is /^\d{1,2}(-|\/|\.)\d{1,2}\1\d{4}$/ and this pattern will match any string that consists entirely of a date in this form. Note that in this example we have not validated the individual numbers against their appropriate ranges, so it would still accept 99/99/2003 as being a valid date even though there are only twelve months in a year and at most 31 days in a month.
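The date check can be sketched as follows, assuming JavaScript. The year is written as \d{4} (braces, meaning four digits) and every digit group carries its backslash. The function name looksLikeDate is hypothetical, chosen only for this example.

```javascript
// One or two digits, a separator (- / or .), one or two digits,
// the same separator again (\1), then exactly four digits.
const datePattern = /^\d{1,2}(-|\/|\.)\d{1,2}\1\d{4}$/;

// Hypothetical helper: checks the shape of the string only,
// not whether the day and month values are in range.
function looksLikeDate(candidate) {
  return datePattern.test(candidate);
}

console.log(looksLikeDate("17/3/2003"));  // true
console.log(looksLikeDate("3-17-2003"));  // true
console.log(looksLikeDate("17.03.2003")); // true
console.log(looksLikeDate("17/3-2003"));  // false: separators differ
console.log(looksLikeDate("17/3/03"));    // false: year needs four digits
console.log(looksLikeDate("99/99/2003")); // true: ranges are not checked
```

Range checks on the day and month are easier to do in ordinary code after the match than inside the pattern itself.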