Regular Expressions

WARNING: This is a woefully incomplete overview of regular expressions. It would be absurd to try to fully cover the topic in a short handout like this. Hopefully, this will provide some of the basics to get you started, but to really understand regular expressions, I suggest you read as much of Mastering Regular Expressions by Jeffrey E.F. Friedl as you have time for. In addition, the regular expressions chapter in Eloquent JavaScript is more comprehensive than what appears below.

A regular expression is a sequence of characters that describes or matches a given amount of text. For example, the sequence bob, considered as a regular expression, would match any occurrence of the word “bob” inside of another text. The following is a rather rudimentary introduction to the basics of regular expressions. We could spend the entire semester studying regular expressions if we put our minds to it. Nevertheless, we’ll just have a basic introduction to them this week and learn more advanced techniques as we explore different text processing applications over the course of the semester.

Regular expressions (referred to as ‘regex’ for short) have both literal characters and meta characters. In bob, all three characters are literal, i.e. the ‘b’ wants to match a ‘b’, the ‘o’ an ‘o’, etc. We might also have the regular expression:

^bob

In this case, the ‘^’ is a meta character, i.e. it does not want to match the character ‘^’, but instead indicates the “beginning of a line.” In other words the regex above would find a match in:

bob goes to the park.

but would not find a match in:

jill and bob go to the park.
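
As a minimal sketch, the two cases above can be checked in JavaScript (the language used later in this handout) with the RegExp test() method. The variable names here are illustrative only.

```javascript
var regex = /^bob/;

// "bob" is at the very beginning of the line, so this matches
var startsWithBob = regex.test('bob goes to the park.');

// "bob" appears, but not at the beginning, so this does not match
var bobInMiddle = regex.test('jill and bob go to the park.');

console.log(startsWithBob, bobInMiddle); // true false
```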

Here are a few common metacharacters to get us started (I’m listing them below as they would appear in a Java regular expression, which may differ slightly from Perl, PHP, .NET, etc.):

Position Metacharacters:

^ beginning of line
$ end of line
\b word boundary
\B a non word boundary

Single Character Metacharacters:

. any one character
\d any digit from 0 to 9
\w any word character (a-z, A-Z, 0-9, and underscore)
\W any non-word character
\s any whitespace character (space, tab, new line, form feed, carriage return)
\S any non whitespace character

Quantifiers (each refers to the character or group that precedes it):

? appearing once or not at all
* appearing zero or more times
+ appearing one or more times
{min,max} appearing within the specified range
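
A small sketch of the four quantifiers in JavaScript syntax may help. The patterns and test strings below are made up for illustration, and each pattern is anchored with ^ and $ so the whole string has to match.

```javascript
// ? : the preceding "u" may appear once or not at all
var optionalU = /^colou?r$/;
console.log(optionalU.test('color'), optionalU.test('colour')); // true true

// * : the preceding "b" may appear zero or more times
var zeroOrMoreB = /^ab*$/;
console.log(zeroOrMoreB.test('a'), zeroOrMoreB.test('abbb')); // true true

// + : at least one digit is required
var oneOrMoreDigits = /^\d+$/;
console.log(oneOrMoreDigits.test('2024'), oneOrMoreDigits.test('')); // true false

// {2,4} : between two and four word characters
var twoToFour = /^\w{2,4}$/;
console.log(twoToFour.test('abc'), twoToFour.test('abcde')); // true false
```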

Using the above, we could come up with some quick examples:

^$ –> matches beginning of line followed by end of line, i.e. match any blank line!

ing\b –> matches ‘ing’ followed by a word boundary, i.e. any time ‘ing’ appears at the end of a word!
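
Both examples can be tried out with a short JavaScript sketch. One detail worth knowing: in JavaScript, ^ and $ match only the start and end of the whole String unless the m (multiline) flag is added, in which case they also match at each line break. The sample strings are invented for this illustration.

```javascript
// A two-line gap: "first", then a blank line, then "third"
var lines = 'first\n\nthird';

var blankWithoutM = /^$/.test(lines);  // false: the String as a whole is not empty
var blankWithM = /^$/m.test(lines);    // true: the middle line is blank

// "ing" followed by a word boundary
var endsWord = /ing\b/.test('walking home'); // true: "walking" ends in "ing"
var midWord = /ing\b/.test('ingot');         // false: "ing" is followed by "o"

console.log(blankWithoutM, blankWithM, endsWord, midWord);
```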

Character Classes allow one to do an “or” statement amongst individual characters and are denoted by characters enclosed in brackets, i.e. [aeiou] means match any vowel. Using a “^” negates the character class, i.e. [^aeiou] means match any character not a vowel (note this isn’t just limited to letters, it really means anything at all that is not an a, e, i, o, or u.) A hyphen indicates a range of characters, such as [0-9] or [a-z].
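
Here is a brief sketch of character classes in JavaScript. It uses the g flag and the String match() method, both covered later in this handout, simply to list every match. The sample strings are assumptions made for the example.

```javascript
// [aeiou] matches any single vowel
var vowels = 'regular expressions'.match(/[aeiou]/g);
console.log(vowels.length); // 7

// [^aeiou] matches anything that is not a vowel, including a space
var nonVowels = 'a e'.match(/[^aeiou]/g);
console.log(nonVowels); // [ ' ' ]

// [0-9] is a range: any single digit
var hasDigit = /[0-9]/.test('room 42');
console.log(hasDigit); // true
```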

Another key metacharacter is |, meaning or. This is known as the concept of Alternation.

John|Jon -> match “John” or “Jon”

Note: this regex could also be written as Joh?n, meaning match “Jon” with an optional “h” between the “o” and “n.”
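
To illustrate, the sketch below checks that the alternation and the optional-character version accept the same names. The helper function acceptsBoth is hypothetical and exists only for this comparison.

```javascript
var alternation = /^(John|Jon)$/;
var optionalH = /^Joh?n$/;

// Hypothetical helper: true only when both regexes agree that the name matches
function acceptsBoth(name) {
  return alternation.test(name) && optionalH.test(name);
}

console.log(acceptsBoth('John')); // true
console.log(acceptsBoth('Jon'));  // true
console.log(acceptsBoth('Joan')); // false
```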

Parentheses can also be used to constrain the alternation, i.e.:

(212|646|917)\d* matches any sequence of zero or more digits preceded by 212, 646, or 917 (presumably to retrieve phone #’s with NYC area codes). Note this regular expression would need to be improved to take into consideration white spaces and/or punctuation.
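
A rough sketch of this pattern in JavaScript, using the exec() method described later in this handout. The input strings are invented, and as noted above the pattern does not yet handle spaces or punctuation.

```javascript
var regex = /(212|646|917)\d*/;

// exec() returns the whole match at index 0 and the parenthesized group at index 1
var found = regex.exec('Call 6465551234 now');
console.log(found[0]); // '6465551234'
console.log(found[1]); // '646'

// No 212, 646, or 917 anywhere in this number, so there is no match
var notFound = regex.exec('Call 4155551234 now');
console.log(notFound); // null
```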

Parentheses also serve the purpose of capturing groups for back-references. For example, examine the following regular expression: \b([0-9A-Za-z]+)\s+\1\b.

The first part of the expression in parentheses reads: \b([0-9A-Za-z]+) meaning match any “word” containing one or more letters/digits. The next part \s+ means any sequence of at least one white space. The third part \1 says match whatever you matched that was enclosed inside the first set of parentheses, i.e. ([0-9A-Za-z]+). So, thinking this over, what will this regular expression match in the following line:

This is really really super super duper duper fun. Fun!
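
One way to check your answer is a sketch like the following, which runs the regex in JavaScript with the g flag and the String match() method (both introduced later in this handout) to list every match.

```javascript
var text = 'This is really really super super duper duper fun. Fun!';
var regex = /\b([0-9A-Za-z]+)\s+\1\b/g;

// With the g flag, match() returns an array of every matched substring
var doubles = text.match(regex);
console.log(doubles); // [ 'really really', 'super super', 'duper duper' ]
```

Note that “fun. Fun!” is not matched: the period means the two words are not separated by whitespace alone, and the back-reference is case-sensitive.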

Testing regex with egrep

grep is a unix command line utility that takes an input file, a regular expression and outputs the lines that contain matches for that regular expression. It’s a quick way for us to test some regexes. As a point of history, the name comes from the form “g/re/p” which stands for “Global Regular Expression Print.” We’ll be using egrep, which allows for more sophisticated regular expression searches. (Note: the examples below use a slightly different regex “flavor” than what we will see in JavaScript. This is something we’ll have to get used to, and will likely cause a bit of confusion. Not to worry, confusion over regular expression flavors is extremely normal. No need to seek professional help.)

The syntax is simple:

egrep -flags 'regexpattern' filename

If we want to send the output to a file (>> appends to the file if it already exists):

egrep -flags 'regexpattern' filename >> outputfilename

% egrep -i 'four' bible.txt
% egrep -i 'five' bible.txt

The -i flag indicates that the match should be case-insensitive. The manual page for the “egrep” command (man egrep) has a full list of flags.

Match URL’s:

Match double words:

(Note: in the above example, the metacharacter \< means “start of word boundary” and \> means “end of word boundary.” This is different from the \b we’ll find in JavaScript, which is the metacharacter for the beginning or end of a word, also known as a ‘word boundary’.)

Regular Expressions in JavaScript

In JavaScript, regular expressions, like Strings, are objects. For example, a regex object can be created like so:

var regex = new RegExp('aregex');

While the above is technically correct (and sometimes necessary, we'll get to this later), a more common way to create a regex in JavaScript is with forward slashes. Whereas a String is an array of characters between quotes, a regex is an array of characters between forward slashes. For example:

var regex = /aregex/;

The RegExp object has two methods we'll focus on. The key method to examine is exec(), which executes a search in a given String for matches of the regular expression. It returns an array of information including the matched String, the index where the String appears, and the input String (in case you forgot).

For example:

var text = "This is a test.";   // The String to search in
var regex = /test/;             // The regex
var results = regex.exec(text); // Execute the search

results now contains the following array:

['test',index:10,input:'This is a test.']

If the regular expression included capturing parentheses, the groups would also appear in the array. For example, let's say you needed a regex to match any phone numbers in a String.

The regex /(\d+)[-.]\d+[-.]\d+/ isn't necessarily the greatest phone number matching regex, but it'll do for this example: one or more digits followed by a dash or period, followed by one or more digits, a dash or period again, and one or more digits. Let's look at the resulting array.
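
A minimal sketch of this example, assuming the same input String and regex that appear in the while loop example later in this section (here without the global flag):

```javascript
var text = 'Phone numbers: 212-555-1234 and 917-555-4321 and 646.555.9876.';
var regex = /(\d+)[-.]\d+[-.]\d+/; // the parentheses capture the area code
var results = regex.exec(text);

console.log(results[0]);    // '212-555-1234' (the full match)
console.log(results[1]);    // '212' (the captured group)
console.log(results.index); // 15 (where the match starts in the input)
```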

Notice how the full phone number match appears as the first (index 0) element and the captured group (the area code) follows. You might notice, however, that there are three phone numbers in the original input String and yet exec() only matched the first one. In order to find all the matches, we'll need to add two more steps.

Add the global flag: g.

Regular expressions can include flags that modify how the search operates. For example the flag i is for case-insensitivity so that the regular expression hello with the flag i would match “hello”, “Hello”, “HELLO”, and “hElLO” (and other permutations). A flag is added after the second forward slash like so: /hello/i. The global flag g tells the regular expression that we want to search for all of the matches and not just the first one.
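
As a quick sketch of the i flag, here it is written both as a literal and with the RegExp constructor, which takes the flags as an optional second argument. The variable names are illustrative.

```javascript
var caseSensitive = /hello/;
var caseInsensitive = /hello/i;
var fromConstructor = new RegExp('hello', 'i'); // same regex, built from Strings

var strictResult = caseSensitive.test('hElLO');        // false
var relaxedResult = caseInsensitive.test('hElLO');     // true
var constructorResult = fromConstructor.test('HELLO'); // true

console.log(strictResult, relaxedResult, constructorResult);
```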

var regex = /(\d+)[-.]\d+[-.]\d+/g; // Now includes the global flag

Add a while loop to continue calling exec()

The exec() function, even with the global flag, will still return only the first match. However, if we call exec() a second time with the same regex and input String, it will move on and return the results for the second match (or null if there is no match.) We can therefore write a while loop to keep checking until the result is null.

var text = 'Phone numbers: 212-555-1234 and 917-555-4321 and 646.555.9876.';
var regex = /(\d+)[-.]\d+[-.]\d+/g;
var results = regex.exec(text);
while (results != null) {
  // do something with the matched results and then
  // Check again
  results = regex.exec(text);
}

This could also be written with the following shorthand. (The examples linked from here, however, use the longer-winded code for clarity.)

var results;
while ((results = regex.exec(text)) != null) {
  // do something with the matched results
}

The RegExp object also includes another method test() which simply returns true or false depending on whether or not at least one match was found.

var text = 'This is a regex example.';
var regex = /example/;
var found = regex.test(text); // Results in true

The String object also includes methods that receive regular expression objects as arguments. For example, match() works almost identically to exec(). There are only two differences. First, the method is called on a String with a RegExp as an argument. Second, it works differently in the case of global matches. Let's look at a simple example first.

var text = "This is a test of regular expressions.";
var regex = /test/;
var results = text.match(regex);

The above produces the identical result as we saw with exec() with results containing the following array.

['test',index:10,input:'This is a test of regular expressions.']

If we try a global match of phone numbers, however, we'll get different results.
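
A sketch of such a global match, assuming the same phone number String and regex used with exec() earlier:

```javascript
var text = 'Phone numbers: 212-555-1234 and 917-555-4321 and 646.555.9876.';
var regex = /(\d+)[-.]\d+[-.]\d+/g; // note the global flag
var matches = text.match(regex);

console.log(matches.length); // 3
```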

Here we do not need to employ a loop and instead get an array of all the matches.

['212-555-1234','917-555-4321','646.555.9876']

This is quite a bit more convenient in many cases, however, we've lost some information. If we require the capturing group matches or the index locations of the matches, we'll need to go back to using exec() in RegExp.
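
For instance, a sketch along these lines collects the area code and position of every phone number with an exec() loop. The areaCodes and positions arrays are hypothetical names introduced only for this example.

```javascript
var text = 'Phone numbers: 212-555-1234 and 917-555-4321 and 646.555.9876.';
var regex = /(\d+)[-.]\d+[-.]\d+/g;

var areaCodes = []; // captured group from each match
var positions = []; // index where each match starts

var results;
while ((results = regex.exec(text)) != null) {
  areaCodes.push(results[1]);
  positions.push(results.index);
}

console.log(areaCodes); // [ '212', '917', '646' ]
console.log(positions); // [ 15, 32, 49 ]
```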

Another method of the String object is search() which works just like indexOf() returning the index of the match or a -1 if there is no match.
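
A short sketch of search(), reusing the test String from the exec() example above:

```javascript
var text = 'This is a test.';

var foundAt = text.search(/test/); // index of the first match
var missing = text.search(/\d/);   // there are no digits, so no match

console.log(foundAt); // 10
console.log(missing); // -1
```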

Splitting with Regular Expressions

We can now revisit the split function we examined previously and understand how regular expressions work as a delimiter. An input String is split into an array of substrings at each match of that regular expression. Here's a simple example that quickly counts the # of words (not perfect by any means).

var text = "This text has characters, spaces, and some punctuation.";
var regex = /\W+/; // one or more non-word chars (anything not a-z, A-Z, 0-9, or _)
var words = text.split(regex);
console.log('Total words: ' + words.length);

What if you would also like to keep the delimiters? To accomplish this, simply enclose the delimiter pattern in capturing parentheses. With var regex = /(\W+)/; the delimiters are included in the resulting array alongside the words.
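
A small sketch comparing the two forms. It uses a shorter, made-up input String so the output is easy to read.

```javascript
var text = 'one, two three';

// Without capturing parentheses the delimiters are discarded
var wordsOnly = text.split(/\W+/);
console.log(wordsOnly); // [ 'one', 'two', 'three' ]

// With capturing parentheses each delimiter appears between the words
var withDelimiters = text.split(/(\W+)/);
console.log(withDelimiters); // [ 'one', ', ', 'two', ' ', 'three' ]
```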

Search and Replace

Running a search and replace is one of the more powerful things one can do with regular expressions. This can be accomplished with the String's replace() method. The method receives two arguments, a regex and a replacement String. Wherever there is a regex match, it is replaced with the String provided.

var text = 'Replace every time the word "the" appears with the word ze.';
// \b is a word boundary
// You can think of this as an invisible boundary
// between a non-word character and a word character.
var regex = /\bthe\b/g;
var replaced = text.replace(regex, 'ze');

The result is:

Replace every time ze word "ze" appears with ze word ze.

We can also reference the matched text using a backreference to a captured group in the substitution string. A backreference to the first group is indicated as $1, $2 is the second, and so on and so forth.

var text = "Double the vowels.";
var regex = /([aeiou]+)/g;
var replaced = text.replace(regex, '$1$1');
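
The sketch below shows what the vowel-doubling replacement produces, plus a second illustration with two groups. The name-swapping example is an assumption added here and is not from the original handout.

```javascript
// Each run of vowels is captured as group 1 and written out twice
var doubled = 'Double the vowels.'.replace(/([aeiou]+)/g, '$1$1');
console.log(doubled); // 'Dououblee thee vooweels.'

// Two groups: $1 is the last name, $2 is the first name
var swapped = 'Doe, Jane'.replace(/(\w+), (\w+)/, '$2 $1');
console.log(swapped); // 'Jane Doe'
```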