Edges to Rubies: The Complete SketchUp Tutorial

Appendix RE—Regular Expressions in JavaScript and Ruby

A "regular expression" (aka regex) is a formal way of describing sequences of characters in strings. Regular expressions are useful when other string manipulation functions run out of gas. In most languages there is a function that will answer, "Does this string contain the substring 'xyz'?" But you might want to know, "Does this string contain the substring 'network' anywhere after 'slow'?"
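As a minimal sketch, the "network after slow" question might look like this in Ruby. The method name and the sample strings are invented for illustration.

```ruby
# Hypothetical helper: true when "network" appears somewhere after "slow".
# slow    - the literal characters "slow"
# .*      - any characters, zero or more times
# network - the literal characters "network"
def slow_then_network?(text)
  !!(text =~ /slow.*network/)
end

puts slow_then_network?("the slow office network")   # true
puts slow_then_network?("the network is slow")       # false
```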

Regex are powerful, but almost totally unreadable. I don't recommend them where there is a reasonable alternative. Things they are good at include advanced search and replace and string splitting.
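A short Ruby sketch of the two jobs just mentioned, search and replace and string splitting. The helper names and sample strings are made up for this illustration; gsub and split are standard Ruby String methods.

```ruby
# Search and replace: collapse every run of whitespace into a single space.
# \s+ - one or more whitespace characters
def squeeze_spaces(text)
  text.gsub(/\s+/, " ")
end

# Splitting: break a list on commas or semicolons, with optional spaces after.
# [,;] - a comma or a semicolon
# \s*  - zero or more whitespace characters
def split_list(text)
  text.split(/[,;]\s*/)
end

p squeeze_spaces("too   many\t spaces")   # "too many spaces"
p split_list("red, green;blue")           # ["red", "green", "blue"]
```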

While he did not invent regular expressions, Perl's originator Larry Wall was the first to bring regular expressions up to the syntactic front of the class, enclosing them in "/" delimiters. Ruby and JavaScript adopted Perl's regular expression syntax.

An Example

The SketchUp API will return a list of all the keyboard shortcuts. Each shortcut is returned as a key combination, a tab character and the associated command. "B", "Ctrl+B", "Backspace" or "Shift+Ctrl+W" are examples of key combinations. "Camera/Zoom Window" is the command following "Shift+Ctrl+W". This regular expression will match that pattern: /[^\t]*\t.*/. That is:

/—start of regex

[^\t]—a character class matching any character other than the tab character

*—the preceding class repeated zero or more times

\t—the tab character

.*—any character, repeated zero or more times

/—end of regex
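To see the pattern at work, here is a small Ruby sketch that tries it against a few lines. The sample lines are invented for this example (they are not output copied from SketchUp), and the constant names are hypothetical.

```ruby
# Sample lines in the "key combination, tab, command" form (made up here).
SAMPLE_SHORTCUTS = [
  "B\tTools/Paint Bucket",
  "Shift+Ctrl+W\tCamera/Zoom Window",
  "no tab in this line"
]

# [^\t]* - zero or more non-tab characters
# \t     - the tab character
# .*     - any characters, zero or more times
SHORTCUT_PATTERN = /[^\t]*\t.*/

SAMPLE_SHORTCUTS.each do |line|
  verdict = (line =~ SHORTCUT_PATTERN) ? "matches" : "does not match"
  puts "#{line.inspect} #{verdict}"
end
```

The first two lines match; the third does not, because the pattern requires a tab.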

You can enclose parts of a regex in parentheses. After a match, variables named $1, $2, etc. hold the characters matched within the first, second and so on sets of parentheses. This bit of Ruby shows a regex used to break apart the keyboard shortcut:

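# sample value: a key combination, a tab, then the command
shortcut = "Shift+Ctrl+W\tCamera/Zoom Window"
# $1 all characters before the first tab
# $2 all characters after the first tab
/([^\t]*)\t(.*)/.match( shortcut )
keycombo = $1
command = $2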

Ruby, like Perl before it, creates global variables named $1 and so on. JavaScript creates a global object, RegExp, and attaches those values to it as properties. The JavaScript equivalent would be keycombo = RegExp.$1. (And just to keep you on your toes, in Ruby match() is a method of the Regexp class, so regex.match( string ). In JavaScript, match() is a method of the String class, so string.match( regex );.)
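In Ruby the same split can also be done without the globals, because match() returns an object holding the parenthesized pieces (or nil when nothing matched). The sketch below assumes a hypothetical helper name, parse_shortcut, and a made-up sample line.

```ruby
# Hypothetical helper: split one shortcut line into [keycombo, command].
# Returns nil when the line contains no tab.
def parse_shortcut(line)
  # group 1: all characters before the first tab
  # group 2: all characters after the first tab
  found = /([^\t]*)\t(.*)/.match(line)
  return nil if found.nil?
  [found[1], found[2]]
end

keycombo, command = parse_shortcut("Shift+Ctrl+W\tCamera/Zoom Window")
puts keycombo   # Shift+Ctrl+W
puts command    # Camera/Zoom Window
```

Checking for nil first avoids reading leftover values when a line does not fit the pattern.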

Note: a regex without comments explaining what it does should never be coded. Another note: this appendix describes a subset of regular expressions that is common across many languages, including Ruby and JavaScript.

You've now seen many of the features of standard regular expressions including escaped characters (the \t is the tab character), metacharacters (the . matches any character), character classes ([^\t] matches any character except the tab), multipliers (the * says "zero or more" of the previous character or character class) and parenthesized subexpressions.

Simple Regex

Character Classes

/d[iou]g/ matches "dig", "dog" and "dug". A character class matches a single character that is part of the class. If the class starts with a caret, the class matches any character except those in the class: [^\t] matches any character except the tab character. A hyphen may be used to indicate a range: [a-z] matches lowercase alphabetics.
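A quick Ruby sketch of these character classes; the word list and the constant name are invented for the example.

```ruby
# d[iou]g - "d", then one of "i", "o" or "u", then "g"
DIG_DOG_DUG = /d[iou]g/

%w[dig dog dug dag].each do |word|
  verdict = (word =~ DIG_DOG_DUG) ? "match" : "no match"
  puts "#{word}: #{verdict}"
end

# [a-z]  - one lowercase letter
# [^a-z] - one character that is NOT a lowercase letter
puts("abc1" =~ /[^a-z]/)   # 3, the position of the first non-lowercase character
```

Only "dag" fails to match, since "a" is not in the class [iou].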

Certain character classes are predefined and assigned to escaped lowercase alphabetics:

\d—[0-9]

\s—whitespace

\w—[a-zA-Z0-9_]

Uppercase equivalents match any character except those matched by the corresponding lowercase:

\D—[^0-9]

\S—not whitespace

\W—[^a-zA-Z0-9_]
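The predefined classes can be tried out with a few standard Ruby String methods (scan, split, gsub). The sample text below is made up, and the + multiplier used here means "one or more".

```ruby
# \d+ - one or more digits
# \w+ - one or more word characters (letters, digits, underscore)
# \s+ - one or more whitespace characters
# \W+ - one or more non-word characters
SAMPLE_TEXT = "Layer_2  has 14 edges"

p SAMPLE_TEXT.scan(/\d+/)         # ["2", "14"]
p SAMPLE_TEXT.scan(/\w+/)         # ["Layer_2", "has", "14", "edges"]
p SAMPLE_TEXT.split(/\s+/)        # ["Layer_2", "has", "14", "edges"]
p SAMPLE_TEXT.gsub(/\W+/, "-")    # "Layer_2-has-14-edges"
```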

Multipliers

You saw * indicating zero or more. + matches one or more: /Ya+y!/ matches "Yay!", "Yaay!", "Yaaay!", and so on. /Ya{1,3}y!/ is similar, but it won't match four or more "a"s. This is the standard multipliers list:

*—zero or more

+—one or more

?—zero or one

{n}—exactly n

{n,}—n or more

{n,m}—n through m
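The multipliers in the list above can be compared side by side. This Ruby sketch runs each one against the same made-up word list; yay_matches is a hypothetical helper written for the example.

```ruby
# Sample words with zero through four "a"s.
YAY_WORDS = ["Yy!", "Yay!", "Yaay!", "Yaaay!", "Yaaaay!"]

# Hypothetical helper: the sample words that a pattern matches.
def yay_matches(pattern)
  YAY_WORDS.select { |word| word =~ pattern }
end

p yay_matches(/Ya*y!/)       # zero or more: all five words
p yay_matches(/Ya+y!/)       # one or more: everything except "Yy!"
p yay_matches(/Ya?y!/)       # zero or one: ["Yy!", "Yay!"]
p yay_matches(/Ya{2}y!/)     # exactly two: ["Yaay!"]
p yay_matches(/Ya{2,}y!/)    # two or more: ["Yaay!", "Yaaay!", "Yaaaay!"]
p yay_matches(/Ya{1,3}y!/)   # one through three: ["Yay!", "Yaay!", "Yaaay!"]
```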

Or, ^ and $

The | metacharacter can be thought of as the "or" operator. a|b would match an "a" or a "b" (and is probably a bad way of saying [ab]). It is useful with groups: /(cat)|(dog)|(python)/. In /(cat)|(dog)|(python)/.match( 'python' ), $3 == 'python'. (In JavaScript, the regex's own matching method is named exec, so regex.exec( string ).) The first two match variables are nil in Ruby, empty strings in JavaScript.
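A Ruby sketch of the grouped "or" shows which group captured what. The helper name which_animal and the sample words are invented for the example; captures is a standard method on the object that match() returns.

```ruby
# (cat)|(dog)|(python) - group 1, group 2 or group 3, whichever is found
ANIMAL = /(cat)|(dog)|(python)/

# Hypothetical helper: the three captured groups, or nil when nothing matched.
def which_animal(text)
  found = ANIMAL.match(text)
  found && found.captures
end

p which_animal("python")   # [nil, nil, "python"]
p which_animal("hotdog")   # [nil, "dog", nil]
p which_animal("ferret")   # nil
```

Groups that took no part in the match come back as nil, which is why the first two slots are empty for "python".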

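The "^" character matches the beginning of the string; the $ matches the end of the string. (In Ruby both also match at the start and end of each line inside a multi-line string; \A and \z match only at the very start and very end.) If you were processing sentences, /.*\?$/ matches sentences that end with the "?" character. (In this example, escaping the "?" is necessary: unescaped, the "?" would be read as a metacharacter modifying the preceding .* rather than as a literal question mark. Every language with regex support has a much longer list of metacharacters than the short list presented here, so adding the backslash before punctuation you mean literally is a sensible precaution.)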
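The sentence test can be sketched in Ruby as follows. The method name and sample sentences are invented; the last line illustrates a Ruby-specific detail, namely that $ also matches just before a newline.

```ruby
# .*  - any characters, zero or more times
# \?  - a literal question mark
# $   - end of the line
QUESTION = /.*\?$/

# Hypothetical helper: true when the text ends a line with "?".
def question?(sentence)
  !!(sentence =~ QUESTION)
end

puts question?("Is the network slow?")   # true
puts question?("The network is slow.")   # false
puts question?("Why? Because.")          # false
puts question?("Really?\nYes.")          # true in Ruby: the first line ends with "?"
```

If only the very end of the string should count, Ruby's \z anchor can be used in place of $.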

Metacharacters

Metacharacters are characters that represent something other than themselves as characters. You've already seen most of them. The "\" in "\t" is a metacharacter. One set of metacharacters applies within character classes; another set applies outside character classes. You may precede any metacharacter with a backslash to use it as a literal: \/ is a forward slash, not an "end of regex" metacharacter.
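A brief Ruby sketch of escaping; the helper name menu_path and the sample strings are made up for the example.

```ruby
# \/ - a literal forward slash, not the end of the regex
# Hypothetical helper: split a command such as "Camera/Zoom Window" at the slashes.
def menu_path(command)
  command.split(/\//)
end

p menu_path("Camera/Zoom Window")   # ["Camera", "Zoom Window"]

# \. - a literal period; an unescaped . matches any character
p("v1.2" =~ /1\.2/)   # 1
p("v1x2" =~ /1\.2/)   # nil
p("v1x2" =~ /1.2/)    # 1

# Inside a character class the period is already a literal.
p("v1x2" =~ /1[.]2/)  # nil
```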