Ruby, like many other languages, contains a powerful
text-processing shortcut that looks like it was created by cats walking on
the keyboard. Regular expressions can be very difficult to read,
especially as they grow longer, but they offer tremendous power that’s
hard to re-create in Ruby code. As long as you stay within a modest subset
of regular expressions, you can get a lot done without confusing
anyone—yourself included—who’s trying to make sense out of your program
logic.

This excerpt is from Learning Rails.
Most Rails books are written for programmers looking for information on data structures. Learning Rails targets web developers whose programming experience is tied directly to the Web. Rather than begin with the inner layers of a Rails web application -- the models and controllers -- this unique book approaches Rails development from the outer layer: the application interface. You can start from the foundations of web
design you already know, and then move more deeply into Ruby, objects, and database structures.

What Regular Expressions Do

Regular expressions help your programs find chunks of text that
match patterns you specify. Depending on how you call the regular
expression, you may get:

A yes/no answer

Something matched or it didn’t

A set of matches

All of the pieces that matched your query, so you can sort
through them

A new string

If you specified that this was a search-and-replace
operation, you may have a new string with all of the replacements
made

Regular expressions also offer incredible flexibility in
specifying search terms. A key part of the reason that regular
expressions look so arcane is that they use symbols to specify different
kinds of matches, and matches on characters that aren’t easily
typed.

These samples all use regular expressions in their simplest
typical use case: testing to see whether a string contains a pattern.
Each of these will test :secret
against the expression specified by :with. If the pattern in :with matches, then validation passes. If not,
then validation fails and the :message will be returned. Removing the Rails
trim, the first of these could be stated roughly in Ruby as:

secret = "agent007"   # hypothetical value standing in for the :secret field

if secret =~ /[0-9]/
  # yes, it's there
else
  # no, it's not
end

The =~ operator is Ruby’s way of
declaring that the test compares the contents of the left
operand against the regular expression on the right side. It doesn’t
actually return true or false, though: it returns the numeric position
at which the first match begins if there is a match, and nil if there is none. You can treat it as a
boolean test, however, because nil always behaves as false in a boolean evaluation, and every value
other than nil and false (including a position of 0) behaves as
true.
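
As a minimal sketch of what =~ hands back (the variable names and sample strings are invented for illustration), note that a match at position 0 still counts as true:

```ruby
# Hypothetical sample values, used only for illustration.
secret = "pass1word"

position = secret =~ /[0-9]/        # index where the first digit begins
no_match = "password" =~ /[0-9]/    # nil, because there is no digit
at_start = "1password" =~ /[0-9]/   # 0, which is still truthy in Ruby

puts position.inspect                     # 4
puts no_match.inspect                     # nil
puts(at_start ? "matched" : "no match")   # matched
```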

Note

There isn’t room here to explain them, but if you need to do
more with regular expressions than just testing whether there’s a
match, you’ll be interested in the $~ variable (or Regexp.last_match), which gives you access
to more detail on the results of the matching. A variety of methods on
the String object, notably sub, gsub, and slice, also use regular expressions for
slicing and dicing. You can also retrieve the text captured by parentheses with $1 for the first group, $2 for the second, and so on; these variables
are set by the match.
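
The following sketch shows roughly how those pieces fit together. The sample sentence and variable names are assumptions made up for the example; the methods and variables themselves are the ones named above.

```ruby
# Hypothetical sample text, used only for illustration.
line = "Released 2008-11-24 by the publisher"

if line =~ /([0-9]+)-([0-9]+)-([0-9]+)/
  match = Regexp.last_match   # the same match data that $~ refers to
  whole = match[0]            # the entire matched text
  year  = $1                  # text captured by the first parentheses
  month = $2                  # text captured by the second parentheses
end

masked  = line.gsub(/[0-9]/, "#")   # replace every digit
first   = line.sub(/[0-9]/, "#")    # replace only the first digit
snippet = line.slice(/[0-9]+/)      # first run of digits

puts whole, year, month, masked, first, snippet
```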

There’s one other feature in these simple examples worth a little
more depth. Reading them, you might have thought that /[0-9]/ was a regular expression. It’s a
regular expression object, but the expression itself is [0-9]. Ruby uses the forward slash as a
delimiter for regular expressions, much like quotes are used for
strings. Unlike strings, though, you can add flags after the closing
slash, as you’ll see later.

If you’d prefer, you can also use Regexp.new to create regular expression
objects. (This usually makes sense if your code needs to meet changing
circumstances on the fly at runtime.)
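
As one possible illustration, the sketch below builds an expression from a string that is only known at runtime. The term variable is a hypothetical stand-in for a value that might come from a form or a configuration file.

```ruby
# Hypothetical search term that might only be known at runtime.
term = "ruby"

# Build the expression from a string; Regexp::IGNORECASE acts like the i flag.
pattern = Regexp.new(term, Regexp::IGNORECASE)

puts pattern.inspect               # /ruby/i
puts pattern.source                # ruby
puts("Learning Ruby" =~ pattern)   # 9
```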

The Simplest Expressions: Literal Strings

The simplest regular expressions are simply literal strings. There are
plenty of times when it’s enough to search against a fixed search
pattern. For example, you might test for the presence of the string
“Ruby”:

sentence = "Ruby is the best Ruby-like programming language."
sentence =~ /Ruby/
# => 0 - the first "Ruby" begins at position 0.
sentence.scan(/Ruby/).length
# => 2 - there are two instances of "Ruby".

Character Classes

Example C.1, “Validating data against regular expressions” tested against letters and numbers, but there are many
ways to do that. [a-z] is a good way
to test for lowercase letters in English, but many languages use
characters outside of that range. Regular expression character classes
let you create sets of characters as well as use predefined groups of
characters to identify what you want to target.

To create your own character class, use square brackets:
[ and ]. Within the square brackets, you can either
list the characters you want, or create a set of characters with the
hyphen. To match all the (guaranteed) English vowels in lowercase, you
would write:

/[aeiou]/

If you wanted to match both upper- and lowercase vowels, you could
write:

/[aeiouAEIOU]/

(If you wanted to ignore case entirely in your search, you could
also use the i modifier described
in the Modifiers section below: /[aeiou]/i.)

You can also mix character classes in with other parts of a
search:

/[Rr][aeiou]by/

That would match Ruby, ruby, raby,
roby, and a lot of other variations
with upper- or lowercase R, followed
by a lowercase vowel, followed by by.

Sometimes listing all the characters in a class is a hassle.
Regular expressions are difficult enough to read without huge chunks of
characters in classes. So instead of:

/[abcdefghijklmnopqrstuvwxyz]/

you can just write:

/[a-z]/

As long as the characters you want to match form a single range,
that’s simple—the hyphen just means “everything in between.”
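
To see the listed and ranged classes at work, here is a small sketch. The word list and variable names are made up for the example.

```ruby
# Hypothetical sample words, used only for illustration.
words = %w[Ruby ruby raby roby Rxby rugby]

# Upper- or lowercase R, one lowercase vowel, then "by".
matches = words.select { |word| word =~ /[Rr][aeiou]by/ }
# "Rxby" fails because x is not a vowel; "rugby" fails because
# g sits between the vowel and "by".

digit_at     = "Rails 2" =~ /[0-9]/   # 6: position of the first digit
lowercase_at = "RUBY" =~ /[a-z]/      # nil: no lowercase letters at all

puts matches.inspect
puts digit_at.inspect
puts lowercase_at.inspect
```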

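There’s also a “not” option available, in the ^ character. When ^ is the first character inside the brackets, the class matches anything except the characters listed. You can reverse /[aeiou]/ by writing:

/[^aeiou]/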

Escaping

Of course, even simple expressions run into a
problem: many of the characters you’ll want to test for have a special
meaning to the regular expression engine. The square brackets around
[0-9] are helpful for specifying that
it’s a set starting with zero and going to nine, but what if you’re
actually searching for square brackets?

Fortunately, you can “escape” any character that regular
expressions use for something else by putting a backslash in front of
it. An expression that looks for left square brackets would look like
\[. If you need to include a
backslash, just put a second backslash in front of it, as in \\.
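
A short sketch of escaping in practice follows. The sample text is invented. Regexp.escape is a standard Ruby method that is not covered in this appendix; it adds the backslashes for you when the text to match comes from somewhere else.

```ruby
# Hypothetical sample text; "\\" in a Ruby string is a single backslash.
text = "Use [0-9] for digits and C:\\temp for files"

bracket_at   = text =~ /\[/        # 4: the first literal [
backslash_at = text =~ /\\/        # 27: the first literal backslash
literal_at   = text =~ /\[0-9\]/   # 4: the characters "[0-9]" themselves

# Let Ruby do the escaping, then build an expression from the result.
escaped    = Regexp.escape("[0-9]")
escaped_at = text =~ Regexp.new(escaped)

puts bracket_at, backslash_at, literal_at, escaped_at
```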

Some characters, particularly whitespace characters, are also just difficult to
represent in a string without creating strange formatting. Table C.2, “Escapes for whitespace characters” shows how to escape them
for convenient matching.

Table C.2. Escapes for whitespace characters

Escape sequence    Meaning
\f                 Form feed character
\n                 Newline character
\r                 Carriage return character
\t                 Tab character
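
These escapes might be used along the following lines. The record string is a made-up example of tab-separated text.

```ruby
# Hypothetical tab-separated record ending in a carriage return and newline.
record = "name\tRuby\nversion\t1.8\r\n"

tab_at     = record =~ /\t/   # 4: the first tab
newline_at = record =~ /\n/   # 9: the first newline
has_return = !(record =~ /\r/).nil?   # true
has_feed   = !(record =~ /\f/).nil?   # false: no form feed present

puts tab_at, newline_at, has_return, has_feed
```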

Modifiers

Sometimes you want to be able to search for strings without regard
to case, and you don’t want to put a lot of effort into creating an
expression that covers every option. Other times you want to search
against a string that contains many lines of text, and you don’t want
the expression to stop at the first line. For these situations, where
the underlying rules change, Ruby supports modifiers, which you can put
at the end of the expression or specify through the Regexp object. A complete list of modifiers is
shown in Table C.3, “Regular expression modifier options”.

Table C.3. Regular expression modifier options

Modifier character    Effect
i                     Ignore case completely.
m                     Multiline matching: allow . to match newline characters, which it otherwise refuses to match, so a match can run past the end of a line.
x                     Use extended syntax, allowing whitespace and comments in expressions. (Probably not the first thing you want to try!)
o                     Only interpolate #{} expressions the first time the regular expression is evaluated. (Again, unlikely when starting out.)
u                     Treat the content of the regular expression as Unicode. (By default, it is treated as the same as the content it is tested against.)
e, s, n               Treat the content of the regular expression as EUC, SJIS, and ASCII, respectively, like u does for Unicode.

Of these, i and m are the only ones you’re likely to use at
the beginning. To use them in a regular expression literal, just add
them after the closing slash, as in /ruby/i.

If you want to use multiple options, you can. /ruby/iu specifies case-insensitive Unicode matching, for instance.
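
The sketch below tries the i, m, and x modifiers against an invented two-line string, so you can compare the results with and without each flag.

```ruby
# Hypothetical two-line sample text.
text = "Learning RUBY\non Rails"

case_sensitive   = text =~ /ruby/    # nil: the case differs
case_insensitive = text =~ /ruby/i   # 9

# Without m, the dot stops at the newline; with m, it can cross it.
single_line = text =~ /RUBY.on/    # nil
multi_line  = text =~ /RUBY.on/m   # 9

# x lets you spread an expression out and comment it.
spaced = text =~ /
  ruby   # the language name
  .      # any single character
  on     # the word that follows
/imx

puts case_sensitive.inspect, case_insensitive, single_line.inspect, multi_line, spaced
```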

Anchors

Sometimes you want a match to be meaningful only at an edge: the
start or the end, or maybe a word in the middle. You might even want to
define your own edge—something is important only when it’s next to
something else. Ruby’s regular expression engine lets you do all of
these things, as well as match only when your match is
not against an edge. Table C.4, “Regular expression anchors” lists common anchor
syntax.

Table C.4. Regular expression anchors

Syntax            Meaning
^                 Matches only at the start of the target, or at the start of any line within it. (In Ruby, ^ matches at line starts whether or not the m modifier is set.)
$                 Matches only at the end of the target, or at the end of any line within it. (In Ruby, $ matches at line ends whether or not the m modifier is set.)
\A                Matches only at the start of the target string, not at the start of lines within it.
\Z                Matches only at the end of the target string (or just before a final newline), not at the end of lines within it.
\b                Marks a word boundary: the edge between a word character and anything else, such as whitespace, punctuation, or the start or end of the string.
\B                Marks something that isn’t a boundary between words.
(?=expression)    Lets you define your own boundary, by limiting the match to things immediately followed by expression.
(?!expression)    Lets you define your own boundary, by limiting the match to things that are not immediately followed by expression.

These make a little more sense if you see them in action. For
example, if you only want to match “The” when it’s at the start of a
line, you could write:

/^The/

If you wanted to match “1991” when it’s at the end of a line, you
could write:

/1991$/

If the target string contains several lines, and you wanted to make sure these
matches apply only at the start or end of the whole string, you would write
them as:

/\AThe/
/1991\Z/

The \b anchor is really useful
when you want to match a word, not places where a sequence falls in the
middle of a word. For example, if you wanted to match “the” without
matching “Athens” or “Promethean,” you could write:

/\bthe\b/
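
A sketch of the anchors in action follows, using invented sample strings. It relies on the fact that Ruby’s ^ and $ match at line boundaries even without the m modifier, which is why \A and \Z behave differently on a two-line string.

```ruby
# Hypothetical two-line sample text.
text = "The year was 1990\nThen came 1991"

line_start   = text =~ /^The/     # 0
second_line  = text =~ /^Then/    # 18: ^ also matches after a newline
string_start = text =~ /\AThen/   # nil: "Then" does not start the string
line_end     = text =~ /1990$/    # 13: $ matches before the newline
string_end   = text =~ /1990\Z/   # nil: the string ends with 1991

whole_word = "Athens is the capital" =~ /\bthe\b/   # 10
inside     = "Athens is the capital" =~ /\Bthe\B/   # 1

# Lookahead checks what follows without including it in the match.
followed_by_2  = "Rails 1, Rails 2" =~ /Rails(?= 2)/   # 9
not_followed_1 = "Rails 1, Rails 2" =~ /Rails(?! 1)/   # 9

puts line_start, second_line, line_end, whole_word, inside
```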

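Alternately, if you wanted to match “the”
only when it was part of another word, you could
use \B to write:

/\Bthe\B/

That matches the “the” inside “Athens” or “Promethean,” where word characters sit on both sides, but not the standalone word.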

Sequences, Repetition, Groups, and Choices

Specifying a simple match pattern may take care of most of what you
need regular expressions for in Rails, but there are a few
additional pieces you should know about before moving on. Even if you
never write a pattern that needs these, knowing what they look like will
help you read other regular expressions when you encounter them.

There are three classic symbols that indicate whether an item is
optional or can repeat, plus a notation that lets you specify how much
something should repeat, as shown in Table C.5, “Options and repetition”.

Table C.5. Options and repetition

Syntax               Meaning
?                    The pattern right before it should appear 0 or 1 times.
*                    The pattern right before it should appear 0 or more times.
+                    The pattern right before it should appear 1 or more times.
{number}             The pattern before the opening curly brace should appear exactly number times.
{number,}            The pattern before the opening curly brace should appear at least number times.
{number1,number2}    The pattern before the opening curly brace should appear at least number1 times but no more than number2 times. (Don’t put a space after the comma.)

You might think you’re ready to go create expressions armed with
this knowledge, but you’ll find some unpleasant surprises. The regular
expression:

/1998+/

might look like it will match one or more instances of “1998”, but it
will actually match “199” followed by one or more instances of “8”. To
make it match a sequence of 1998s, you would write:

/(1998)+/

If you wanted to specify, say, two to five occurrences of 1998,
you’d write:

/(1998){2,5}/
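
The difference the parentheses make can be checked with a sketch like this one. The test strings are invented, and \A and \Z are added so that each pattern has to account for the whole string.

```ruby
# Without parentheses, + applies only to the final 8.
plus_on_eight = "19988"    =~ /\A1998+\Z/   # 0: "199" followed by two 8s
not_repeated  = "19981998" =~ /\A1998+\Z/   # nil

# With parentheses, + applies to the whole group.
grouped = "19981998" =~ /\A(1998)+\Z/       # 0

# Between two and five repetitions of the group.
too_few  = "1998"         =~ /\A(1998){2,5}\Z/   # nil
in_range = "199819981998" =~ /\A(1998){2,5}\Z/   # 0

# ? makes the item right before it optional.
optional = %w[color colour].all? { |word| word =~ /\Acolou?r\Z/ }   # true

puts plus_on_eight, not_repeated.inspect, grouped, too_few.inspect, in_range, optional
```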

The parentheses can also be helpful when specifying choices,
though for a slightly different reason. If you wanted to match, say,
2013 or 2014, you could use | to
write:

/2013|2014/

The | divides the whole
expression into complete expressions to its left or right, rather than
just grabbing the previous character, so you don’t need parentheses
around either 2013 or 2014. Nonetheless, if you wanted to do something
like match 2013, 2014, or 2017, you might not want to write:

/2013|2014|2017/

You could instead write something more like:

/201(3|4|7)/
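
Here is a small sketch comparing the two forms on an invented list of years. It also shows why parentheses start to matter once anchors are involved, because | splits everything around it.

```ruby
# Hypothetical sample data.
years = %w[2012 2013 2014 2015 2017]

long_form  = years.select { |year| year =~ /2013|2014|2017/ }
short_form = years.select { |year| year =~ /201(3|4|7)/ }
# Both select "2013", "2014", and "2017".

# Without parentheses, | splits the entire expression:
# /^2013|2014$/ means "starts with 2013" OR "ends with 2014".
loose = "2013-99" =~ /^2013|2014$/     # 0
tight = "2013-99" =~ /^(2013|2014)$/   # nil

puts long_form.inspect, short_form.inspect, loose, tight.inspect
```

For single characters like these, a character class such as /201[347]/ would select the same years.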

Note

Parentheses also “capture” matched text for later use, and that
capturing may determine how you structure parentheses. It’s probably
not the first place you’ll want to start, though.

Greed

There’s one last feature of the repetition operators that can
cause unexpected results: by default, they’re
greedy. This isn’t a question of computing virtue,
but rather one of how much content a regular expression can match at one
go. This is a common issue in things like HTML, where you might see
something like:

<a href="http://example.com">Example.com</a>

You might think you could match the HTML tags simply with an
expression like:

/<.*>/

But instead of matching the opening tag and closing tag
separately, that expression will grab everything from the opening
< to the closing > of </a>, because it can. If you want to
restrain a given expression so that it takes the smallest possible
matching bite, add a ? after any of
the repetition operators:

/<.*?>/

Greed matters more when you use regular expressions to extract
content from long strings, but it can yield confusing results even in
supposedly simple matching. If you have mysterious problems, greed is a
good thing to check for.
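
To make the difference concrete, this sketch runs both expressions against the link shown above, using slice and the standard String method scan to pull out what each one matched.

```ruby
html = '<a href="http://example.com">Example.com</a>'

greedy = html.slice(/<.*>/)    # everything from the first < to the last >
lazy   = html.slice(/<.*?>/)   # stops at the first > it can
tags   = html.scan(/<.*?>/)    # every non-greedy match in the string

puts greedy   # the entire string
puts lazy     # only the opening tag
puts tags.inspect
```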

More

Regular expressions have nearly infinite depth, and this appendix
has barely begun to scratch the surface, either of expressions or the
ways you can use them in Ruby and Rails. A few of the things this
incredibly brief guide hasn’t been able to include are:

Using expressions to fragment a string into smaller
pieces

Referencing earlier matches later in an expression

Creating named groups

Commenting regular expressions

A variety of special syntax forms using parentheses

Again, for a much more comprehensive guide to regular expressions,
see Jeffrey E. F. Friedl’s classic Mastering Regular
Expressions or Tony Stubblebine’s compact but extensive
Regular Expression Pocket Reference. For more on
using them specifically with Ruby, see The Ruby Programming
Language, by David Flanagan and Yukihiro Matsumoto
(O’Reilly).