Regular Expressions

Intro

In theoretical computer science, a regular expression is a sequence of characters that defines a search pattern. It's basically a fancy way of doing text searches, and it's very useful in combination with sed and awk.

Basic Usage

If you wanted to search through a long file looking for email addresses, you might do something like:

grep -E '[a-z]+@[a-z]+\.(com|org)' file.txt

That looks like someone's mashed their face on the keyboard, so let's break it down into separate components to make it easier to understand.

[a-z] a bracket expression that matches any single lowercase letter from a to z

+ this means the preceding element gets matched one or more times (so multiple letters)

@ matches a literal @, the character found in the middle of an email address

[a-z]+ same again, a sequence of one or more lowercase letters

\. this tells it to use the actual . character instead of using it as a metacharacter

(com|org) the parentheses get interpreted as a subexpression, in this case 'com' or 'org'
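To see the whole pattern at work, here is a minimal sketch run against a throwaway file. The addresses are made up for illustration, and the -o flag (print only the matched text) is not in POSIX, although GNU and BSD grep both accept it.

```shell
# Hypothetical sample data; none of these addresses come from a real file.
sample=$(mktemp)
printf '%s\n' \
  'contact: alice@example.com' \
  'no address on this line' \
  'bob@charity.org wrote back' \
  'carol@example.net is skipped (not com or org)' > "$sample"

# -E selects extended syntax; -o prints only the matched text, one match per line.
matches=$(grep -E -o '[a-z]+@[a-z]+\.(com|org)' "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

This should print alice@example.com and bob@charity.org. The .net address is dropped because the subexpression only allows com or org.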

This is obviously just an example to show you some features of regex. An RFC 822 compliant regex is unreadable. In practice I'd probably just do \w+@\w+\.\w+ (word@word.word).
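A quick comparison of the strict pattern and the looser word@word.word one, as a sketch. The input addresses are invented, and \w is an extension rather than POSIX ERE, so this assumes a grep that supports it (GNU grep does).

```shell
# Hypothetical addresses, one with uppercase letters, digits and an underscore.
addresses='Bob_99@Example.io
alice@example.com'

# Strict pattern: lowercase letters only, and only .com or .org.
strict=$(printf '%s\n' "$addresses" | grep -E -o '[a-z]+@[a-z]+\.(com|org)')

# Loose pattern: \w covers letters, digits and underscore.
loose=$(printf '%s\n' "$addresses" | grep -E -o '\w+@\w+\.\w+')

printf 'strict:\n%s\n' "$strict"
printf 'loose:\n%s\n' "$loose"
```

The strict pattern only finds alice@example.com, while the loose one finds both. The loose version is still not a validator: \w does not match dots or hyphens, so an address like first.last@example.com would only be matched from "last" onwards.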

BRE vs ERE

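This tutorial assumes you're using ERE (Extended Regular Expressions). Basic, or BRE, is mostly the same, but parentheses and braces only act as operators when you backslash them, i.e. \( \) and \{ \}, and POSIX BRE has no ?, + or | operators (GNU tools accept \?, \+ and \| as extensions). That's also why we used grep -E instead of plain grep, which defaults to BRE. Note that lowercase -e is a different option: it just introduces the pattern and has nothing to do with BRE.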
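The difference is easiest to see side by side. This is a sketch with a made-up input line; the BRE version swaps + for the interval \{1,\} and drops the alternation, since plain POSIX BRE has no equivalent for |.

```shell
line='alice@example.com'

# ERE: + and ( | ) are operators without any backslashes.
ere=$(printf '%s\n' "$line" | grep -E -o '[a-z]+@[a-z]+\.(com|org)')

# BRE (plain grep): \{1,\} stands in for +, and grouping needs \( \).
bre=$(printf '%s\n' "$line" | grep -o '[a-z]\{1,\}@[a-z]\{1,\}\.\(com\)')

# In BRE an unescaped + is just a literal plus sign, so 'a+b' matches the text a+b.
plus_bre=$(printf '%s\n' 'a+b' | grep -c 'a+b')

# In ERE the same pattern means "one or more a, then b", which a+b does not contain.
# grep exits non-zero when nothing matches, hence the || true.
plus_ere=$(printf '%s\n' 'a+b' | grep -E -c 'a+b' || true)

printf 'ere=%s bre=%s plus_bre=%s plus_ere=%s\n' "$ere" "$bre" "$plus_bre" "$plus_ere"
```

Both ere and bre come out as alice@example.com, and the two counts are 1 for BRE and 0 for ERE.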

List of Metacharacters

Metacharacter    Description

.        Matches any single character (whether this includes newlines sometimes depends on the application)

[ ]      A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches “a”, “b”, or “c”. [a-z] specifies a range which matches any lowercase letter from “a” to “z”.

[^ ]     Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than “a”, “b”, or “c”. [^a-z] matches any single character that is not a lowercase letter from “a” to “z”.

^        Matches the starting position within the string. In line-based tools, it matches the starting position of any line.

$        Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.

( )      Defines a marked subexpression. The string that gets matched in the parentheses can be recalled later but that's a bit more advanced.

|        The choice (also known as alternation or set union) operator matches either the expression before or the expression after the operator. For example, abc|def matches “abc” or “def”.
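The metacharacters above can be tried out one at a time. This is a rough sketch against an invented word list; the variable names are only for illustration, and sed -E is assumed to be available (GNU and BSD sed both have it).

```shell
# Hypothetical word list, one word per line.
words='cat
cot
cut
coat
scatter
dog'

# .  any single character between c and t
dot=$(printf '%s\n' "$words" | grep -E '^c.t$')

# [ ]  only a or o between c and t
bracket=$(printf '%s\n' "$words" | grep -E '^c[ao]t$')

# [^ ]  anything except a between c and t
negated=$(printf '%s\n' "$words" | grep -E '^c[^a]t$')

# Without ^ and $ the pattern may match anywhere in the line.
anywhere=$(printf '%s\n' "$words" | grep -E 'cat')

# |  alternation inside a subexpression, anchored to the whole line
either=$(printf '%s\n' "$words" | grep -E '^(coat|dog)$')

# ( )  recalling the matched subexpressions as \1 and \2 in sed
recalled=$(printf '%s\n' 'alice@example.com' | sed -E 's/^([a-z]+)@(.*)$/user=\1 host=\2/')

printf '%s\n' "$dot" "$bracket" "$negated" "$anywhere" "$either" "$recalled"
```

Here the dot matches cat, cot and cut, the bracket expression narrows that to cat and cot, and the negated one leaves cot and cut. The unanchored cat also picks up scatter, and the sed line prints user=alice host=example.com.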