1. Introduction

Learning and understanding Regular Expressions may not be as straightforward as learning the ls command. However, learning Regular Expressions and using them effectively in your daily work will reward the effort with greater efficiency and time savings. Regular Expressions are a topic that can easily fill an entire 1000-page book. In this article, we only try to explain the basics of Regular Expressions in a concise, non-geeky and example-driven manner. So if you have ever wanted to learn the basics of Regular Expressions, now is your chance.

The intention of this tutorial is to cover the fundamental core of Basic Regular Expressions and Extended Regular Expressions. For this, we will use a single tool, the GNU grep command. The GNU/Linux operating system and its grep command recognize three different types of Regular Expressions:

Basic Regular Expressions (BRE)

Extended Regular Expressions (ERE)

Perl-compatible Regular Expressions (PCRE)

The difference between Basic Regular Expressions and Extended Regular Expressions will be explained momentarily.

2. What is a Regular Expression

A Regular Expression provides the ability to match a "string of text" in a very flexible and concise manner, where a "string of text" can be a single character, a word, a sentence or a particular pattern of characters. Well-known abbreviations for "Regular Expression" include regex and regexp.

3. Simple Regular Expression example

The simplest building block of any regular expression is a character. We can use grep to search for any particular character within the text of any non-binary file. For example, here is the content of our regex.txt sample file:

$ cat regex.txt
grep stands for:
global
regular
expression
print

Now we can use grep to search for any character by providing it with a regular expression. Let's use grep to search for a character "e":

$ grep e regex.txt
grep stands for:
regular
expression

As you can see from the example above, grep printed all lines containing at least one "e" character. We can now combine multiple characters to form the string "regu" and use grep to search for that string in the text:

$ grep regu regex.txt
regular

To unleash the real power of regular expressions, though, we need to form a regular expression from non-alphabetic characters (meta-characters) or from a combination of alphabetic and non-alphabetic characters. For example, what if you want to find all lines which begin with the character "g"? For this we can use the caret symbol "^":

$ grep ^g regex.txt
grep stands for:
global

This was just a basic example of a more sophisticated regular expression. In this article, we will explain techniques like the one above in more detail.

4. Concatenation

As the preceding examples show, the simplest regular expression can consist of an individual character: a regular expression consisting of a single non-special character matches any string containing that character. Regular Expressions can also be concatenated. This means that a sequence of characters such as "press" matches any string containing a substring formed by concatenating the regular expressions "p", "r", "e", "s" and "s".
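A minimal sketch of concatenation in practice. The file name `concat.txt` and its contents are made up for illustration; only lines that contain the five characters in exactly this order, with nothing in between, should be printed:

```shell
# Hypothetical sample file, not the article's regex.txt
printf '%s\n' "expression" "pressure" "persist" "p r e s s" > concat.txt

# "press" is the concatenation of the regexes p, r, e, s and s
grep "press" concat.txt
# prints:
# expression
# pressure
```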

5. Basic vs Extended Regular Expressions

GNU grep understands both basic and extended regular expressions. The prime difference is that in basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning. To give these meta-characters their special meaning, they need to be escaped with a backslash character. Consider the following example:

The grep command assumes a basic regular expression by default. Therefore, the following command prints only the first line, because that line contains the literal substring "n|p":

$ grep "n|p" regex.txt
global|regular|expression|print

The "|" alternation operator has its own special meaning, and that is logical OR. However, this special meaning was suppressed in the previous example, since grep by default treats any regular expression as a basic regular expression. To make grep read extended regular expressions, we need to use the -E option or simply use egrep instead of grep.
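As a sketch of that switch, assume a two-line sample file. The first line is the one shown in the output above, while the second line and the file name `bre-ere.txt` are hypothetical:

```shell
# First line as in the output above; the second line is a made-up stand-in
printf '%s\n' "global|regular|expression|print" "grep stands for" > bre-ere.txt

# BRE (default): "|" is a literal character, so only the first line matches
grep "n|p" bre-ere.txt

# ERE: "|" means OR, so every line containing "n" or "p" matches (both lines)
grep -E "n|p" bre-ere.txt
```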

In the preceding example, we used grep with an extended regular expression, and thus it displays both lines, since each contains an "n" OR a "p" character. As said previously, the meta-characters lose their special meaning in basic regular expressions unless they are escaped with the "\" character. Let's re-use our first example, but this time we escape the "|" character:
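A sketch of the escaped form, again assuming a hypothetical two-line file (`escaped.txt`, with a made-up second line):

```shell
# First line taken from the article's output; the second line is an assumption
printf '%s\n' "global|regular|expression|print" "grep stands for" > escaped.txt

# GNU grep in BRE mode: the backslash turns "|" into the OR operator,
# so both lines are printed
grep "n\|p" escaped.txt
```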

In this case the alternation operator "|" retains its special meaning and acts as a logical OR even though we did not use the -E option or egrep.

We also said that when using egrep or the -E option, grep expects Extended Regular Expressions. Because of that, if you escape a meta-character in an extended regular expression context, it loses its special meaning and behaves as a literal character, here "|". If you have followed along so far, you will notice that this is the exact opposite of basic regular expressions.

Example:

$ egrep "n\|p" regex.txt
global|regular|expression|print

6. Bracket Expressions

Now that we are acquainted with the basics of regular expressions, we can explore their more powerful and more complex features. The first stop is the use of "[" and "]", known as "Bracket Expressions". A bracket expression matches any single character from the list enclosed by "[" and "]". Let's wrap the letter "e" with "[]" and see what happens:
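The exact command is an assumption here; one form consistent with the description is `[e]xpression`, run against a hypothetical one-line file `bracket.txt`:

```shell
# One-line sample file (the file name is made up for this sketch)
printf '%s\n' "global|regular|expression|print" > bracket.txt

# [e] lists a single character, so it behaves exactly like a plain "e"
grep "[e]xpression" bracket.txt
# prints:
# global|regular|expression|print
```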

As you can see, nothing unusual happened here. Our regular expression merely matched the keyword "expression", and grep therefore printed the respective line. For that reason, the following regular expression does the same trick:

$ grep expression regex.txt
global|regular|expression|print

The power of a Bracket Expression shows when the "[]" list holds several characters and you want to match any single one of them. This is demonstrated in the following example:

Can you think of a way to formulate an alternative to the above regular expression without using "[ ]"? Such a technique has already been shown earlier!
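A small sketch of such a list, using a hypothetical word file `list.txt` (neither the file nor the pattern comes from the original example):

```shell
# Hypothetical word list, one word per line
printf '%s\n' "grep" "grip" "grop" "grp" > list.txt

# [ei] matches exactly one character, which must be "e" or "i"
grep "gr[ei]p" list.txt
# prints:
# grep
# grip
```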

Using a Bracket Expression it is also possible to express a logical NOT. For this we place a caret symbol "^" directly after the opening bracket. In the following example, we use a regular expression to print all lines containing at least one character other than "a" and "c".

$ cat regex.txt
a
b
c
d
$ grep "[^ac]" regex.txt
b
d

6.1. Expression Range

A bracket expression also allows you to specify an expression range. An expression range consists of two characters separated by a hyphen. This means that instead of [0123456789] we can simply use [0-9], or instead of [abc] we can use [a-c]. This is illustrated in the following regex example:

$ cat regex.txt
a
b
c
d
$ grep "[^a-c]" regex.txt
d
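The same shorthand works for digits. The sketch below uses a hypothetical file `range.txt` and also shows that a negated range selects any line with at least one character outside the range, even if the line contains characters inside the range as well:

```shell
# Hypothetical sample file
printf '%s\n' "room 7" "42" "none" > range.txt

# [0-9] is shorthand for [0123456789]
grep "[0-9]" range.txt
# prints:
# room 7
# 42

# Negated range: lines holding at least one non-digit character
grep "[^0-9]" range.txt
# prints:
# room 7
# none
```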

6.2. Character Classes

What follows are pre-defined classes for you to use within bracket expressions.

[:alnum:] - Alphanumeric characters

[:alpha:] - Alphabetic characters

[:cntrl:] - Control characters.

[:digit:] - Digits: 0 1 2 3 4 5 6 7 8 9.

[:graph:] - Graphical characters

[:lower:] - Lower-case letters

[:print:] - Printable characters

[:punct:] - Punctuation characters

[:space:] - Space characters

[:upper:] - Upper-case letters

[:xdigit:] - Hexadecimal digits

In the following regular expression example, we will use [:lower:] and [:space:] to print only lines which contain at least one lower-case letter or space:
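A sketch of such a command, assuming a hypothetical four-line file `classes.txt`:

```shell
# Hypothetical sample file
printf '%s\n' "ABC" "abc" "A B" "123" > classes.txt

# Class names are only valid inside a bracket expression,
# hence the outer pair of brackets around both classes
grep "[[:lower:][:space:]]" classes.txt
# prints:
# abc
# A B
```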

As an opposite example we can use regex anchoring to find all lines ending with ftp:

$ grep ftp$ /etc/services
zope-ftp 8021/tcp

NOTE: Do not confuse the anchor meaning of the caret ^ with the caret used inside a bracket expression; the two have quite distinct meanings in their respective contexts.

8. The Backslash Character and Special Expressions

There are numerous system tools, including grep, which support "Special Expressions" for matching word boundaries. Here are some Special Expression symbols supported by grep and many other system utilities:

\< - match empty string at the beginning of the word

\> - match empty string at the end of the word

\b - match empty string at the edge of a word (its beginning or its end)

\B - match empty string provided it is not at the edge of a word

Let's start with \< which matches the empty string at the beginning of a word. Here is our test file:
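The file below is a hypothetical stand-in (`words.txt`), and the commands sketch how each of the four symbols behaves in GNU grep:

```shell
# Hypothetical sample file; "/" is not a word character
printf '%s\n' "linux" "gnu/linux" "linuxcareer" "mylinux" > words.txt

# \< : "linux" must start a word
grep "\<linux" words.txt      # linux, gnu/linux, linuxcareer

# \> : "linux" must end a word
grep "linux\>" words.txt      # linux, gnu/linux, mylinux

# \b on both sides: "linux" must be a whole word
grep "\blinux\b" words.txt    # linux, gnu/linux

# \B : "linux" must NOT start at a word edge
grep "\Blinux" words.txt      # mylinux
```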

As described in the table above, the "?" quantifier matches the preceding item at most once; in other words, it makes the previous item optional. The previous item in our case is the character "s". Therefore, grep matched only strings with zero or one "s" followed by the string "ions". The next quantifier we are going to look at is "*", which by definition matches the previous item zero or more times.

As illustrated above, the "*" quantifier matches all strings in our test file. If you wonder why it also matched "Expreions", keep in mind that the "*" quantifier makes the preceding item optional, as opposed to the "+" quantifier, which must match the preceding item at least once:
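The three quantifiers can be compared side by side in a small sketch. The file `quant.txt` is hypothetical and is generated so that its four lines contain 0, 1, 2 and 3 "s" characters between "Expre" and "ions":

```shell
# printf reuses the format for every argument, producing one line
# per argument with 0, 1, 2 and 3 "s" characters respectively
printf 'Expre%sions\n' "" s ss sss > quant.txt

# "?" : zero or one "s"  -> the lines with 0 and 1 "s"
grep -E "Expres?ions" quant.txt

# "*" : zero or more "s" -> all four lines
grep -E "Expres*ions" quant.txt

# "+" : one or more "s"  -> the lines with 1, 2 and 3 "s"
grep -E "Expres+ions" quant.txt
```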

With the "{n}" quantifier you can specify precisely how many times the previous item must be matched. For example, our:

$ grep -E "Expres{3}ions" regex.txt
Expresssions

command will match a string which starts with "Expre", followed by exactly three "s" characters, followed by "ions". To stretch the previous regular expression further, "{n,}" lets us specify the minimum number of times the preceding item must be matched. As a result, the "{3,}" repetition matches 3 or more times:

$ grep -E "Expres{3,}ions" regex.txt
Expressssssions
Expresssions

To extend the above regular expression even further, we can specify a range. Therefore, we replace "{3,}" with "{1,3}" and the following regex would match:

since the previous item "s" is matched at the minimum once but no more than three times.
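A sketch of the "{1,3}" range, using a hypothetical file `interval.txt` whose lines contain 0, 1, 3 and 4 "s" characters:

```shell
# One line per argument: 0, 1, 3 and 4 "s" characters
printf 'Expre%sions\n' "" s sss ssss > interval.txt

# {1,3}: at least one "s" but no more than three
grep -E "Expres{1,3}ions" interval.txt
# prints:
# Expresions
# Expresssions
```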

10. Alternation

You can think of regex alternation as a logical OR operation, where regular expressions are joined together by one or more "|" alternation operators. The resulting regular expression matches any string that matches at least one of the alternatives.
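A short sketch with three alternatives, one of which matches nothing. The file `alt.txt` is hypothetical:

```shell
# Hypothetical sample file
printf '%s\n' "global" "regular" "expression" "print" > alt.txt

# A line is printed as soon as any one alternative matches;
# "xyz" occurs nowhere and simply contributes no lines
grep -E "glob|print|xyz" alt.txt
# prints:
# global
# print
```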

11. Precedence

When forming expressions, there is another property of Regular Expressions to consider, and that is precedence. As with arithmetic, regular expressions follow a predefined precedence. "Repetition" has the highest precedence, followed by "Concatenation", and the lowest precedence belongs to "Alternation". Consider the following example:

$ cat regex.txt
regex
regexxx
$ grep -E "regex{3}" regex.txt
regexxx

In the above regular expression, we can see both Concatenation "regex" and Repetition "x{3}". Since repetition has higher precedence, the quantifier applies only to "x", and the regular expression matches "regexxx" but not "regex". Another example where precedence needs to be taken into account is the alternation operator "|", which has the lowest precedence of all regular expression operators. Consider the following example:
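The pattern discussed in the next paragraph can be sketched as follows. The file `prec.txt` and its contents are hypothetical, and the pattern is inferred from that description:

```shell
# Hypothetical sample file
printf '%s\n' "regular" "expressions" "regularly" "my expressions" "irregular" > prec.txt

# Read as: (^regular) OR (expressions$)
grep -E "^regular|expressions$" prec.txt
# prints:
# regular
# expressions
# regularly
# my expressions
```

Here "regularly" and "my expressions" are printed because each anchor belongs to only one side of the "|".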

Since the alternation operator "|" has the lowest precedence, each side of it is a complete concatenated expression. In our case, these are "regular" with the anchor "^" and "expressions" with the end-of-line anchor "$". To change the order of evaluation we need to use "()". In the following example, we use "()" to group the alternation so that it is evaluated first, which makes a noticeable difference:

$ grep -E "^(regular|expressions)$" regex.txt
regular
expressions

In this example, the alternation is evaluated first because it is grouped into a subexpression by "()". As a result, the above regular expression matches only lines that consist entirely of "regular" or of "expressions", that is, "^regular$" OR "^expressions$".

12. Back References and Subexpressions

Any part of a regular expression enclosed in "()" creates a subexpression, and the text it matched can be used as a back reference later in the same regular expression. This is illustrated by the following example:
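A sketch of a single back reference, assuming a hypothetical two-line file `backref.txt`. The pattern is a guess at the kind of example intended here:

```shell
# Hypothetical sample file
printf '%s\n' "regular expressions" "regular explosions" > backref.txt

# (re) captures "re"; \1 must then match that same text again,
# so "exp\1ssions" is equivalent to "expressions"
grep -E "(re)gular exp\1ssions" backref.txt
# prints:
# regular expressions
```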

The subexpression "re" is reused later in the regular expression through the back reference \1. The number "n" in a back reference "\n" refers to the n-th subexpression, counted in the order in which the subexpressions were opened:

$ grep -E "(r)(e)gular \2xp\1\2ssions" regex.txt
regular expressions

13. Conclusion

Regular expressions are a very powerful tool in the hands of any system admin, programmer (BASH, PHP, C#, Java and many more) or casual Linux/Unix command line user. This article attempted to describe, in simple, consistent and plain English, the basics of Regular Expressions upon which you can further develop your Regular Expressions skills and thus save yourself from the tedious work which text processing can sometimes involve.

As already mentioned, this article only scratches the surface of Regular Expressions. To explore more I will list some great on-line Regular Expressions resources:

Author: Lubos Rendek

In the past I have worked for various companies as a Linux system administrator. Linux system has become my passion and obsession. I love to explore what Linux & GNU/Linux operating system has to offer and share that knowledge with everyone without obligations.
