Mastering Regular Expressions

by Trevor Page on February 13, 2013

In this episode of the How to Program with Java podcast, I will be covering the topic of Java Regular Expressions (regex). This topic is one that I have been avoiding because I really dislike regex. I avoided it, not because it’s not useful, but because I was afraid I wouldn’t be able to teach it. I mostly felt this way because I lacked both experience and knowledge on the subject, but a reader sent in a request via email asking me to talk about regular expressions. So I took a few days to dive deep into the subject and learn everything I could. Now it’s time for me to teach you about mastering regular expressions!

What is a Regular Expression?

The term regular expression describes a pattern that is used for searching through a String. That’s really all regex is meant to do: search through a String for something and then tell you:

If it found it

Where it found it

Simple right? So let’s take a look at how it goes about doing this.

Regex uses two main Objects to carry out its magic, the Pattern class and the Matcher class. The Pattern class is used to identify what it is you’re looking for. The Matcher class is how you actually go about looking for it. So again, Pattern is what you’re looking for and Matcher is how you look for it. So let’s talk about an example: let’s say I have a simple String “Trevor Page”, and I want to see if the occurrence of the String “Trevor” can be found. Obviously when things are this simple, it’s easy for our brains to answer the question without the need of a computer, but don’t worry, things will get more confusing later on!

If we run this “question” through our Pattern and Matcher code, we’ll get the following output:

Found a match starting at index 0 and ending at index 6.

So let’s dissect that output. Obviously it makes sense that the String “Trevor” is found in the String “Trevor Page”, but why does it say that it found it at index 0 through 6? Well, you should know why it says it starts at 0 right? Because it’s a zero based indexing system (just like everything else in Java), but why did it say it ends at 6? The word “Trevor” is only 6 letters in length, and if we’re starting at index 0, then shouldn’t it end at index 5?

0->5 = 6 letters
0->6 = 7 letters

right?

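Well you are right, but this regular expression stuff reports a zero based range that is inclusive of the starting character and exclusive of the ending character. This is why the ending index is one past the last matched character. It's the same convention that String.substring() uses, and it means the length of the match is simply the end minus the start (6 - 0 = 6). Confusing, but that’s just how they designed it. Silly Java people!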
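Here is a minimal sketch of what that Pattern and Matcher code could look like. The class and method names (FirstMatchDemo, describeFirstMatch) are just names I made up for illustration; Pattern.compile(), matcher(), find(), start() and end() are the standard java.util.regex calls.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstMatchDemo {

    // Describes the first match of regex inside input, or says none was found.
    static String describeFirstMatch(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);  // WHAT we are looking for
        Matcher matcher = pattern.matcher(input);  // HOW we look for it
        if (matcher.find()) {
            // start() is inclusive, end() is exclusive
            return "Found a match starting at index " + matcher.start()
                    + " and ending at index " + matcher.end() + ".";
        }
        return "No match found.";
    }

    public static void main(String[] args) {
        System.out.println(describeFirstMatch("Trevor", "Trevor Page"));
        // prints: Found a match starting at index 0 and ending at index 6.
    }
}
```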

So how is Regex different from using the indexOf() method?

For those of you who are experienced with Java already, you may be asking yourself this question. We already have the means to search through a String for another matching String. It’s called the indexOf() method, and it seems to be easier to use than regex. Here’s your answer: a single call to indexOf() stops at the first occurrence and can only look for literal text. A regular expression search can keep going through the entire String and tell you about every single match that occurs (including the starting and ending indexes), and the thing it searches for can be a flexible pattern rather than one fixed String.
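To make the difference concrete, here is a small sketch that calls find() in a loop to collect every match, next to a single indexOf() call. The names AllMatchesDemo and findAllSpans, and the sample String, are my own for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllMatchesDemo {

    // Collects a "start-end" pair for every match of regex inside input.
    static List<String> findAllSpans(String regex, String input) {
        List<String> spans = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        // Each call to find() resumes where the previous match ended.
        while (matcher.find()) {
            spans.add(matcher.start() + "-" + matcher.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        String input = "Trevor Page, Trevor Page";
        System.out.println(input.indexOf("Trevor"));        // 0 (first occurrence only)
        System.out.println(findAllSpans("Trevor", input));  // [0-6, 13-19]
    }
}
```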

Also, mastering regular expressions means you will have to learn about all of the advanced searching features that exist with regex in Java. So how about we start talking about those topics?

Metacharacters – AKA Wildcards

You may or may not be familiar with the concept of wildcards; if you’re not, I can tell you exactly what they are. Let’s think of our example of the String “Trevor Page”, and let’s say we wanted to search for any occurrences of “Trev”, “Trevor” or “Trevor’s”. How would we do this? You might be tempted to reach for the asterisk (*), the way you would with file name wildcards, and simply look for “Trev*”. A search with that regex does find a match in “Trev Page”, “Trevor Page” and “Trevor’s Page”, but only because every one of them contains “Trev”.

The caveat is that in regex the asterisk is not a wildcard on its own. It’s a quantifier that means “zero or more of whatever immediately precedes it”. So since I have the letter “v” before the asterisk, “Trev*” really means “Tre” followed by zero or more “v” characters. It will match “Tre”, “Trev” and “Trevvv”, but never the “or” in “Trevor”. If I had a class of digits and then an asterisk (like [0-9]*), it would be looking for a repetition of digits. The character that matches (almost) anything is the dot, so to truly match anything you would need to use both the dot and asterisk together like so: .*
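A quick way to see what the asterisk really does is to test a few whole Strings against “Trev*” and “Trev.*”. This is only a sketch; AsteriskDemo and matchesWhole are names I picked for the example, while Pattern.matches() is the standard call that checks whether the ENTIRE String matches the regex.

```java
import java.util.regex.Pattern;

class AsteriskDemo {

    // True only when the whole input (not just a piece of it) matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // "Trev*" is "Tre" followed by zero or more 'v' characters
        System.out.println(matchesWhole("Trev*", "Tre"));        // true
        System.out.println(matchesWhole("Trev*", "Trevvv"));     // true
        System.out.println(matchesWhole("Trev*", "Trevor"));     // false
        // "Trev.*" is "Trev" followed by zero or more of any character
        System.out.println(matchesWhole("Trev.*", "Trevor"));    // true
        System.out.println(matchesWhole("Trev.*", "Trevor's"));  // true
    }
}
```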

This metacharacter matching is where the real power of regular expressions comes from. You’ve seen the asterisk (*), but what other special characters (or metacharacters) can we use with regex? Well here’s the list:

The metacharacters supported by Java regular expressions are:

<([{\^-=$!|]})?*+.>

That’s a heck of a lot of characters, so let’s look at some examples of how they’re used, shall we?

Some of the more commonly used metacharacters are the square brackets []. These are used to group regular text characters (or numbers) together into a set, where any one member of the set counts as a match. These “sets” or “groups” of characters (or numbers) are known as character classes (not to be confused with Java classes). Let’s say you wish to search a String for the occurrence of a few specific letters; you would structure your regular expression like so:

[abc]

This will find matches for the letters: a, b or c. So if you were given the sentence “Hello World!”, it would match precisely ZERO occurrences because the letters a, b or c don’t exist in that String. If you had the String “How are you today?” and you applied the [abc] regular expression, you WOULD get a match. Two matches to be exact, because in the sentence “How are you today?” there are two occurrences of the letter “a” (one in “are” and one in “today”). How exciting!
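If you’d like to verify those counts yourself, a sketch like the following will do it. CharacterClassDemo and countMatches are illustrative names of my own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharacterClassDemo {

    // Counts how many separate matches of regex occur inside input.
    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("[abc]", "Hello World!"));        // 0
        System.out.println(countMatches("[abc]", "How are you today?"));  // 2
    }
}
```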

Now if you want to get more fancy with this stuff, we could move onto using a more complex regular expression. Try this one on for size:

[bch]at

Any guesses on what matches you would get from this regex? It may not be obvious at first, but this regex can match exactly three words: bat, cat and hat. You see why? The first three letters are encased in those square brackets, so they form a character class where Java matches any one of the letters b, c or h. Then the regex continues with the String literal “at”. So what do you get when you combine the letters b, c or h with the String “at”? You get: bat, cat or hat.

Just gripping stuff, really! How about this regex:

[^bch]at

Any guesses on what this would match? If you’re mastering regular expressions, then you might be able to guess. The only difference between this regex and the one before is one symbol, the caret (^). Inside square brackets this is just a negating symbol, which means it works just like the exclamation mark in Java conditional statements. You would read this regex as saying: any single character other than b, c or h, followed by the String “at”. One example of a matching String here would be the word “rat”. This meets the requirements because the “r” at the beginning of the word “rat” is NOT a b, c or h, and it has the String “at” attached to it. Make sense?
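Here is a small sketch that runs both the [bch]at and [^bch]at regexes over the same sample String so you can compare them side by side. The class name, the findAll helper and the sample text are all my own inventions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegationDemo {

    // Returns the text of every match of regex inside input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());  // group() is the matched text itself
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "cat sat hat bat rat";
        System.out.println(findAll("[bch]at", input));   // [cat, hat, bat]
        System.out.println(findAll("[^bch]at", input));  // [sat, rat]
    }
}
```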

Let’s expand a little bit on this concept.

Regex Ranges

A range in regular expressions is defined by using the hyphen (-). The best way for me to demonstrate this is with an example, consider this regex:

[b-d]at

The introduction of the hyphen (-) between the “b” and “d” characters indicates a range. This means that all letters between “b” and “d” in the alphabet will match (ranges are inclusive of their beginning and ending characters). So this particular regex will match the following words: bat, cat, dat.
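As a rough sketch, you could filter a list of words down to the ones that fit the range. RangeDemo, keepMatching and the word list are made up for this example; I compile the Pattern once and reuse it for every word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RangeDemo {

    // Compiled once, then reused for each word we test.
    private static final Pattern B_TO_D_AT = Pattern.compile("[b-d]at");

    // Keeps only the words that match the regex from start to finish.
    static List<String> keepMatching(List<String> words) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (B_TO_D_AT.matcher(word).matches()) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("bat", "cat", "dat", "eat", "rat");
        System.out.println(keepMatching(words));  // [bat, cat, dat]
    }
}
```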

Ranges can also be used with numbers, or even with numbers in combination with letters. Let’s say you’re interested in searching through some file names, and you want to see if there are any that have the words “file1”, “file2”, “file3” or “files1”, “files2”, “files3”. What would the regex be to match those names?

file[1-3]|files[1-3]

Here you see that we’ve made use of the hyphen to define a range of numbers from 1 to 3 and we’ve also used the “OR” operator (|) to choose between either the word “file” or the word “files”.
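Here’s a quick sketch for trying that regex against a handful of file names. The class name, the method name and the sample names are assumptions of mine; the regex itself is the one from above.

```java
import java.util.regex.Pattern;

class FileNameDemo {

    // True when name is exactly "file" or "files" followed by 1, 2 or 3.
    static boolean isNumberedFile(String name) {
        return Pattern.matches("file[1-3]|files[1-3]", name);
    }

    public static void main(String[] args) {
        System.out.println(isNumberedFile("file1"));   // true
        System.out.println(isNumberedFile("files3"));  // true
        System.out.println(isNumberedFile("file4"));   // false, 4 is outside 1-3
        System.out.println(isNumberedFile("filex1"));  // false, "x" is not allowed
    }
}
```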

Predefined Character Classes

In order to master regular expressions, you’ll need to know about predefined character classes. You can think of these things as shortcuts; they’ll save you from typing a few extra characters. Let me show you what I mean. Let’s say you want to determine if a String contains JUST numeric characters, absolutely NO letters. How would you do this? Well you could do something like:

[a-zA-Z]

This will search your String to see if there are any occurrences of a letter. If this finds anything, then we can say without doubt that your String is not strictly numeric. But look at all those characters you had to type out to convey this desire… just staggering really, 8 whole characters… what a waste of precious finger dexterity. Now let’s use a shortcut:

\D

There you have it, you just saved yourself from typing out an additional 6 characters. Don’t your fingers feel rested? (Strictly speaking, \D is also a tougher check: it flags anything that isn’t a digit, including spaces and punctuation, not just letters.) Now, my sarcasm here may be well founded, but when you really get to mastering regular expressions, you’ll see that these shortcuts do save you a lot of time and effort. There are six predefined character classes that you’ll use all the time: \d for detecting digits, \s for detecting whitespace and \w for detecting word characters (letters, digits and the underscore), then you just capitalize each one of those to look for the OPPOSITE. In other words, since \d looks for digits, \D looks for non-digits. Since \s looks for whitespace characters, \S looks for non-whitespace characters, and it goes without saying that since \w looks for word characters, \W looks for non-word characters.
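One thing to watch out for when you type these into Java code: the backslash is also the escape character inside a Java String literal, so \D has to be written as "\\D". Here is a small sketch of the “is there anything that isn’t a digit?” check; the class and method names are my own.

```java
import java.util.regex.Pattern;

class PredefinedClassDemo {

    // "\\D" in Java source is the two-character regex \D (any non-digit).
    private static final Pattern NON_DIGIT = Pattern.compile("\\D");

    // True when input contains at least one character that is not a digit.
    static boolean containsNonDigit(String input) {
        return NON_DIGIT.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsNonDigit("12345"));  // false
        System.out.println(containsNonDigit("123a5"));  // true
        System.out.println(containsNonDigit("12 45"));  // true, a space is a non-digit too

        // \w covers letters, digits and the underscore
        System.out.println(Pattern.matches("\\w", "7"));  // true
        System.out.println(Pattern.matches("\\w", "_"));  // true
        System.out.println(Pattern.matches("\\w", "-"));  // false
    }
}
```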

I’ll leave you with this problem to try out for yourself. In my home country of Canada, we have something called a social insurance number (SIN). This is just like the American social security number (SSN), but it has a slightly different arrangement of numbers. Here’s what a Canadian SIN looks like:

123-456-789

Can you create a regular expression that will properly verify if a given SIN is in the proper format? To be explicit, the proper format is:

any 3 digits followed by a hyphen, followed by any three digits, followed by a hyphen, followed by any three digits and then NO MORE characters at all. Give it a shot and you’ll be on your way to mastering regular expressions!
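Once you’ve had a go at it yourself, here is one possible answer to compare against. It’s only a sketch that sticks to the pieces covered above (\d and a literal hyphen), and the names SinFormatDemo and isValidSinFormat are made up for the example. Pattern.matches() takes care of the “NO MORE characters at all” rule because it only succeeds when the entire String matches. Note that this checks the format only, not whether the number is a real SIN.

```java
import java.util.regex.Pattern;

class SinFormatDemo {

    // Three digits, hyphen, three digits, hyphen, three digits.
    private static final String SIN_REGEX = "\\d\\d\\d-\\d\\d\\d-\\d\\d\\d";

    // True when candidate is in the 123-456-789 format with nothing extra.
    static boolean isValidSinFormat(String candidate) {
        return Pattern.matches(SIN_REGEX, candidate);
    }

    public static void main(String[] args) {
        System.out.println(isValidSinFormat("123-456-789"));   // true
        System.out.println(isValidSinFormat("123-456-7890"));  // false, one digit too many
        System.out.println(isValidSinFormat("123456789"));     // false, hyphens missing
        System.out.println(isValidSinFormat("12a-456-789"));   // false, "a" is not a digit
    }
}
```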

Sorry, I have to admit that I do not really understand “possessive”. Either the result is the same as “greedy” or no match is found at all.
I am not able to create a useful example for “possessive” 🙁
Greedy and reluctant exist in several other Regex languages/engines, but possessive seems to be a Java speciality and the Oracle documentation does not help me.

Thanks for this! Hopefully this article will stick in my head.
Normally what happens is: I learn regex, I use it for a week, I come to use it again six months later and my brain has completely erased it!

One of the most important features of RegExp, in my opinion, is how transferable the knowledge is once you have mastered the black-art. RegExp has its lineage through to UNIX and can be found in a wide variety of coding systems; C++, XSLT and JavaScript (all with minor variations) to name a few.
I was concerned by your example of the asterisk as a wildcard. The term wildcard usually signifies it can be replaced by a range of other values. However, this is not what the asterisk is used for in RegExp as a general rule; the full-stop has that role. Java may be different (that is not my primary skill) but in most implementations I have used the * signified multiple (zero or more) occurrences of the preceding character or character-class.
I would also like to direct fellow listeners/viewers/readers to the following excellent on-line reference: http://www.regular-expressions.info/

From what I saw when I tested my example, the use of the asterisk was able to match zero or any characters (not just the preceding one). This does conflict with the information that you have given in your link.

Upon further study, it looks as though the asterisk is used to match the type of character which precedes it… so if you have a letter then an asterisk, you’ll match any letters. If you have a number then an asterisk, it’ll match any numbers.

Thanks for pointing this out, I’ll update the post with this new information.

Not wanting to labour the point I will make this my last post on the issue.

Consider the RegExp pattern ^Files*[1-3]$. This will match both “File1” and “Files1” but would also match “Filesss1”. However, it should not match “Filex1” as it does not comply with s*, which would comply with ^File.*[1-3]$ as dot means any character (wildcard).

In my opinion RegExp is a vital asset in a programmer’s arsenal, in a wide selection of coding systems – it just takes a bit of dedication and lateral thinking.

I understand that, thanks. But when I create a separate function, public static void calResults(), I seem to have to initiate all the variables again. If I use an int variable from public static void main, it gives the error message “cannot find symbol” and I have to re-initiate it again.

Also what are all the function headers, (ie, public static void, public static int, public void, public int, etc.) How do you decide what to use?

I’ve started listening to your podcasts during my commute time to work. Thanks for all the work you are doing!

I have a small observation about why in Java the EndIndex of the substring is exclusive: I think it is because it makes it easier to calculate the length of the string/substring we are getting, just by subtracting BeginIndex from EndIndex.