Good old regular expressions

Here are two examples that persuaded me long ago that regular expressions could be powerful. Both come from The Unix Programming Environment by Kernighan and Pike (1984).

The first problem is to produce a list of all English words that contain all five vowels exactly once and in alphabetical order.

The book creates a regular expression named alphavowels

^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$

then uses it to filter a dictionary file

egrep -f alphavowels /usr/dict/web2

This produced 16 words ranging from abstemious to majestious.

The second problem is to produce a list of all English words of at least six letters with letters appearing in increasing alphabetical order.

The book creates a regular expression named monotonic

^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$

then uses it to filter a dictionary file as before, except there is an additional filter stage.

egrep -f monotonic /usr/dict/web2 | grep '......'

This produced 17 words including common words such as almost and ghosty. Some of the more interesting results were bijoux, chintz, and egilops. Kernighan and Pike explain that egilops is a disease that attacks wheat.

Explanation

The regular expressions above are fairly long, but shorter and more transparent than a procedural program to solve the same problem. The solutions may look mysterious at first sight, but they are entirely straightforward once you know the most basic features of regular expressions.

In the first problem, the pattern [^aeiou] says to look for anything that isn’t a vowel, i.e. is a consonant (assuming entries in the dictionary file contain only letters). So the regular expression says to start at the beginning of each line and look for zero or more consonants, followed by an ‘a’, followed by zero or more consonants, followed by an ‘e’, and so on down to a ‘u’ optionally followed by consonants at the end of the line.
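To make the first pattern concrete, here is a minimal sketch that runs it against a handful of words instead of a dictionary file. The five-word sample and the variable names are my own stand-ins, not from the book, and the pattern is held in a shell variable rather than in a file read with -f.

```shell
# The alphavowels pattern from the book, held in a variable for this sketch.
alphavowels='^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$'

# Hypothetical sample list standing in for the dictionary file.
matches=$(printf '%s\n' abstemious education facetious sequoia majestious |
  grep -E "$alphavowels")

# Print the words that survive the filter, one per line.
printf '%s\n' "$matches"
```

Words like education and sequoia contain all five vowels but not in alphabetical order, so they are rejected.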

In the second problem, the question mark matches zero or one instances of a character, i.e. the character is optional. The regular expression says to start at the beginning of each line, look for an optional ‘a’, followed by an optional ‘b’, and so forth to the end of the line. Then the output is filtered by another regular expression, '......'. Since a period matches any character, and grep selects a line if the pattern matches anywhere in it, a sequence of six periods says to select words that contain at least six characters.
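The two-stage filter can be tried on a small sample as well. This is only a sketch: the six sample words and the variable names are assumptions of mine, chosen so that each stage rejects something.

```shell
# The monotonic pattern from the book: each letter is optional, in a-z order.
monotonic='^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$'

# Hypothetical sample list. "ace" passes the first stage but is too short
# for the second; "accent" and "banana" fail the first stage.
long_monotonic=$(printf '%s\n' almost accent ace chintz banana egilops |
  grep -E "$monotonic" |
  grep '......')

printf '%s\n' "$long_monotonic"
```

Note that egilops has seven letters and still gets through the second stage, which is what "at least six letters" requires.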

1) “optionally followed by a consonant at the end of the line” should be “followed by zero or more consonants…”

2) I don’t see why the first regex used a rather than a+, and the second regex used ? rather than *, since there was no restriction in the problem description that said that each vowel/letter should appear exactly once. For instance, with the asterisk version of the second regex, several more words are returned, such as “accent” and “chilly”.
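For the record, the difference between the question mark and the asterisk in the second regex can be checked on the two words mentioned above. This is a rough sketch with a made-up four-word sample; the variable names are mine.

```shell
# Original pattern: each letter at most once, so strictly increasing.
monotonic='^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$'
# Asterisk variant: each letter any number of times, so repeats are allowed.
nondecreasing='^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$'

# Hypothetical sample list.
with_question=$(printf '%s\n' accent chilly almost banana | grep -E "$monotonic")
with_star=$(printf '%s\n' accent chilly almost banana | grep -E "$nondecreasing")

printf 'question mark: %s\n' $with_question
printf 'asterisk: %s\n' $with_star
```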

Cool. I think your phrasing (“optionally followed by consonants”) is better :)

Also, regarding my comment about repeating letters: I see you updated the description of problem 1, but not problem 2, and figured out why: it indeed explicitly leaves out repeated letters, since in a sequence like “…cc…” the second c does not “appear in increasing alphabetical order” relative to the previous “c”. (This might be obvious to you or others, but I’m adding it here for the record.)

In the first problem, you state in the problem text “all five vowels”. Isn’t y also a vowel in English? Just because it can be used as a consonant doesn’t mean it can’t be a vowel. I’m not a native English speaker though.

Sten, you are right that y can act as both a vowel and a consonant. The problem with adding y to the problem is that you can’t know whether the y is being used as a vowel or a consonant with a regex solution. If you don’t care about its usage, though, you can just add on y:
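One way such a pattern might look, assuming y is simply treated as a sixth vowel that must appear once and after the u, is sketched below. The pattern name, the variable names, and the three sample words are hypothetical and only meant to illustrate the idea.

```shell
# Assumed extension of alphavowels: y joins the excluded set and is
# required once after the u.
alphavowels_y='^[^aeiouy]*a[^aeiouy]*e[^aeiouy]*i[^aeiouy]*o[^aeiouy]*u[^aeiouy]*y[^aeiouy]*$'

# Hypothetical sample list; "abstemious" has no y, so it is rejected.
with_y=$(printf '%s\n' abstemious abstemiously facetiously |
  grep -E "$alphavowels_y")

printf '%s\n' "$with_y"
```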

“Since a period matches any character, a sequence of six periods says to select only words that contain six characters.” I would drop “only” here: since grep matches parts of an input line rather than doing an exact match, this will return words of 6 characters or more, like “egilops”.
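A quick way to see the difference is to compare plain grep with grep -x, which requires the pattern to match the whole line. The sample words and variable names below are made up for illustration.

```shell
# Plain grep: the six periods may match anywhere in the line,
# so any word of six or more characters is selected.
six_or_more=$(printf '%s\n' ace almost chintz egilops | grep '......')

# grep -x: the pattern must match the entire line,
# so only words of exactly six characters are selected.
exactly_six=$(printf '%s\n' ace almost chintz egilops | grep -x '......')

printf 'six or more: %s\n' $six_or_more
printf 'exactly six: %s\n' $exactly_six
```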