2009-05-18

This whole entry could be summarized as 'use M-x re-builder' to build your
regular expressions. But let's see if I can stretch that wisdom over a couple
of lines…

For searching and replacing, regular expressions ('regexps') are a very useful
tool. For example, see the entry about getting your ip-number. I am not
going to explain regexps here – there are plenty of good references about
them. Of course, Emacs supports regexps - but it's not always so easy,
compared to e.g. Perl. I am only providing some trivial examples here; please
see Steve Yegge's post on the regexp tricks possible with then-new Emacs 22 (I
can't remember ever needing that kind of regexp-pr0n in real life though…)

Back to regexps - one of the issues with regexps in Elisp is that they need
extra quoting, that is, lots of \-escape characters; regexps can be hard to
comprehend, and this does not help… Why the extra quoting? Let's look at a
simple example. Suppose we want to search for the word cat. And not
category or concatenate. The regular expression would then be \bcat\b.

In Perl you could write this as /\bcat\b/ (in Perl you specify regexps by
putting them between /-characters).

Not so in Emacs-Lisp. On the Lisp-level, there are no regexps; there are only
strings and only the regexp functions understand their true nature. But
before the strings ever get to those functions, the Lisp interpreter does what
it does best: interpreting. And when it sees \b, it interprets it as the
backspace-character.

To make it not do that, you'll need to pay the 'slash-tax' and write
something like:

(re-search-forward "\\bcat\\b")
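To see the slash-tax at work, here is a small sketch you could evaluate in *scratch* (the my/ names are made up for this illustration). It compares what the Lisp reader makes of the two spellings:

```elisp
;; Hypothetical names, for illustration only.
(defconst my/single-slash "\bcat\b"
  "The reader turns each \\b into one backspace character (code 8).")
(defconst my/double-slash "\\bcat\\b"
  "The reader turns each \\\\ into one backslash, leaving \\bcat\\b.")

(length my/single-slash)   ; 5 characters: BS c a t BS
(length my/double-slash)   ; 7 characters: \ b c a t \ b
(aref my/single-slash 0)   ; 8, the backspace character

;; Only the double-slash version reaches the regexp functions as \bcat\b.
(string-match my/double-slash "a cat sat")    ; 2
(string-match my/double-slash "concatenate")  ; nil
(string-match my/single-slash "a cat sat")    ; nil
```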

Things can get ugly quickly from there - think of when you need to search for
something with a backslash, like our regex \bcat\b itself; you'd need to do:

(re-search-forward "\\\\bcat\\\\b")
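A minimal sketch of that double escaping, using a temporary buffer so it can be tried without touching your own text (my/find-literal-regex is a made-up helper name). It also shows that regexp-quote can pay the regexp half of the tax for you:

```elisp
(defun my/find-literal-regex ()
  "Search a temp buffer for the literal text \\bcat\\b; return the match."
  (with-temp-buffer
    ;; The buffer text is: our regex \bcat\b itself
    (insert "our regex \\bcat\\b itself")
    (goto-char (point-min))
    ;; Lisp string "\\\\bcat\\\\b" -> regexp \\bcat\\b -> literal \bcat\b
    (when (re-search-forward "\\\\bcat\\\\b" nil t)
      (match-string 0))))

(my/find-literal-regex)

;; regexp-quote escapes the backslashes at the regexp level:
(string= (regexp-quote "\\bcat\\b") "\\\\bcat\\\\b")   ; t
```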

slash tax break

To make things even more interesting, in different contexts, different rules
apply. The above is all about regexps in strings in Emacs-Lisp. However,
things are different when you provide a string interactively.

Suppose you search through your buffer (with M-x isearch-forward-regexp or
C-M-s). Now, your input is not interpreted by the Lisp interpreter (after
all, it's just user input). So, you're exempt from the slash tax, and you can
type \bcat\b to match the word cat; to match the literal text \bcat\b, you
would type \\bcat\\b.
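One way to convince yourself that the two contexts differ only in quoting: build the string from the seven characters you would actually type at the prompt, and compare it with the Lisp literal. This is just an illustration (my/typed-input is a made-up name, and it only simulates the typing):

```elisp
(defun my/typed-input ()
  "Return the seven characters typed at the prompt: \\ b c a t \\ b."
  (string ?\\ ?b ?c ?a ?t ?\\ ?b))

;; What you type interactively is the same string as the taxed literal.
(string= (my/typed-input) "\\bcat\\b")        ; t

;; And it behaves as the word-boundary regexp.
(string-match (my/typed-input) "a cat sat")   ; 2
```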

re-builder

So, regexps can be hard, and Emacs-Lisp makes it somewhat harder. A natural
way to come up with the regular expression you need, is to use
trial-and-error, and this is exactly what isearch-forward-regexp and
friends do. But what about the slash-taxed regexps that you need in your Lisp
code?

The answer is M-x re-builder. I am sure many people are already using it,
but even if there were only one person that finds out about this through
this blog-post, it'd be worth it! And this is the whole trick here: whenever
you need a regexp in your code, put the kind of string it should match in
a buffer, and enter M-x re-builder.

re-builder will open a small *RE-Builder* window containing a pair of quotes.
You type your regexp between them, and it shows the matches in your buffer as
you type. It even supports different
regex-syntaxes. By default, re-builder will help you with the
strings-in-Emacs-lisp kind of regexps; this is called the read-syntax. But you
can switch to the user-input regexps with C-c TAB string RET (yes, these are
called string here). There are some other possible syntaxes as well.
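If you prefer one syntax most of the time, you could set the default in your .emacs instead of switching each session. A small sketch, assuming you want the string syntax; reb-re-syntax is the re-builder option behind C-c TAB. The rx form is included only to show what one of the other syntaxes looks like:

```elisp
(require 're-builder)
(require 'rx)

;; Start re-builder in the untaxed 'string' syntax rather than 'read'.
(setq reb-re-syntax 'string)

;; The rx syntax spells the regexp as Lisp forms; this evaluates to a
;; regexp string that behaves like "\\bcat\\b".
(defconst my/cat-rx (rx word-boundary "cat" word-boundary)
  "Hypothetical example constant: the cat regexp written with rx.")

(string-match my/cat-rx "a cat sat")     ; 2
(string-match my/cat-rx "concatenate")   ; nil
```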

One final trick for re-builder is the subexpression mode, which you
activate with C-c C-e (and leave with q). You can then see what the
subexpressions match; e.g. we can match cat, cut, cot etc. with
\\bc\\(.\\)t\\b, and the subexpression would then contain the middle
letter. re-builder automatically converts between the syntaxes it supports,
so you could use 'string-mode' as well: \bc\(.\)t\b.
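Once re-builder shows the subexpression you want, using it from Lisp is a matter of match-string. A minimal sketch (my/middle-letter is a made-up helper, not something re-builder provides):

```elisp
(defun my/middle-letter (word)
  "Return the letter between c and t in WORD, or nil if WORD does not match."
  (when (string-match "\\bc\\(.\\)t\\b" word)
    ;; Group 1 is the \\(.\\) subexpression.
    (match-string 1 word)))

(my/middle-letter "cut")                                  ; "u"
(mapcar #'my/middle-letter '("cat" "cut" "cot" "coat"))   ; ("a" "u" "o" nil)
```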