RegEx Quick Reference

We use regular expressions to parse firewall logs. Although the full syntax of regular expressions is
complex, parsing logs usually takes only a few regex operations. This reference is aimed at people who
are writing a complete log line parser. If you only want some hints on how to phrase the regex
for the line_filter variable in dshield.cnf, then you can just skim and pick up the highlights.

line_filter and line_exclude variables

Most of the time you can just type in the string that you want to match.

line_filter=input deny

The only time you need to worry about regular expression syntax is when the string you need
to match contains a regex metacharacter. (See below.) In that case, put a "\"
before the metacharacter. For example, if you wanted to match "kernel 2.2":

line_filter=kernel 2\.2

The other likely case is that you want to match several alternatives. For instance, if you wanted to match
either "input deny" or "input reject":

line_filter=input (deny|reject)

Parsing log lines

The general idea is to set up a regular expression that matches the different parts of the log line, with the
portions that you want assigned to variables delimited by parentheses. Short Perl example. Assume that $line
contains the "raw" log line that needs to be converted:
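A minimal sketch of such a match, written to agree with the description that follows. The sample log line (including the host name "fw") is made up for illustration, and real parsers may structure the assignment differently:

```perl
use strict;
use warnings;

# Made-up sample line; real firewall logs will differ.
my $line = "Apr 24 10:15:32 fw kernel: input deny";

my ($month, $day);
if ($line =~ /([A-Z][a-z]{2}) +(\d{1,2})/) {
    # $1 and $2 hold whatever the first and second pairs of parentheses matched.
    ($month, $day) = ($1, $2);
    print "month=$month day=$day\n";
}
```
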

The portions that are delimited by parentheses are assigned to variables. In this case, we are attempting to
match a *NIX syslog date, like Apr 24 and assign the month to the $month variable and the
day to the $day variable.

[A-Z] matches one character that must be upper case. [a-z]{2} matches exactly two
characters that must be lower case. " +" (space followed by "+") skips over one or more
spaces. Then match \d{1,2} and assign to the $day variable. (Matches at least one and no
more than two "digits.")

This example only extracted the month and day from the log line. To make it workable for our purposes, you
would extend this concept and add additional variables and regular expression patterns to match the rest of
the log.

Very short regex reference

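These "metacharacters" have special meanings in a pattern and need escaping with \ if they are
to appear as literal characters in your pattern.

[](){}\.*?-+^$@|

Two of these are special only in some contexts. "-" denotes a range inside a character class such as
[A-Z] and is a literal character elsewhere. "@" is not a regex operator, but Perl reads it as the start
of an array name inside a pattern. Escaping either one is always safe.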

Example. If you were parsing a log line and the log put the protocol in square brackets, like "[tcp]", you'd
have to phrase the regex pattern like \[tcp\] so that the square bracket characters [] are treated
like literal characters and not as the beginning of a character class. [tcp] would match a single
character that is either "t" or "c" or "p", which probably isn't what you want to do.
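The difference can be checked directly. This is a minimal sketch, and both sample strings are made up for illustration:

```perl
use strict;
use warnings;

# Made-up sample strings, not taken from a real log.
my $bracketed = "proto [tcp] port 80";
my $plain     = "accept";

# Escaped brackets: matches only the literal text "[tcp]".
my $literal_on_bracketed = ($bracketed =~ /\[tcp\]/) ? 1 : 0;
my $literal_on_plain     = ($plain     =~ /\[tcp\]/) ? 1 : 0;

# Unescaped brackets: a character class, so any single "t", "c" or "p" matches.
my $class_on_plain = ($plain =~ /[tcp]/) ? 1 : 0;

print "escaped pattern on bracketed text: $literal_on_bracketed\n";
print "escaped pattern on plain text:     $literal_on_plain\n";
print "character class on plain text:     $class_on_plain\n";
```

The unescaped pattern matches "accept" even though that string has no protocol field at all, which is the kind of false match the escaping avoids.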

.

Matches any single character except for \n (newline).

*

Modifier. Zero or more of the preceding character. Use ".*" to match a bunch of arbitrary characters.
"*" by itself probably won't do what you want, because "*" is a modifier and needs something before it to modify.

+

Modifier. One or more of the preceding character. Same idea as "*", except that it requires at least
one character to be present.
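To compare ".", "*" and "+" side by side, here is a small sketch. The sample strings are made up, and the ^ and $ anchors (which tie the pattern to the start and end of the string) are used only to keep the comparison strict:

```perl
use strict;
use warnings;

# Made-up samples: "a" and "b" separated by zero, one and three spaces.
my @samples = ("ab", "a b", "a   b");

my %result;
for my $s (@samples) {
    my $star = ($s =~ /^a *b$/) ? "yes" : "no";   # zero or more spaces
    my $plus = ($s =~ /^a +b$/) ? "yes" : "no";   # one or more spaces
    my $dot  = ($s =~ /^a.b$/)  ? "yes" : "no";   # exactly one character of any kind
    $result{$s} = "$star $plus $dot";
    printf "%-8s *: %-3s  +: %-3s  .: %s\n", "\"$s\"", $star, $plus, $dot;
}
```

Only the "*" version accepts "ab", and only "." rejects the three-space sample, because "." stands for exactly one character.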

Examples

Stick these together to make a regex that can match all the parts of the log line that you are parsing. See
above example for the syntax to assign these to variables. Note that these examples show parentheses,
because you generally want to assign these to a variable. If you just want to match but not assign to a
variable, then don't use the parentheses.

([A-Z][a-z]{2})

Matches month formatted like "Apr". First character must be upper case, second and third characters,
lower case. Note that the month is still in text format and you need to do an additional translation
to get it into the "MM" numerical format that the DShield format requires (2002-05-24). Look at one
of the existing parsers to see how to do this.
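One common way to do the translation is a lookup table. The sketch below is an illustration only and is not taken from the existing parsers. The sample line is made up, and the year is assumed to be known from elsewhere, since a syslog date does not carry one:

```perl
use strict;
use warnings;

# Lookup table from month abbreviation to two-digit month number.
my %month_num = (
    Jan => '01', Feb => '02', Mar => '03', Apr => '04',
    May => '05', Jun => '06', Jul => '07', Aug => '08',
    Sep => '09', Oct => '10', Nov => '11', Dec => '12',
);

my $line = "May 24 08:01:15 fw kernel: input deny";   # made-up sample line
my $year = 2002;                                      # assumption: supplied by the caller

my $date;
if ($line =~ /([A-Z][a-z]{2}) +(\d{1,2})/ && exists $month_num{$1}) {
    # Build YYYY-MM-DD, zero-padding the day to two digits.
    $date = sprintf "%04d-%s-%02d", $year, $month_num{$1}, $2;
    print "$date\n";
}
```

The exists check matters because ([A-Z][a-z]{2}) also matches three-letter strings that are not months.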

(\d{1,2})

Matches day number. (One or two digits)

(\d+):(\d+):(\d+)

Time. HH, MM, and SS are separated by ":" and assigned to separate variables. This is a bit sloppy
because we aren't enforcing any character count. "1", "10", and "1234567890" all match \d+.
":" is one of the few punctuation characters that isn't a metacharacter, so you can
just put it there, without having to escape it with a "\".
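If the sloppiness matters, a character count can be added with {2}. This is a sketch with made-up sample strings. It still does not check ranges (99:99:99 would pass), and the counted pattern can still match inside a longer run of digits:

```perl
use strict;
use warnings;

my $sloppy  = qr/(\d+):(\d+):(\d+)/;          # pattern from the text
my $counted = qr/(\d{2}):(\d{2}):(\d{2})/;    # exactly two digits per field

my $odd = "1:2:3";   # made-up string that is not a valid HH:MM:SS time
my $sloppy_accepts_odd  = ($odd =~ $sloppy)  ? 1 : 0;
my $counted_accepts_odd = ($odd =~ $counted) ? 1 : 0;

my $line = "Apr 24 10:15:32 fw kernel: input deny";   # made-up sample line
my ($hh, $mm, $ss) = $line =~ $counted;               # captures in list context

print "sloppy pattern accepts 1:2:3:  $sloppy_accepts_odd\n";
print "counted pattern accepts 1:2:3: $counted_accepts_odd\n";
print "time from sample line: $hh $mm $ss\n";
```
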

(\d+\.\d+\.\d+\.\d+)

Matches an IP address. Note that the periods need to be escaped with "\". This is also sloppy (see the
comment above), but a precise IP regex pattern is long and complicated. See The Perl
Cookbook "Regex Grabbag" for an example of a precise IP matching regex pattern.
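A middle road is to keep the simple pattern and check the numbers afterwards in ordinary code. The helper name extract_ip below is hypothetical, and this is a sketch rather than a replacement for the Cookbook pattern:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first dotted quad found in $text if all
# four parts are in the range 0-255, otherwise returns nothing (undef).
sub extract_ip {
    my ($text) = @_;
    return unless $text =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/;
    my @octets = ($1, $2, $3, $4);
    for my $octet (@octets) {
        return if $octet > 255;
    }
    return join '.', @octets;
}

# Made-up inputs for illustration.
for my $text ("src 192.168.1.10 dst", "src 999.1.1.1 dst") {
    my $ip = extract_ip($text);
    print "$text => ", (defined $ip ? $ip : "rejected"), "\n";
}
```
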

(\d{1,5})

Matches a port. (Minimum of one and maximum of five digits)

(tcp|udp)

Alternation. Matches (lower case) "tcp" or "udp"

\->

Matches "->". The "\" before "-" is optional here, because "-" is special only inside a character class, but escaping it does no harm.

([SAFURP12])

Matches the valid single character flags

(\([SAFURP12]\))

Same as above, but assumes that the flags are stored like "(S)" in the log. So the ()
parentheses characters need to be escaped with "\".

\s+

Skip over one or more characters of whitespace. Not delimited by parentheses because we aren't
assigning this to a variable.

.*

Skips over zero or more characters of stuff that we aren't interested in. Use .+ to require
at least one character.
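Stuck together, the pieces above might look like the following sketch. The log format here is hypothetical and was invented so that every piece has something to match; a real firewall log will need its own arrangement. The $ip and $port names are only a convenience to keep the full pattern readable:

```perl
use strict;
use warnings;

# Hypothetical log line, invented for this example.
my $line = 'Apr 24 10:15:32 fw kernel: input deny tcp '
         . '192.168.1.10:1025 -> 10.0.0.1:80 (S)';

# Reusable pieces, taken from the examples above.
my $ip   = qr/\d+\.\d+\.\d+\.\d+/;
my $port = qr/\d{1,5}/;

# Month, day, time, skipped text, protocol, source, destination, flag.
my $re = qr/([A-Z][a-z]{2}) +(\d{1,2}) +(\d+):(\d+):(\d+).*(tcp|udp)\s+($ip):($port)\s+\->\s+($ip):($port)\s+\(([SAFURP12])\)/;

# In list context a successful match returns everything the parentheses captured.
my ($month, $day, $hh, $mm, $ss, $proto, $src, $sport, $dst, $dport, $flag)
    = $line =~ $re;

if (defined $month) {
    printf "%s %s %s:%s:%s %s %s:%s -> %s:%s flag=%s\n",
        $month, $day, $hh, $mm, $ss, $proto, $src, $sport, $dst, $dport, $flag;
} else {
    print "no match\n";
}
```
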

Hints

To develop a regex that can parse a whole line, start off with a short one, like the one in the above
example, and add additional variables and matching patterns one at a time. If you try to write the
entire regex pattern before you test it, you will most likely go mad trying to figure out why it doesn't
match. Remember "one at a time."
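One way to work "one at a time" is to keep each stage of the pattern and test all of them against the same sample line. This is only a sketch of the habit: the sample line is made up, and the last stage is written to fail on purpose so that the output shows where a mistake would be reported:

```perl
use strict;
use warnings;

my $line = 'Apr 24 10:15:32 fw kernel: input deny';   # made-up sample line

# Each entry adds one more piece to the pattern before it.
my @steps = (
    '([A-Z][a-z]{2})',
    '([A-Z][a-z]{2}) +(\d{1,2})',
    '([A-Z][a-z]{2}) +(\d{1,2}) +(\d+):(\d+):(\d+)',
    # Deliberately wrong: forgets the "fw kernel:" text before "input".
    '([A-Z][a-z]{2}) +(\d{1,2}) +(\d+):(\d+):(\d+) +input (deny|reject)',
);

my $first_failure;
for my $i (0 .. $#steps) {
    my $ok = ($line =~ /$steps[$i]/) ? 1 : 0;
    print "step ", $i + 1, ": ", ($ok ? "matches" : "NO MATCH"), "\n";
    if (!$ok && !defined $first_failure) {
        $first_failure = $i + 1;
    }
}
```

Because the first three stages match and the fourth does not, the problem has to be in the piece that was added last.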

This document covers most of what you need to know about regular expressions to fill in the
line_filter variable, or to write a parser. But there is much more that regular expressions can do.
See the sections on regular expressions in Learning Perl, Programming Perl, The Perl
Cookbook, or Mastering Regular Expressions (O'Reilly). Or just hit Google and search for
"regular expression tutorial".