I added a newline to make it easier to read. Tim uses the regular
expression
r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+ )"
as the filter, which matches the log file example I gave. Note that
the regular expression ends with a space and there's a space in the
square brackets. They might not have come across well in this essay.
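To make them explicit, here's the pattern in Python applied to a
made-up log line (the path is my invention); note that the captured
group keeps the trailing space:

import re

# The space before the closing quote and the space inside [^ .]
# are both part of the pattern.
pattern = re.compile(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+ )")

line = "GET /ongoing/When/200x/2007/09/23/Example HTTP/1.1"
m = pattern.search(line)
print(m.group(1))   # prints '2007/09/23/Example ' with the trailing space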

Tim isn't doing this because he needs a way to parse log files. He's
doing this as a way to get experience in using Erlang for something
he's done before and knows a lot about. I would call it a practice
piece, and Emily would say it's a kata.

Tim's conclusion was that regular expressions and file parsing in
Erlang were slow compared to languages like Ruby and Perl, which
place a stronger emphasis on text processing. This stirred up some
discussion in the Erlang community as people worked to show how to
make the code faster. The belief, perhaps an article of faith, is
that it's easier to speed up Erlang's I/O and regexp code than it is
for other languages to match Erlang's concurrency support.

Fredrik Lundh wrote a
set of notes on Wide Finder, starting with Santiago's code. He
shows how to speed it up by being smarter with the code and by taking
advantage of a dual-core machine. He described what he did through a
series of related programs, available from its
SVN repository. I summarize each one here along with timings,
both as given by Fredrik in his README.txt
file and as measured by me on my MacBook Pro with a 2.33 GHz Intel Core 2 Duo:

I started by implementing my own version using just the standard
Python libraries. The biggest speed gain came from using a
memory-mapped file instead of normal file I/O, but Fredrik only
showed how that helped the multiprocessing version. I wanted to see
how much it sped up the single threaded version.

I used wf-2.py as my base and added to it. I think it's best to start
something like this without really reviewing other code. I didn't
look at the programs in much detail. If this is a practice piece then
part of the practice is being able to solve the problem in different
ways, and with relatively little guidance. I also find that
working out a problem first helps me understand why other implementations
were done in a given way.

Python has a concept called the buffer
interface. Strings and a few other data types, including
memory-mapped files, implement the buffer interface. String
operations (like 'find') and regular expressions can work on buffer
objects, which means I can apply the regular expression
directly to a memory-mapped file.
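Here's a minimal sketch of that approach in modern Python; the
original programs were Python 2, so details like the bytes pattern
are mine, as are the function and file names:

import mmap
import re

# A bytes pattern, since a memory-mapped file exposes bytes
# through the buffer interface.
pattern = re.compile(rb"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+ )")

def count_hits(filename):
    counts = {}
    with open(filename, "rb") as f:
        # Map the whole file; the regex engine scans the mapped
        # pages directly instead of copying data into strings.
        data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for key in pattern.findall(data):
                counts[key] = counts.get(key, 0) + 1
        finally:
            data.close()
    return counts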

In my original code I used finditer instead of
findall to find all the matches. finditer returns match
objects while findall returns only the matching text. (But read the
docs, because findall returns a tuple instead of a string if there is
more than one group in the pattern.) Making the extra match objects takes
time, and I was able to make my code about 5% faster by switching to findall.
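The difference looks like this, with a made-up line:

import re

pattern = re.compile(rb"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+ )")
text = b"GET /ongoing/When/200x/2007/09/23/Example HTTP/1.1\n"

# finditer yields a full match object for every hit ...
for m in pattern.finditer(text):
    key = m.group(1)

# ... while findall skips the match-object allocation and returns
# only the captured text.  With one group each item is a (byte)
# string; with two or more groups it would be a tuple.
for key in pattern.findall(text):
    pass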

Had I not tried to implement my code first, I wouldn't have thought of
trying 'finditer'. Had I not compared my result to Fredrik's I
wouldn't have thought of using 'findall'. Becoming a better
programmer means you have to do both.

My single-threaded dalke-wf-7.py runs in 1.3s. Compare that to Fredrik's
best single-threaded version at 1.7s and his best dual-process version
wf-6.py at 1.0s. I think this is the best performance and cleanest
code you can get using this general approach, though I would like a
shorter way to open a memory-mapped file. I count 12
non-blank/non-timing lines while Tim's original Ruby code has 11. His
machine looks to be about half the speed of mine, so this Python code is
roughly 2-3 times faster than his Ruby. That's about what I've seen in
other performance comparisons between the two languages.

Some years back I spent some time working on Martel, which uses an
(almost) regular expression language as a way to parse semi-structured
flat-files as if they were XML. The element definitions come from
group names in the regular expression, and the text parsing is done
with mxTextTools. I wrote some code to have Martel convert Tim's
filter regular expression into a tag table. I did change the regexp
so that the group has a name, as group names are used as tags in the
tag table.
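In Python's named-group syntax the change looks like this (the name
'when' is just an example):

r"GET /ongoing/When/\d\d\dx/(?P<when>\d\d\d\d/\d\d/\d\d/[^ .]+ )"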

Bad news. It's REALLY SLOW. It takes about 26s to process the log
file, compared to the under 2s using the normal approach and 1.3s for
my best single-threaded code.

What's wrong? Well, one thing is that my Martel code is meant as a
validator. It assumes most characters will match and it's very bad at
searching for a match because it checks the text
character-by-character. More sophisticated algorithms, like
Boyer-Moore, can analyze the search string and skip testing characters
when there's no chance of a match.

For example, the search string "GET /ongoing/When/" only contains
the characters " /EGTWeghino". If neither text[i] nor
text[i+18] contains a character in that set then it's
impossible for that search string to exist in that range. Python's
find, index and x in s tests use a
smarter, sublinear
substring search algorithm, which Fredrik uses as a first-level filter
in his wf-2.py code.
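Paraphrased, the first-level filter looks something like this (my
sketch, not Fredrik's exact code):

import re

pattern = re.compile(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+ )")

def count_matches(lines):
    counts = {}
    for line in lines:
        # The cheap substring test rejects most lines; only the
        # survivors pay for the full regular expression search.
        if "GET /ongoing/When/" in line:
            m = pattern.search(line)
            if m:
                key = m.group(1)
                counts[key] = counts.get(key, 0) + 1
    return counts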

As you can read from the comments, mxTextTools does not use the
buffer interface. It only supports strings, which can be either 8-bit
or unicode. I had to use a chunking method instead. I read CHUNK
bytes followed by a readline to ensure that the resulting text only
contains complete lines.

This is a very common technique and it works when there are many
"records" in a large file. In this case each record is one line long.

A disadvantage of this approach is that the chunk size needs tuning.
You can either guess a reasonable chunk size, or run tests to find
the best value. I started with a 1MB chunk size and worked my way
down to find that 12KB gave me the fastest results. Much to my
surprise, 16K was slower than any other value I picked in the entire
range, other than 1K. (Memory pages and disk blocks are usually 4K so
1K is not a reasonable chunk size.)
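One way to run that test, reusing the read_chunks sketch above (the
candidate sizes are arbitrary):

import time

def time_chunk_sizes(path, sizes=(1024, 4*1024, 12*1024, 16*1024, 64*1024, 1024*1024)):
    timings = {}
    for size in sizes:
        start = time.perf_counter()
        with open(path, "rb") as f:
            for chunk in read_chunks(f, size):
                pass    # the real parsing work would go here
        timings[size] = time.perf_counter() - start
    return timings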

The best case took 0.7s and even the worst sane case at 1.2s shows
that using mxTextTools is faster than dalke-wf-7, which was my best
single-threaded code at 1.3s. But to get that speed requires using a
third-party tool and learning how to use it effectively.

I commented out the tagging code, so it does everything except
parsing. The code to do the chunking and setup for mxTextTools takes
about 0.36s, or about half of the total time.

Variable lookups in module scope are slower than lookups in local
scope, so I introduced the count_file function to get a bit
more speed. I didn't generate numbers for this one but experience
says it's nearly always a performance advantage.
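The idiom looks like this; I'm illustrating it with the regular
expression version rather than the mxTextTools tag-table code:

import re

pattern = re.compile(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+ )")

def count_file(f):
    # Locals are faster to look up than module-level globals, so
    # hoist the hot lookups into local names before the loop.
    search = pattern.search
    counts = {}
    get = counts.get
    for line in f:
        m = search(line)
        if m:
            key = m.group(1)
            counts[key] = get(key, 0) + 1
    return counts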

The resulting dalke-wf-10 code finishes in 1.0s. Yes, you read that right.
It's faster than the mmap/findall solution of dalke-wf-7.py, which took
1.3s. Still not as fast as mxTextTools at 0.7s, but this solution
uses only the standard library.