js1 has asked for the
wisdom of the Perl Monks concerning the following question:

Hi,

I have a script to parse a proxy log (node 362548) and I'm planning to use DBI to update a mysql database when the sql insert command reaches 2MB (node 362109).

In order to run this script as quickly as possible would I be right to run the database update whilst I carry on parsing the log? I presume I need some sort of thread to do this (not something I've done before in perl).

Could someone tell me if I'm on the right track here and if not point me in the right direction? If threads are the way to go where would be the best place to start?

Also this script will run on a machine with 2 cpus. Will the threading enable it to use both cpu's?

Both reading the log and updating the database are apt to be I/O bound, so threads or a forked process make sense (particularly if the log and the DB are on different spindles). I'm not too familiar with Perl threads, so let's fork the DB updater and ship data over to it through a pipe from the parent.
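A minimal sketch of that fork-plus-pipe arrangement (the parsing and batching are stubbed out with comments; `parse_line()` and `flush_to_db()` are hypothetical stand-ins for the OP's code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

pipe(my $reader, my $writer) or die "pipe failed: $!";

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {                      # child: the DB updater
    close $writer;
    while (my $record = <$reader>) {
        chomp $record;
        # accumulate records here and flush_to_db() once the batch
        # reaches the size threshold the OP mentioned
    }
    exit 0;
}

close $reader;                        # parent: parse the log, ship records over
while (my $line = <STDIN>) {
    my $record = $line;               # parse_line($line) in the real script
    print {$writer} $record;          # blocks if the pipe's buffer fills up
}
close $writer;                        # EOF tells the child to finish up
waitpid($pid, 0);
```

Note that writing to the pipe blocks when its buffer is full, so the parent automatically slows down if the DB updater falls behind.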

I seem to recall having problems with database connections 'going away' when using a database handle inside forked-off child processes. I resolved this by having each child process recreate its own database handle.

Another thing to watch: if the DB and the log reside on the same disk, you could see decreased performance from doing heavy I/O on both at once. In that case it may be worth trying to read the whole log into memory first.

The issue with DBI and forking is this: you need to set $dbh->{InactiveDestroy} = 1 in all of your child processes. Otherwise, when one child process dies it may kill database connections in other processes. (This is because, as part of the DESTROY for the handle, it tells the database server that it is done with the connection... even if there's another process that is not done with it.)

Also be warned of the greater conceptual issue that you can only use a DB connection in one process. The connection may exist in other processes (hence the warning above about setting InactiveDestroy), but it can only be used in one.
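A sketch of the pattern described above: connect in the parent, mark the inherited handle InactiveDestroy in the child, and give the child its own fresh connection. The DSN and credentials are placeholders.

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:proxylog', 'user', 'pass',
                       { RaiseError => 1 })
    or die DBI->errstr;

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: never use the inherited $dbh. Mark it InactiveDestroy so its
    # DESTROY does not disconnect the parent's server-side connection,
    # then open a fresh handle of our own.
    $dbh->{InactiveDestroy} = 1;
    my $child_dbh = DBI->connect('dbi:mysql:proxylog', 'user', 'pass',
                                 { RaiseError => 1 });
    # ... do the inserts with $child_dbh ...
    $child_dbh->disconnect;
    exit 0;
}
waitpid($pid, 0);
```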

Also, no, DBI is not threadsafe (unless this has changed recently... and that would be a big deal).

One thread reads the log and feeds Q1 with the raw lines.

A second thread reads Q1, processes the records, and feeds Q2 with the data for updating the DB.

A third thread reads Q2 and fires off the updates.

I'd use a high/low watermarking system: a thread yields its timeslices once the queue it is feeding rises above the high watermark, until it drops back to the low watermark. Reading threads would yield when their input queues dropped below low water.

Then run some benchmarks to establish the best values for the watermarks.
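A runnable sketch of the watermark idea with Thread::Queue, using a single feeder and drainer (the watermark value and item counts are invented; the real numbers would come from the benchmarking step just mentioned, and this requires a threaded perl):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $HIGH = 1_000;    # made-up high watermark; benchmark to tune
my $q    = Thread::Queue->new;

my $feeder = threads->create(sub {
    for my $item (1 .. 10_000) {                  # stand-in for parsed records
        $q->enqueue($item);
        threads->yield while $q->pending > $HIGH; # yield above high water
    }
    $q->enqueue(undef);                           # end-of-data marker
});

my $drainer = threads->create(sub {
    my $n = 0;
    while (defined(my $item = $q->dequeue)) {     # dequeue blocks when empty
        $n++;                                     # stand-in for the DB insert
    }
    return $n;
});

$feeder->join;
my $processed = $drainer->join;
print "processed $processed records\n";
```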

Why do all that stuff with high and low marks, and yielding threads, if the OS can do it for you (and probably a lot more efficiently)? Just use three processes and pipes. Writing to a pipe whose buffer is full will block - causing the process to give up its timeslice - and so will reading from a pipe whose buffer is empty.

The idea is to ensure that the CPU-bound part of the process never has to wait for data, and so uses as many of the timeslices available to it as possible.

The water marks allow you to easily tailor the threading to maximise throughput.

Using the queues makes it easy to have more than one thread processing the slow part(s) of the processing. Each thread is identical, you just start more of them. They all read their input from the same queue. You don't get this easy flexibility using pipes.

If the processing of the data is the bottleneck you start two threads for that. If outputting to the DB is the bottleneck, have two threads doing that.
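A small demonstration of that flexibility: identical workers all reading from the same queue, with one end-of-data marker enqueued per worker (item counts are invented; needs a threaded perl):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $q       = Thread::Queue->new;
my $WORKERS = 2;                  # start more if this stage is the bottleneck

my @workers = map {
    threads->create(sub {
        my $done = 0;
        while (defined(my $item = $q->dequeue)) {
            $done++;              # stand-in for the slow stage
        }
        return $done;
    });
} 1 .. $WORKERS;

$q->enqueue($_)    for 1 .. 1000;
$q->enqueue(undef) for 1 .. $WORKERS;   # one terminator per worker

my $total = 0;
$total += $_->join for @workers;
print "total $total\n";                 # the workers split the 1000 items
```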

If the DB is running on the same box (with 2 CPUs) then it will likely dominate one of them and all the threads will basically share the other. If the DB is on a different box, then the CPU-bound thread may dominate one processor and the I/O and DB threads share the other.

The yielding should rarely come into play once you get the right watermark levels established, but it acts as a safeguard for the situations where either the IO or DB slows up--someone does a grep on the disk or hits the DB with a heavy query. It prevents the Q from filling memory whilst the processing at the other end is blocked.

The reasons I would try threads are:

I'm more familiar with the threading model (forking is only threading under the covers, and without the control, where I live).

I think that IPC through shared memory is more convenient and easier to program than through the flat stream of a pipe.

You can share structured data using threads. I'm not yet certain if it is up to large scale production use, but it is much improved in 5.8.3.
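For example, threads can update a shared structure directly instead of serialising it (a toy sketch; the counter and counts are invented, and this needs a threaded perl):

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# A shared hash: both threads update the same structure, guarded by lock().
my %stats : shared;
$stats{lines} = 0;

my @kids = map {
    threads->create(sub {
        for (1 .. 100) {
            lock %stats;          # one thread in this section at a time
            $stats{lines}++;
        }
    });
} 1 .. 2;
$_->join for @kids;

print "counted $stats{lines} lines\n";
```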

This final point is quite important with the OP's application. Basically he is reading lines, splitting them into chunks, and then throwing them into a DB. The DB IO is quite likely to be the slowest part of the overall processing.

If, having split the lines into chunks, he then has to serialise those chunks to pass them through a pipe to the DB process, he hasn't gained anything by splitting out the DB process.

He would then have to deserialise them on the other side, and the serialisation/deserialisation is likely to take much the same amount of time as the splitting, which negates the reason for having a separate process for the DB I/O.
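To make that cost concrete, here is the round trip a structured record would need to make through a pipe, using Storable (the record fields are invented):

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# One parsed log record, as a structure rather than a flat string.
my $record = { host => '10.0.0.1', url => '/index.html', bytes => 2048 };

my $wire = freeze($record);   # what would travel down the pipe
my $copy = thaw($wire);       # what the DB process would reconstruct

print "bytes field survived: $copy->{bytes}\n";
```

Every record pays for a freeze on one side and a thaw on the other; with shared memory, neither step is needed.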

I can't honestly say whether my thoughts would result in faster overall processing. There are too many factors involved. I don't have a dual processor machine to test on. There are many details that the OP hasn't supplied: where is the DB? How much indexing is on the DB? Is the DB shared with other applications? etc.

Until someone actually tries some of this stuff using threads, nobody knows how it will stand up. Until recently, memory leaks prevented any worthwhile testing. With 5.8.3, that seems to be getting much better to the point where it is now worth trying stuff out again.

Also this script will run on a machine with 2 cpus. Will the threading enable it to use both cpu's?

Perhaps; this will depend on various things. It'll depend on the threading implementation on the OS that's being used, but it will also depend on the scheduling policy of your OS. If, for instance, your database is running on the same machine, its process might use enough CPU time to occupy a CPU on its own.

Threads may be an answer if you can split your CPU-bound tasks in two. You only have one CPU-bound task (the actual parsing), but that can easily be split: have each thread deal with half the file. However, instead of using two threads, it's much easier to use two processes (depending on your OS, you might even get more CPU time slots that way), and you save the not-insignificant overhead Perl threads carry. You might want to use separate threads to do your I/O as well, but that is something you should only consider if it's an actual bottleneck - and if it is, you can use separate processes there too, and have them communicate with pipes.
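A hedged sketch of splitting one log between two processes by byte offset; it assumes newline-terminated records, and `parse_line()` is a hypothetical stand-in for the real parsing. Each child resyncs to a line boundary after seeking, so the line straddling the midpoint is handled exactly once (by the first half):

```perl
use strict;
use warnings;

my $log  = $ARGV[0] or die "usage: $0 logfile\n";
my $size = -s $log;

for my $half (0, 1) {
    my $pid = fork();
    die "fork: $!" unless defined $pid;
    next if $pid;                                  # parent: launch both halves

    my ($start, $end) = $half
        ? (int($size / 2), $size)
        : (0, int($size / 2));
    open my $fh, '<', $log or die "$log: $!";
    seek $fh, $start, 0;
    <$fh> if $start;                               # skip the partial line
    while (tell($fh) < $end and defined(my $line = <$fh>)) {
        # parse_line($line);
    }
    exit 0;
}
1 while wait() != -1;                              # reap both children
```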

If you want this to go faster, I would suggest not worrying about threading, and worry about the inserts instead. Inserts are slow operations on almost any (every?) database.

I don't know MySQL at all, but every other database I've run into has a "bulk insert" mode or command. If you want this to go fast, do whatever you have to in Perl to get the data into a format that can be bulk inserted. (In PostgreSQL, this would be a simple line transform, since the bulk insert format is plain text and "COPY FROM STDIN" magic.) If MySQL doesn't have a bulk insert method, you might want to reconsider your database choice.
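For what it's worth, MySQL's bulk loader is LOAD DATA INFILE (there is also a LOCAL variant for files on the client side). A sketch of driving it from DBI, with made-up database, file, table, and column names:

```perl
use strict;
use warnings;
use DBI;

# mysql_local_infile=1 lets the driver send a client-side file to the server.
my $dbh = DBI->connect('dbi:mysql:proxylog;mysql_local_infile=1',
                       'user', 'pass', { RaiseError => 1 });

# Write the parsed records as tab-separated lines, then load them in one shot.
$dbh->do(q{
    LOAD DATA LOCAL INFILE '/tmp/parsed.tsv'
    INTO TABLE access_log
    FIELDS TERMINATED BY '\t'
    (host, ts, url, bytes)
});
$dbh->disconnect;
```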

The other thing that will help is to do the bulk insert to a table that has no indexes, then move the data to an indexed table (using SQL) when the machine has spare time. This could be done with Perl and DBI, or maybe an ON DELETE trigger that then inserts the record in the indexed table.
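The SQL-only move could look like this (table names are invented; the bulk load would go into the unindexed staging table, and this runs during quiet hours):

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:proxylog', 'user', 'pass',
                       { RaiseError => 1, AutoCommit => 0 });

# Shift everything from the unindexed staging table into the indexed one,
# then clear the staging table, as a single transaction.
$dbh->do('INSERT INTO access_log_indexed SELECT * FROM access_log_staging');
$dbh->do('DELETE FROM access_log_staging');
$dbh->commit;
$dbh->disconnect;
```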

Once you're doing bulk inserts, if you are still having performance concerns, then some sort of buffering is appropriate. (both the pipe and queue suggestions above are valid, pick the one you feel most comfortable with.)

What I found is that Red Hat 9 will allocate one CPU to do computation and the other as a glorified I/O manager. This is without threading. Personally, I would see if doing it without threading is fast enough. If it is, who cares whether you could make it faster?

Remember - a requirement that most people forget to clarify is "How fast is fast enough?". The pursuit of perfection is harmful to good deliverables.


And if using a bulk insert isn't possible for some reason, first drop the indices on the table(s) involved, insert your data, then rebuild the indices. Unless the tables already contain billions of rows, rebuilding the indices is faster than modifying them millions of times.
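In MySQL terms that drop/load/rebuild cycle could look like this (the index, column, and table names are made up):

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:proxylog', 'user', 'pass',
                       { RaiseError => 1 });

$dbh->do('ALTER TABLE access_log DROP INDEX idx_host');

# ... run the big batch of inserts here, unencumbered by index updates ...

$dbh->do('ALTER TABLE access_log ADD INDEX idx_host (host)');
$dbh->disconnect;
```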

Everyone's answers to your questions are great as usual. The only thing I can add is that the documentation for Perl's threading is in perlthrtut for later versions of perl (5.6.0 and later) and, for earlier versions (5.005 up to 5.6.0), perlothrtut, which I cannot seem to find on perldoc.com right now. I hope this helps!
