On Wed, 2005-03-02 at 20:55 -0500, Tom Lane wrote:
> Michael Adler <adler(at)pobox(dot)com> writes:
> > Looking at the "Response Time Charts"
>
> > 8.0.1/ARC
> > http://www.osdl.org/projects/dbt2dev/results/dev4-010/309/rt.html
>
> > 20050301 with 2Q patch
> > http://www.osdl.org/projects/dbt2dev/results/dev4-010/313/rt.html
>
> > It seems like the average response time has gone down, but the worse
> > case ceiling has raised about 35%.
>
> The worst cases are associated with checkpoints. I'm not sure why a
> checkpoint would have a greater effect on the 2Q system than an ARC
> system --- checkpoint doesn't request any new buffers so you'd think
> it'd be independent. Maybe this says that the bgwriter is less
> effective with 2Q, so that there are more dirty buffers remaining to
> be written at the checkpoint? But why?
The pattern seems familiar. Reduced average response time increases
total throughput, which on this test means we have more dirty buffers to
write at checkpoint time.
I would not necessarily suspect 2Q over ARC, at least initially.
The pattern of behaviour is similar across ARC, 2Q and Clock, though the
checkpoint points differ in intensity. The latter makes me suspect
BufMgrLock contention or similar.
There is a two-level effect at checkpoint time: first we have the write
from PostgreSQL buffers to OS cache, then we have the write from OS
cache to disk by the pdflush daemon. At this point, I'm not certain
whether the delay is caused by the checkpointing or the pdflush daemons.
Mark and I had discussed some investigations around that. This behaviour
is new in the 2.6 kernel, so it is possible there is an unpleasant
interaction there, though I do not wish to cast random blame.
Checkpoint doesn't request new buffers, but it does require the
BufMgrLock in order to write all of the dirty buffers. It could be that
the I/Os map directly to the OS cache, so that the tight loop to write out
dirty buffers causes such an extreme backlog for the BufMgrLock that it
takes more than a minute to clear and return to normal contention.
It could be that at checkpoint time, the number of writes exceeds the
dirty_ratio and the kernel forces the checkpoint process to bypass the
cache and pdflush daemons altogether and perform the I/O itself.
Single-threaded, this would display the scalability profile we see. Some
kernel level questions in there...
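
One rough way to test the dirty_ratio idea is to sample the kernel's
dirty-page counters during a checkpoint and compare them with the
configured ratio. The sketch below assumes a Linux box exposing
/proc/meminfo and /proc/sys/vm/dirty_ratio. The function names are my
own, and the limit is only an approximation: the kernel applies the
ratio to "dirtyable" memory rather than MemTotal, so treat the result as
a hint, not a measurement.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def dirty_pressure(meminfo_text, dirty_ratio_pct):
    """Return (dirty_kb, approx_limit_kb, over_limit).

    ASSUMPTION: MemTotal stands in for the kernel's dirtyable memory,
    so approx_limit_kb overstates the real threshold somewhat.
    """
    info = parse_meminfo(meminfo_text)
    dirty_kb = info["Dirty"] + info.get("Writeback", 0)
    approx_limit_kb = info["MemTotal"] * dirty_ratio_pct // 100
    return dirty_kb, approx_limit_kb, dirty_kb >= approx_limit_kb


def read_live():
    """Sample the running system (Linux only)."""
    with open("/proc/meminfo") as f:
        meminfo = f.read()
    with open("/proc/sys/vm/dirty_ratio") as f:
        ratio = int(f.read().strip())
    return dirty_pressure(meminfo, ratio)
```

Calling read_live() once a second across a checkpoint would show whether
dirty pages approach the limit at the moment response times spike.
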
There is no documented event-state model for LWLock acquisition, so it
might be possible that there is a complex bottleneck in amongst them.
Amdahl's Law tells me that looking at the checkpoints is the next best
action for tuning, since they add considerably to the average response
time. Looking at the oprofile for the run as a whole misses the
delayed transaction behaviour that occurs during checkpoints.
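
To make the Amdahl's Law argument concrete, here is a small sketch of
how much of the average response time the checkpoint windows could
account for. The numbers in the usage note are invented for
illustration and are not taken from the DBT-2 runs; the function names
are my own.

```python
def mean_response(frac_ckpt, rt_ckpt, rt_normal):
    """Overall mean response time, given the fraction of transactions
    that fall inside checkpoint windows and the mean response time
    inside and outside those windows."""
    return frac_ckpt * rt_ckpt + (1.0 - frac_ckpt) * rt_normal


def checkpoint_share(frac_ckpt, rt_ckpt, rt_normal):
    """Fraction of total response time spent on transactions that ran
    during a checkpoint. This is the portion that tuning checkpoints
    could reduce."""
    total = mean_response(frac_ckpt, rt_ckpt, rt_normal)
    return frac_ckpt * rt_ckpt / total
```

With hypothetical inputs such as checkpoint_share(0.1, 10.0, 1.0), a
tenth of the transactions would account for roughly half of the total
response time, which is why a whole-run profile can understate the
problem.
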
I would like to try and catch an oprofile of the system while performing
a checkpoint, as a way to give us some clues. Perhaps that could be
achieved by forcing a manual checkpoint as superuser, and making that
interaction cause a switch to a new oprofile output file.
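
A possible shape for that, as a sketch only: it assumes the legacy
opcontrol interface with its --dump and --save=<session> options, that
the profiler has already been started, and that psql is run as a
superuser. The database name "dbt2" and the session naming are
placeholders of mine.

```python
import subprocess


def checkpoint_profile_plan(session, dbname="dbt2"):
    """Build the command sequence: save the samples collected so far,
    force a checkpoint, then save the checkpoint-time samples under a
    separate session name.

    ASSUMPTIONS: opcontrol is already running; dbname and the session
    suffixes are hypothetical."""
    return [
        ["opcontrol", "--dump"],
        ["opcontrol", "--save=" + session + "-before"],
        ["psql", "-d", dbname, "-c", "CHECKPOINT"],
        ["opcontrol", "--dump"],
        ["opcontrol", "--save=" + session + "-checkpoint"],
    ]


def run_plan(plan, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in plan:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Running run_plan(checkpoint_profile_plan("run1")) prints the commands
without executing anything, so the sequence can be checked before it is
tried on the test system.
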
Best Regards, Simon Riggs