On 01/16/2012 11:00 AM, Robert Haas wrote:
> Also, I am still struggling with what the right benchmarking
> methodology even is to judge whether
> any patch in this area "works". Can you provide more details about
> your test setup?
The "test" setup is a production server with a few hundred users at peak
workload, reading and writing to the database. Each RAID controller
(couple of them with their own tablespaces) has either 512MG or 1GB of
battery-backed write cache. The setup that leads to the bad situation
happens like this:
-The steady stream of backend writes that happen between checkpoints
have filled up most of the OS write cache. A look at /proc/meminfo
shows around 2.5GB "Dirty:"
-Since we have shared_buffers set to 512MB to try and keep checkpoint
storms from being too bad, there might be 300MB of dirty pages involved
in the checkpoint. The write phase dumps this all into Linux's cache.
There's now closer to 3GB of dirty data there. @64GB of RAM, this is
still only 4.7% though--just below the effective lower range for
dirty_background_ratio. Linux is perfectly content to let it all sit there.
-Sync phase begins. Between absorption and the new checkpoint writes,
there are >300 segments to sync waiting here.
-The first few syncs force data out of Linux's cache and into the BBWC.
Some of these return almost instantly. Others block for a moderate
number of seconds. That's not necessarily a showstopper, on XFS at
least. So long as the checkpointer is not being given all of the I/O in
the system, the fact that it's stuck waiting for a sync doesn't mean the
server is unresponsive to the needs of other backends. Early data might
look like this:
DEBUG: Sync #1 time=21.969000 gap=0.000000 msec
DEBUG: Sync #2 time=40.378000 gap=0.000000 msec
DEBUG: Sync #3 time=12574.224000 gap=3007.614000 msec
DEBUG: Sync #4 time=91.385000 gap=2433.719000 msec
DEBUG: Sync #5 time=2119.122000 gap=2836.741000 msec
DEBUG: Sync #6 time=67.134000 gap=2840.791000 msec
DEBUG: Sync #7 time=62.005000 gap=3004.823000 msec
DEBUG: Sync #8 time=0.004000 gap=2818.031000 msec
DEBUG: Sync #9 time=0.006000 gap=3012.026000 msec
DEBUG: Sync #10 time=302.750000 gap=3003.958000 msec
[Here 'gap' is a precise measurement of how close the sync pause feature
is working, with it set to 3 seconds. This is from an earlier version
of this patch. All the timing issues I used to measure went away in the
current implementation because it doesn't have to worry about doing
background writer LRU work anymore, with the checkpointer split out]
But after a few hundred of these, every downstream cache is filled up.
The result is seeing more really ugly sync times, like #164 here:
DEBUG: Sync #160 time=1147.386000 gap=2801.047000 msec
DEBUG: Sync #161 time=0.004000 gap=4075.115000 msec
DEBUG: Sync #162 time=0.005000 gap=2943.966000 msec
DEBUG: Sync #163 time=962.769000 gap=3003.906000 msec
DEBUG: Sync #164 time=45125.991000 gap=3033.228000 msec
DEBUG: Sync #165 time=4.031000 gap=2818.013000 msec
DEBUG: Sync #166 time=212.537000 gap=3039.979000 msec
DEBUG: Sync #167 time=0.005000 gap=2820.023000 msec
...
DEBUG: Sync #355 time=2.550000 gap=2806.425000 msec
LOG: Sync 355 files longest=45125.991000 msec average=1276.177977 msec
At the same time #164 is happening, that 45 second long window, a pile
of clients will get stuck where they can't do any I/O. The RAID
controller that used to have a useful mix of data is now completely
filled with >=512MB of random writes. It's now failing to write as fast
as new data is coming in. Eventually that leads to pressure building up
in Linux's cache. Now you're in the bad place: dirty_background_ratio
is crossed, Linux is now worried about spooling all cached writes to
disk as fast as it can, the checkpointer is sync'ing its own important
data to disk as fast as it can too, and all caches are inefficient
because they're full.
To recreate a scenario like this, I've realized the benchmark needs to
have a couple of characteristics:
-It has to focus on transaction latency instead of throughput. We know
that doing syncs more often will lower throughput due to reduced
reordering etc.
-It cannot run at maximum possible speed all the time. It needs to be
the case that the system keeps up with the load during the rest of the
time, but the sync phase of checkpoints causes I/O to queue faster than
it's draining, thus saturating all caches and then blocking backends.
Ideally, "Dirty:" in /proc/meminfo will reach >90% of the
dirty_background_ratio trigger line around the same time the sync phase
starts.
-There should be a lot of clients doing a mix of work. The way Linux
I/O works, the scheduling for readers vs. writers is complicated, and
this is one of the few areas where things like CFQ vs. Deadline matter.
I've realized now one reason I never got anywhere with this while
running pgbench tests is that pgbench always runs at 100% of capacity.
It fills all the caches involved completely as fast as it can, and every
checkpoint starts with them already filled to capacity. So when latency
gets bad at checkpoint time, no amount of clever reordering will help
keep those writes from interfering with other processes. There just
isn't any room to work with left.
What I think is needed instead is a write-heavy benchmark with a think
time in it, so that we can dial the workload up to, say, 90% of I/O
capacity, but that spikes to 100% when checkpoint sync happens. Then
rearrangements in syncing that reduces caching pressure should be
visible as a latency reduction in client response times. My guess is
that dbt-2 can be configured to provide such a workload, and I don't see
a way forward here except for me to fully embrace that and start over
with it.
> Just one random thought: I wonder if it would make sense to cap the
> delay after each sync to the time spending performing that sync. That
> would make the tuning of the delay less sensitive to the total number
> of files, because we won't unnecessarily wait after each sync when
> they're not actually taking any time to complete.
This is one of the attractive ideas in this area that didn't work out so
well when tested. The problem is that writes into a battery-backed
write cache will show zero latency for some time until the cache is
filled...and then you're done. You have to pause anyway, even though it
seems write speed is massive, to give the cache some time to drain to
disk between syncs that push data toward it. Even though it absorbed
your previous write with no delay, that doesn't mean it takes no time to
process that write. With proper write caching, that processing is just
happening asynchronously.
This is related to another observation, noting what went wrong when we
tried deploying my fully auto-tuning sync spread patch onto production.
If the sync phase of the checkpoint starts to fall behind, and you've
configured for a sync pause, you have to just suck that up and accept
you'll finish late[1]. When you do get into the situation where the
cache is completely filled, writes will slow dramatically. In the above
log example, sync #164 taking 45 seconds means that #165 will surely be
considered behind schedule now. If you use that feedback to then reduce
the sync pause, feeling that you are behind schedule and cannot afford
to pause anymore, now you've degenerated right back to the original
troubled behavior: sync calls, as fast as they can be accepted by the
OS, no delay between them.
[1] Where I think I'm going to end up with this eventually now is that
setting checkpoint_sync_pause is the important tunable. The parameter
that then gets auto-tuned is checkpoint_timeout. If you have 300
relations to sync and you have to wait 10 seconds between syncs to get
latency down, the server is going to inform you an hour between
checkpoints is all you can do here.
--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com