On 2016-03-07 09:41:51 -0800, Andres Freund wrote:
> > Due to the difference in amount of RAM, each machine used different scales -
> > the goal is to have small, ~50% RAM, >200% RAM sizes:
> >
> > 1) Xeon: 100, 400, 6000
> > 2) i5: 50, 200, 3000
> >
> > The commits actually tested are
> >
> >    cfafd8be (right before the first patch)
> >    7975c5e0 Allow the WAL writer to flush WAL at a reduced rate.
> >    db76b1ef Allow SetHintBits() to succeed if the buffer's LSN ...
>
> Huh, now I'm a bit confused. These are the commits you tested? Those
> aren't the ones doing sorting and flushing?
To clarify: the reason we'd not expect to see much difference here is
that the above commits really only have an effect above noise if you
use synchronous_commit=off. Without async commit it's just one
additional gettimeofday() call and a few additional branches in the WAL
writer every wal_writer_delay.
Andres
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Hello Tomas,
One of the goals of this thread (as I understand it) was to make the overall
behavior smoother - eliminate sudden drops in transaction rate due to bursts
of random I/O etc.
One way to look at this is in terms of how much the tps fluctuates, so
let's see some charts. I've collected per-second tps measurements (using
the aggregation built into pgbench), but looking at those directly is
pretty pointless because it's very difficult to compare two noisy lines
jumping up and down.
So instead let's look at the CDF of the per-second tps measurements.
That is, we have 3600 tps measurements, and for a given tps value the
question is what percentage of the measurements falls below that value.
y = Probability(tps <= x)
We prefer higher values, and the ideal behavior would be to get exactly
the same tps every second; thus an ideal CDF would be a step function.
Of course, that's rarely the case in practice. But comparing two CDF
curves is easy: the line further to the right is better, at least for
tps measurements, where we prefer higher values.
Very nice and interesting graphs!
Alas, it is not easy to interpret for the HDD: there are better/worse
variations all along the distribution and the lines cross one another,
so how it fares overall is unclear.
Maybe a simple indication would be to compute the standard deviation of
the per-second tps? The median may be interesting as well.
I do have some more data, but those are the most interesting charts. The rest
usually shows about the same thing (or nothing).
Overall, I'm not quite sure the patches actually achieve the intended goals.
On the 10k SAS drives I got better performance, but apparently much more
variable behavior. On SSDs, I get a bit worse results.
Indeed.
--
Fabien.

On Sat, Feb 20, 2016 at 5:08 AM, Fabien COELHO wrote:
>> Kernel 3.2 is extremely bad for PostgreSQL, as the vm seems to amplify IO
>> somehow. The difference to 3.13 (the latest LTS kernel for 12.04) is huge.
>>
>>
>> https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu
>
>
> Interesting! To summarize it: a 25% performance degradation from the best
> kernel (2.6.32) to the worst (3.2.0), which is indeed significant.
As far as I recall, OS cache eviction is very aggressive in 3.2, so data
that was just read into the FS cache could be evicted before it was even
used. This makes a large difference when the database does not fit in
RAM.
--
Michael

Hallo Patric,
Kernel 3.2 is extremely bad for PostgreSQL, as the vm seems to amplify
IO somehow. The difference to 3.13 (the latest LTS kernel for 12.04) is
huge.
https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu
Interesting! To summarize it: a 25% performance degradation from the best
kernel (2.6.32) to the worst (3.2.0), which is indeed significant.
You might consider upgrading your kernel to 3.13 LTS. It's quite easy
[...]
There is other stuff running on that hardware which I do not wish to
touch, so upgrading this particular host is currently not an option;
otherwise I would have switched to trusty.
Thanks for the pointer.
--
Fabien.

Hello.
Based on these results I think 32 will be a good default for
checkpoint_flush_after? There are a few cases where 64 showed a
benefit, and some where 32 is better. I've seen 64 perform a bit
better in some cases here, but the differences were not too big.
Yes, these many runs show that 32 is basically as good or better than 64.
I'll do some runs with 16/48 to have some more data.
I gather that you didn't play with
backend_flush_after/bgwriter_flush_after, i.e. you left them at their
default values? Especially backend_flush_after can have a significant
positive and negative performance impact.
Indeed, configuration options not reported here have their default
values. There were also minor changes to the default options for logging
(prefix, checkpoint, ...), but nothing significant, and always the same
for all runs.
[...] Ubuntu 12.04 LTS (precise)
That's with 12.04's standard kernel?
Yes.
checkpoint_flush_after = { none, 0, 32, 64 }
Did you re-initdb between the runs?
Yes, all runs are from scratch (initdb, pgbench -i, some warmup...).
I've seen massively varying performance differences due to autovacuum
triggered analyzes. It's not completely deterministic when those run,
and on bigger scale clusters analyze can take ages, while holding a
snapshot.
Yes, I agree that the performance changes between long and short runs
(andres00c vs andres00b) are probably due to autovacuum.
--
Fabien.

Hi,
On 2016-02-19 10:16:41 +0100, Fabien COELHO wrote:
> Below the results of a lot of tests with pgbench to exercise checkpoints on
> the above version when fetched.
Wow, that's a great test series.
> Overall comments:
> - sorting & flushing is basically always a winner
> - benchmarking with short runs on large databases is a bad idea
>   the results are very different if a longer run is used
>   (see andres00b vs andres00c)
Based on these results I think 32 will be a good default for
checkpoint_flush_after? There are a few cases where 64 showed a
benefit, and some where 32 is better. I've seen 64 perform a bit
better in some cases here, but the differences were not too big.
I gather that you didn't play with
backend_flush_after/bgwriter_flush_after, i.e. you left them at their
default values? Especially backend_flush_after can have a significant
positive and negative performance impact.
> 16 GB 2 cpu 8 cores
> 200 GB RAID1 HDD, ext4 FS
> Ubuntu 12.04 LTS (precise)
That's with 12.04's standard kernel?
> postgresql.conf:
>    shared_buffers = 1GB
>    max_wal_size = 1GB
>    checkpoint_timeout = 300s
>    checkpoint_completion_target = 0.8
>    checkpoint_flush_after = { none, 0, 32, 64 }
Did you re-initdb between the runs?
I've seen massively varying performance differences due to autovacuum
triggered analyzes. It's not completely deterministic when those run,
and on bigger scale clusters analyze can take ages, while holding a
snapshot.
> Hmmm, interesting: maintenance_work_mem seems to have some influence on
> performance, although it is not too consistent between settings, probably
> because as the memory is used to its limit the performance is quite
> sensitive to the available memory.
That's probably because of the differing behaviour of autovacuum/vacuum,
which sometimes will have to do several scans of the tables if there are
too many dead tuples.
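One knob that directly affects the number of those passes is
maintenance_work_mem: vacuum does an extra pass over the indexes each
time its array of dead tuple TIDs fills that memory, so a larger value
means fewer passes. An illustrative postgresql.conf fragment (the value
is just an example):

```
# a bigger dead-tuple array means fewer index scan passes per vacuum
maintenance_work_mem = 256MB
```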
Regards,
Andres

On 2016-02-18 09:51:20 +0100, Fabien COELHO wrote:
> I've looked at these patches, especially the whole bench of explanations and
> comments which is a good source for understanding what is going on in the
> WAL writer, a part of pg I'm not familiar with.
>
> When reading the patch 0002 explanations, I had the following comments:
>
> AFAICS, there are several levels of actions when writing things in pg:
>
> 0: the thing is written in some internal buffer
>
> 1: the buffer is advised to be passed to the OS (hint bits?)
Hint bits aren't related to OS writes. They're about information like
'this transaction committed' or 'all tuples on this page are visible'.
> 2: the buffer is actually passed to the OS (write, flush)
>
> 3: the OS is advised to send the written data to the io subsystem
> (sync_file_range with SYNC_FILE_RANGE_WRITE)
>
> 4: the OS is required to send the written data to the disk
> (fsync, sync_file_range with SYNC_FILE_RANGE_WAIT_AFTER)
We can't easily rely on sync_file_range(SYNC_FILE_RANGE_WAIT_AFTER) -
the guarantees it gives aren't well defined, and actually changed across
releases.
0002 is about something different: the WAL writer, which writes WAL to
disk so individual backends don't have to. It does so in the background
every wal_writer_delay, or whenever a transaction asynchronously
commits. The reason this interacts with checkpoint flushing is that,
when we flush writes at a regular pace, the writes by the checkpointer
happen in between the very frequent writes/fdatasync() by the WAL
writer. That means the disk's caches are flushed every fdatasync(),
which causes considerable slowdowns. On a decent SSD the WAL writer,
before this patch, often did 500-1000 fdatasync()s a second; the
regular sync_file_range calls slowed things down too much.
That's what caused the large regression when using checkpoint
sorting/flushing with synchronous_commit=off. With that fixed - often a
performance improvement on its own - I don't see that regression anymore.
> After more consideration, my final understanding is that this behavior
> only occurs with "asynchronous commit", i.e. a situation where COMMIT
> does not wait for the data to actually be fsynced, but the fsync is
> expected to occur within some delay so it will not be too far away: a
> compromise for performance in which recent commits can be lost.
Right.
> Now all this is somewhat alien to me, because the whole point of
> committing is getting the data to disk, and I would not consider a
> database safe if commit did not imply fsync; but I understand that
> people may have to compromise for performance.
It's obviously not applicable to every scenario, but in a *lot* of
real-world scenarios a sub-second loss window doesn't have any actual
negative implications.
Andres

On 2016-02-11 19:44:25 +0100, Andres Freund wrote:
> The first two commits of the series are pretty close to being ready. I'd
> welcome review of those, and I plan to commit them independently of the
> rest as they're beneficial independently. The most important bits are
> the comments and docs of 0002 - they weren't particularly good
> beforehand, so I had to rewrite a fair bit.
>
> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
> potential regressions of 0002
> 0002: Fix the overaggressive flushing by the wal writer, by only
> flushing every wal_writer_delay ms or wal_writer_flush_after
> bytes.
I've pushed these after some more polishing, now working on the next
two.
Greetings,
Andres Freund

Hello Andres,
0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
potential regressions of 0002
0002: Fix the overaggressive flushing by the wal writer, by only
flushing every wal_writer_delay ms or wal_writer_flush_after
bytes.
I've looked at these patches, especially the whole bench of explanations
and comments which is a good source for understanding what is going on in
the WAL writer, a part of pg I'm not familiar with.
When reading the patch 0002 explanations, I had the following comments:
AFAICS, there are several levels of actions when writing things in pg:
0: the thing is written in some internal buffer
1: the buffer is advised to be passed to the OS (hint bits?)
2: the buffer is actually passed to the OS (write, flush)
3: the OS is advised to send the written data to the io subsystem
(sync_file_range with SYNC_FILE_RANGE_WRITE)
4: the OS is required to send the written data to the disk
(fsync, sync_file_range with SYNC_FILE_RANGE_WAIT_AFTER)
It is not clear when reading the text which level is being discussed. In
particular, I'm not sure that "flush" refers to level 2, which is
misleading. When reading the description, I'm rather under the
impression that it is about level 4, but then if actual fsyncs were
performed every 200 ms the tps would be very low...
After more consideration, my final understanding is that this behavior
only occurs with "asynchronous commit", i.e. a situation where COMMIT
does not wait for the data to actually be fsynced, but the fsync is
expected to occur within some delay so it will not be too far away: a
compromise for performance in which recent commits can be lost.
Now all this is somewhat alien to me, because the whole point of
committing is getting the data to disk, and I would not consider a
database safe if commit did not imply fsync; but I understand that
people may have to compromise for performance.
Is my understanding right?
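For reference, the compromise in question is an opt-in setting, selectable per session or even per transaction; an illustrative fragment:

```sql
-- accept a small window of lost commits on crash in exchange for
-- lower commit latency; this does not risk data corruption
SET synchronous_commit = off;
```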
--
Fabien.

On Thu, Feb 11, 2016 at 1:44 PM, Andres Freund wrote:
> On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
>> Fabien asked me to post a new version of the checkpoint flushing patch
>> series. While this isn't entirely ready for commit, I think we're
>> getting closer.
>>
>> I don't want to post a full series right now, but my working state is
>> available on
>> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
>> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
>
> The first two commits of the series are pretty close to being ready. I'd
> welcome review of those, and I plan to commit them independently of the
> rest as they're beneficial independently. The most important bits are
> the comments and docs of 0002 - they weren't particularly good
> beforehand, so I had to rewrite a fair bit.
>
> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
> potential regressions of 0002
> 0002: Fix the overaggressive flushing by the wal writer, by only
> flushing every wal_writer_delay ms or wal_writer_flush_after
> bytes.
I previously reviewed 0001 and I think it's fine. I haven't reviewed
0002 in detail, but I like the concept.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
> Fabien asked me to post a new version of the checkpoint flushing patch
> series. While this isn't entirely ready for commit, I think we're
> getting closer.
>
> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
The first two commits of the series are pretty close to being ready. I'd
welcome review of those, and I plan to commit them independently of the
rest as they're beneficial independently. The most important bits are
the comments and docs of 0002 - they weren't particularly good
beforehand, so I had to rewrite a fair bit.
0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
potential regressions of 0002
0002: Fix the overaggressive flushing by the wal writer, by only
flushing every wal_writer_delay ms or wal_writer_flush_after
bytes.
Greetings,
Andres Freund
>From f3bc3a7c40c21277331689595814b359c55682dc Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Thu, 11 Feb 2016 19:34:29 +0100
Subject: [PATCH 1/6] Allow SetHintBits() to succeed if the buffer's LSN is new
enough.
Previously we only allowed SetHintBits() to succeed if the commit LSN of
the last transaction touching the page has already been flushed to
disk. We can't generally change the LSN of the page, because we don't
necessarily have the required locks on the page. But the required LSN
interlock does not require the commit record to be flushed, it just
requires that the commit record will be flushed before the page is
written out. Therefore if the buffer LSN is newer than the commit LSN,
the hint bit can be safely set.
In a number of scenarios (e.g. pgbench) this noticeably increases the
number of hint bits that are set. But more importantly it also keeps the
success rate up when flushing WAL less frequently. That was the original
reason for commit 4de82f7d7, which has negative performance consequences
in a number of scenarios. This will allow a follow-up commit to reduce
the flush rate.
Discussion: 20160118163908.gw10...@awork2.anarazel.de
---
 src/backend/utils/time/tqual.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 465933d..503bd1d 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -89,12 +89,13 @@ static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
  * Set commit/abort hint bits on a tuple, if appropriate at this time.
  *
  * It is only safe to set a transaction-committed hint bit if we know the
- * transaction's commit record has been flushed to disk, or if the table is
- * temporary or unlogged and will be obliterated by a crash anyway. We
- * cannot change the LSN of the page here because we may hold only a share
- * lock on the buffer, so we can't use the LSN to interlock this; we have to
- * just refrain from setting the hint bit until some future re-examination
- * of the tuple.
+ * transaction's commit record is guaranteed to be flushed to disk before the
+ * buffer, or if the table is temporary or unlogged and will be obliterated by
+ * a crash anyway. We cannot change the LSN of the page here because we may
+ * hold only a share lock on the buffer, so we can only use the LSN to
+ * interlock this if the buffer's LSN already is newer than the commit LSN;
+ * otherwise we have to just refrain from setting the hint bit until some
+ * future re-examination of the tuple.
  *
  * We can always set hint bits when marking a transaction aborted. (Some
  * code in heapam.c relies on that!)
@@ -122,8 +123,12 @@ SetHintBits(HeapTupleHeader tuple, Buffer buffer,
 		/* NB: xid must be known committed here! */
 		XLogRecPtr	commitLSN = TransactionIdGetCommitLSN(xid);
 
-		if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
-			return;				/* not flushed yet, so don't set hint */
+		if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN) &&
+			BufferGetLSNAtomic(buffer) < commitLSN)
+		{
+			/* not flushed and no LSN interlock, so don't set hint */
+			return;
+		}
 	}
 
 	tuple->t_infomask |= infomask;
--
2.7.0.229.g701fa7f
>From e4facce2cf8b982408ff1de174cffc202852adfd Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Thu, 11 Feb 2016 19:34:29 +0100
Subject: [PATCH 2/6] Allow the WAL writer to flush WAL at a reduced rate.
Commit 4de82f7d7 increased the WAL flush rate, mainly to increase the
likelihood that hint bits can be set quickly. More quickly set hint bits
can reduce contention around the clog et al. But unfortunately the
increased flush rate can have a significant negative performance
impact; I have measured up to a factor of ~4. The reason for this
slowdown is that if there are independent

I think I would appreciate comments to understand why/how the
ringbuffer is used, and more comments in general, so it is fine if you
improve this part.
I'd suggest to leave out the ringbuffer/new bgwriter parts.
Ok, so the patch would only include the checkpointer stuff.
I'll look at this part in detail.
--
Fabien.

On February 9, 2016 10:46:34 AM GMT+01:00, Fabien COELHO
wrote:
>
>>> I think I would appreciate comments to understand why/how the
>>> ringbuffer is used, and more comments in general, so it is fine if
>you
>>> improve this part.
>>
>> I'd suggest to leave out the ringbuffer/new bgwriter parts.
>
>Ok, so the patch would only include the checkpointer stuff.
>
>I'll look at this part in detail.
Yes, that's the more pressing part. I've seen pretty good results with the new
bgwriter, but it's not really worthwhile until sorting and flushing is in...
Andres
---
Please excuse brevity and formatting - I am writing this on my mobile phone.

Hi Fabien,
On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
>
> The main changes are that:
> 1) the significant performance regressions I saw are addressed by
>    changing the wal writer flushing logic
> 2) The flushing API moved up a couple layers, and now deals with buffer
>    tags, rather than the physical files
> 3) Writes from checkpoints, bgwriter and files are flushed, configurable
>    by individual GUCs. Without that I still saw the spikes in a lot of
>    circumstances.
>
> There's also a more experimental reimplementation of bgwriter, but I'm
> not sure it's realistic to polish that up within the constraints of 9.6.
Any comments before I spend more time polishing this? I'm currently
updating docs and comments to actually describe the current state...
Andres

Hello Andres,
Any comments before I spend more time polishing this?
I'm running tests on various settings, I'll send a report when it is done.
Up to now the performance seems as good as with the previous version.
I'm currently updating docs and comments to actually describe the
current state...
I did notice the mismatched documentation.
I think I would appreciate comments to understand why/how the ringbuffer
is used, and more comments in general, so it is fine if you improve this
part.
Minor details:
"typedefs.list" should be updated with WritebackContext.
"WritebackContext" is a typedef, so "struct" is not needed.
I'll look at the code more deeply probably over next weekend.
--
Fabien.

On 2016-02-08 19:52:30 +0100, Fabien COELHO wrote:
> I think I would appreciate comments to understand why/how the ringbuffer is
> used, and more comments in general, so it is fine if you improve this part.
I'd suggest to leave out the ringbuffer/new bgwriter parts. I think
they'd be committed separately, and probably not in 9.6.
Thanks,
Andres

Hi,
Fabien asked me to post a new version of the checkpoint flushing patch
series. While this isn't entirely ready for commit, I think we're
getting closer.
I don't want to post a full series right now, but my working state is
available on
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
The main changes are that:
1) the significant performance regressions I saw are addressed by
   changing the wal writer flushing logic
2) The flushing API moved up a couple layers, and now deals with buffer
   tags, rather than the physical files
3) Writes from checkpoints, bgwriter and files are flushed, configurable
   by individual GUCs. Without that I still saw the spikes in a lot of
   circumstances.
There's also a more experimental reimplementation of bgwriter, but I'm
not sure it's realistic to polish that up within the constraints of 9.6.
Regards,
Andres