Understanding postgresql.conf : work_mem

Specifies the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files. The value defaults to one megabyte (1MB). Note that for a complex query, several sort or hash operations might be running in parallel; each operation will be allowed to use as much memory as this value specifies before it starts to write data into temporary files. Also, several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value. Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash tables are used in hash joins, hash-based aggregation, and hash-based processing of IN subqueries.

So, it looks pretty well described, but let's see how it plays out in some examples.

We know that work_mem is the limit on memory allocated to a single operation, where the operation is one that does sorting or hashing.
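
For reference, a minimal sketch of this kind of test (the table and column names are made up, and exact timings and memory numbers will of course differ on your machine):

CREATE TABLE test AS
    SELECT i AS id, repeat('x', 50) AS padding
    FROM generate_series(1, 10) AS i;

-- Look for the "Sort Method:" line in the output,
-- e.g. "Sort Method: quicksort  Memory: 25kB"
EXPLAIN ANALYZE SELECT * FROM test ORDER BY padding;

To test larger sets (as in the table below), just increase the generate_series() upper bound and re-run the EXPLAIN ANALYZE.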

We see that for sorting 10 rows PostgreSQL used 25kB of RAM. The current limit is:

$ SHOW work_mem;
 work_mem
----------
 1MB
(1 row)

So, let's see how it works out for larger sets of data, still with 1MB of work_mem. I ran the explain 3 times for each rowcount – 10, 100, 1000, 10k, 100k, and 1M – and averaged the timings. Results:

    rows | avg. time   | Sort method
---------+-------------+-------------------------------
      10 | 0.072 ms    | quicksort  Memory: 26kB
     100 | 0.258 ms    | quicksort  Memory: 33kB
    1000 | 3.421 ms    | quicksort  Memory: 114kB
   10000 | 42.047 ms   | external merge  Disk: 448kB
  100000 | 498.615 ms  | external merge  Disk: 4440kB
 1000000 | 6721.819 ms | external merge  Disk: 44504kB

As you can see, when we hit 10k rows PostgreSQL switched from an in-memory quicksort to the external merge method. Interestingly, time per row didn't really increase, but that's just because we're (still) dealing with relatively small datasets, and my test machine has plenty of memory, so the kernel caches most of the temporary files' data.

You might wonder, though, why PostgreSQL switched to disk when it used only 448kB? After all, work_mem is 1MB. The answer is pretty simple – as I understand it, disk is used when work_mem is not enough, which means it has already been filled. So, a sort with "Disk: 448kB" would mean that more or less the whole work_mem has been used, plus 448kB of disk.

So, how does it help to have higher work_mem? The idea is that a higher work_mem lets the hash be larger, with fewer "Batches", and thus makes it faster. Why "the idea"? Because when I tried to test it, I got:

Frankly, I don't quite understand the result – perhaps someone can explain it to me in the comments. For now I assume it's just down to the specifics of the data I have here.
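
If you want to look at the "Batches" behaviour yourself, a sketch along these lines should show it (table names are made up; whether the planner picks a hash join, and how many batches it needs, depends on your data and settings):

CREATE TABLE small AS
    SELECT i AS id, repeat('x', 20) AS padding
    FROM generate_series(1, 100000) AS i;
CREATE TABLE big AS
    SELECT i AS id, repeat('y', 20) AS padding
    FROM generate_series(1, 1000000) AS i;

-- With low work_mem the Hash node should report several "Batches".
SET work_mem = '1MB';
EXPLAIN ANALYZE SELECT * FROM big JOIN small USING (id);

-- With more work_mem the same join should fit in "Batches: 1".
SET work_mem = '256MB';
EXPLAIN ANALYZE SELECT * FROM big JOIN small USING (id);

RESET work_mem;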

As I tried to show, increasing work_mem can lead to faster sorts, thanks to the sort being done in memory instead of spilling to disk. As a result, operations that depend on sorting – aggregates (sometimes), DISTINCT, merge joins – can be faster.

So, why not simply set work_mem to 1GB, and be happy with it?

The problem lies in the fact that every execution node that uses work_mem can use up to that limit.

So, if we had a query that executed a plan with, say, 101 Sort or Hash operations, we'd risk using up to 101 * work_mem of memory. In a single DB connection! A toy example of such a plan is sketched below.
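
To make this less abstract, here is a made-up query that can easily end up with several Sort and Hash nodes in a single plan (the exact plan depends on the planner, the PostgreSQL version, and the data):

EXPLAIN
SELECT *
FROM (SELECT DISTINCT id FROM big) AS b          -- HashAggregate or Sort
JOIN (SELECT id FROM small ORDER BY id) AS s     -- Sort
    USING (id)                                   -- possibly a Hash Join with its own Hash node
ORDER BY id;                                     -- top-level Sort

Each of those nodes is, independently, allowed to use up to work_mem.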

To show some perspective – let's assume you have some queries that do 4-10 Sort/Hash operations, and you have max_connections set to 100. This means that theoretically you can get up to 1000 x work_mem usage (100 connections, each running the 10-sort/hash query)! With work_mem set to 512MB, that's a theoretical ceiling of around 500GB of RAM.

The above means that while increasing work_mem is cool, you have to be careful not to create a situation where Pg will try to allocate more memory than is physically available. A lot of "out of memory" errors reported to PostgreSQL mailing lists come from users who set their work_mem to large values like 512MB or more, and then ran some complex queries – on a machine with a (relatively) small amount of RAM.

What has to be noted, though, is that you can change work_mem with a normal SQL query – and it will be changed for this particular session only.

So, it might make sense to set work_mem in postgresql.conf to a relatively low value (1-10MB), then find out which queries need more, and change the application to issue

SET work_mem = '100MB';

before running them, and

RESET work_mem;

afterwards.
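
If the query runs inside a transaction anyway, SET LOCAL is handy – it limits the change to the current transaction, so there is no way to forget the reset:

BEGIN;
SET LOCAL work_mem = '100MB';
-- run the expensive query here
COMMIT;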

This of course raises the question: how do I know which queries could use more RAM? It's simple – just change the logging configuration: set log_temp_files to 0 (to log all temp files), and then check the logs every so often. Or check the slowest queries and their execution plans, to see whether they contain disk-based sorts.
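
For reference, a minimal way of enabling it (the postgresql.conf line, plus a per-session variant for quick experiments – note that log_temp_files can only be changed by superusers):

# in postgresql.conf – 0 means: log every temporary file, regardless of its size
log_temp_files = 0

-- or, as a superuser, just for the current session:
SET log_temp_files = 0;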

As for the values of work_mem: there is a long-standing idea/rumour that work_mem is internally limited to 1GB, and that PostgreSQL will not use more than that even if you set work_mem = '10GB'. Well, I tested it, with this query:

explain analyze select * from lima where mike < 9000000 order by tango;

With work_mem set high enough, the sort was reported as done in memory, using well over 1GB – so the rumoured 1GB cap is, at the very least, not a hard limit on actual usage.

Finally, the last usage of work_mem is when you're dealing with GIN indexes:

During a series of insertions into an existing GIN index that has FASTUPDATE enabled, the system will clean up the pending-entry list whenever it grows larger than work_mem. To avoid fluctuations in observed response time, it's desirable to have pending-list cleanup occur in the background (i.e., via autovacuum). Foreground cleanup operations can be avoided by increasing work_mem or making autovacuum more aggressive. However, enlarging work_mem means that if a foreground cleanup does occur, it will take even longer.

GIN indexes are not very common, so this might be irrelevant for your case. The behaviour seems clearly described though: when you're doing lots of insertions into a GIN index – like importing data into a table which has a GIN index on a tsvector for full text search – increasing work_mem can make the process faster (as long as work_mem is larger than the pending changes in the index). A sketch of that scenario is below.
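
A minimal sketch of the bulk-load case (table, columns and the data source are made up; whether bumping work_mem actually helps depends on how big the pending list gets during the load):

CREATE TABLE docs (id serial PRIMARY KEY, body text, body_tsv tsvector);
CREATE INDEX docs_fts_idx ON docs USING gin (body_tsv);

SET work_mem = '256MB';  -- lets the GIN pending list grow larger before it is flushed
INSERT INTO docs (body, body_tsv)
    SELECT body_text, to_tsvector('english', body_text)
    FROM staging_docs;   -- hypothetical table holding the raw documents
RESET work_mem;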

10 thoughts on “Understanding postgresql.conf : work_mem”

I’ve noticed when sorting spills to disk, and I increase the work_mem a bit but not enough to put the whole sort in memory, it still uses the same amount of disk. So I’m not sure this is correct: “So, sort with “Disk: 448kB” would mean that more or less whole work_mem has been used plus 448kB of disk.”

I don’t have numbers with me but I was seeing eg:
work_mem = 256MB, used 100MB disk
work_mem = 512MB, used 100MB disk
work_mem = 1024MB, used 0 disk (suddenly fits all in memory)

I think you can set it larger in some versions of PostgreSQL, but it still won't use an amount beyond this limit. Your example only proved it will go usefully higher than 1GB; I doubt you can find any version where it will go over 2GB. On the 9.1 server I do testing on here, trying to set it higher than 2GB does give the appropriate error.

The explanation suggested for why the work_mem transition point to disk isn't where you'd expect isn't right, as demonstrated by Nathan Thom's example. The amount of RAM needed to run a quicksort and the amount needed to do the merge sort are not the same. Merge sorts actually use quite a bit less memory than a quicksort on the same size data set. You can see this in the article's sample table too. Note how in the 1000 element example, quicksort uses 114kB of RAM. A quicksort of 10000 elements would take closer to 1MB of RAM, but since it switches to the disk-based merge sort there, it's only 448kB instead. The disk-based merge sort doesn't use a work_mem's worth of caching and then spill the remainder; it's all done on disk, and the amount reported is the total.

I'm not a PG internals guy, but as to why a hash using more memory ended up being slower than a smaller hash that broke the workload into batches, the most logical explanation for me has to do with CPU caches. The smaller hash was just 544KB which, depending on the CPU, can fit entirely within the cache. The 67MB example will definitely not fit within the cache, and so will be much more reliant on the speed of main memory.

Depending on the way in which Postgres breaks the task up into smaller batches, it is conceivable that lots of small fast operations are quicker than one big but slower operation.

As for your explanation of memory usage – it definitely makes sense. Thanks. This makes me think – wouldn’t it be possible to make disk based sorts use memory for as long as they can, and then spill to disk?

@Chaw:
Increasing? Most likely no. What are your shared_buffers and work_mem settings, and the explain of the query that was causing it? Also, what OS and architecture is it (32 bit? 64 bit?), and are there any system limitations (ulimit)?