Bloat removal by tuples moving

A looong time ago, I wrote a piece about removing bloat by moving rows away from the end of the table, and then vacuuming it.

This was very slow, and it was later optimized (to some extent) by Nathan Thom, but his blogpost has since vanished. Besides, later on we got a great tool: pg_reorg (or, as it's currently named: pg_repack).

But recently I was in a position where I couldn't use pg_reorg. So I had to look for another way. And I found it 🙂

First, let me describe the situation:

I had a 350GB table, with many rows inserted but never removed or updated. Then, at some point, the client decided to remove all rows older than 4 months. The rows were removed, but this didn't decrease the table size. Why?

Because all the free pages were at the beginning of the table “file" (well, it's actually several files, but let's assume it's a single file).

A normal vacuum can shrink the table only if the trailing page(s) are empty – it simply truncates the table file by however many empty pages there are at the very end.

You can use VACUUM FULL, but it has drawbacks:

- it locks the table for the whole duration
- depending on the version, it might bloat indexes and/or take ages to finish

The problem was that a quick estimate told us that even if we removed 100% of the bloat, the table (with indexes) would still use ~150GB, and we had only 50GB of free disk space – so using pg_reorg was not an option.

In case you don't understand why pg_reorg was not an option: it copies the table. So, for some time, you have both the old, bloated table and the new, clean one – effectively using much more disk space.

When put in this situation, I recalled the compacting tool I had written almost 3 years earlier, but it was just too slow.

The problem is that when doing an UPDATE, PostgreSQL marks the previous version of the row as removed, and inserts a new one. But it tries hard to put the new row in the same page (8kB) as the previous version. So, to move all rows out of the last page, I would have to disable autovacuum and update them multiple times, just to use up *all* of the available free space in that page. And then I'd have to repeat it all for the 2nd-to-last page, and so on.

This is extremely time consuming.

But while thinking about it, I realized something: an update is internally, more or less, “delete + insert". So what if I did a delete, and then an insert of the same row? Would that also put the new row in the same page? It's testing time.

In case you don't know – ctid is a hidden system column which shows where in the file a given row is. It is always a pair of numbers: the first is the page number, starting from 0; the second is the tuple (record) number within the page, starting from 1.
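To make this concrete, here is a minimal sketch of the kind of setup the numbers below refer to (the original DDL isn't shown in this post, so the table name, columns, and payload width are assumptions):

```sql
-- Assumed test setup: a wide-ish row so only a handful of tuples fit per 8kB page.
CREATE TABLE test (id serial PRIMARY KEY, payload text);
INSERT INTO test (payload)
    SELECT repeat('x', 600) FROM generate_series(1, 1000000);

-- Inspect where the newest rows physically live:
SELECT ctid, id FROM test ORDER BY id DESC LIMIT 3;
-- a ctid like (83333,4) means: page 83333, 4th tuple within that page
```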

Based on the ctids in page 83332, I can see that Pg fits 12 rows per page. So, in the last page (83333), there are still 8 free slots.

So, let's update the last row:

$ UPDATE test SET id = id WHERE id = 1000000;
UPDATE 1

This update didn't do anything in terms of data, but it caused Pg to copy the row (its ctid changed).
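The delete-then-insert move itself can be done in a single statement with a writable CTE. The exact statement from the original test isn't preserved here; this is a reconstruction against the test table above:

```sql
-- Move the row with id = 1000000 by deleting and re-inserting it in one statement.
-- With free space earlier in the table, the new version should land in an
-- earlier page instead of staying in the last one.
WITH moved AS (
    DELETE FROM test WHERE id = 1000000 RETURNING *
)
INSERT INTO test SELECT * FROM moved;

SELECT ctid FROM test WHERE id = 1000000;  -- expect a low page number now
```

Note that writable CTEs require PostgreSQL 9.1 or newer.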

Does this mean it's a silver bullet? Well, obviously – no. If the table had triggers, it could get problematic (both on-delete and on-insert triggers would get called). The same happens with foreign keys pointing to this table. But if you don't have these (fkeys pointing to the table, or triggers on it) – then it works just great.

The only thing you have to keep in mind is to move the rows starting from the last ones.

I did it by simply running select ctid, primary_key_column from table, storing the output in a flat file, and then sorting it (outside of the database, to avoid putting too much load on it). Afterwards, I had a sorted list of primary key values to use in my moves.
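A rough sketch of that export-and-sort step, assuming psql's default unaligned output where lines look like `(83333,4)|1000000` (the file names and the page/tuple parsing are my own, not from the original):

```shell
# Assumed export step (needs a live database, shown commented out):
# psql -qAtX -c "SELECT ctid, id FROM test" mydb > ctids.txt

# Simulated input in the same format, so the pipeline below is self-contained:
printf '(1,3)|15\n(83333,4)|1000000\n(83333,2)|999998\n' > ctids.txt

# Sort by page, then tuple, descending, so the tail of the table is processed
# first; keep only the primary key values.
tr -d '()' < ctids.txt | sort -t, -k1,1nr -k2,2nr | cut -d'|' -f2 > ids_to_move.txt

cat ids_to_move.txt
```

With the simulated input, this lists 1000000 and 999998 (last page) before 15 (an early page).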

If you'd want to do it too, one bit of warning: don't expect vacuums run in the meantime to truncate the table. Each vacuum does just some part of all its work, and then exits – to avoid putting too much strain on the database.

So, when I ran two sessions:

- delete/insert, in batches of 1000 rows, each batch immediately after the previous one
- vacuum of the table, each new vacuum immediately after the previous one

All vacuum did was mark pages as free; it never got as far as actually truncating the table.

Only after I stopped the row moving did vacuum catch up with marking rows as deleted, and start to truncate the table.

All in all, after 6 hours of row moving, I ran vacuums for another 2 hours (each vacuum took ~3 minutes) to get all the disk space reclaimed – all that was possible to reclaim, that is, not all that there was, since vacuum can't reclaim free space from the middle of the table.

Oh man, this would be great for a database I have, except that it doesn’t seem to work on my 9.1 install – a delete followed by an insert gets moved to the end of the table. Is it because I’m not moving rows from the last page? If so, that makes this trick useless on busy tables. 🙁

There’s another approach that may be less tedious and is done entirely inside of PostgreSQL: use table inheritance. If it’s of interest, I can put together a post on this, but the gist of the procedure is:

1) BEGIN;
2) Rename big_table to big_table_unvacuumed;
3) Create a parent table named big_table;
4) Add big_table_unvacuumed as a child of big_table;
5) Create a new table big_table_vacuumed and add it to big_table;
6) Add an INSERT trigger to big_table to make sure it retains zero rows (not strictly necessary, but this is how I like doing things), directing new rows to the vacuumed table;
7) COMMIT;
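A rough SQL sketch of those steps (table names as in the comment; the trigger body is my assumption about what “direct new rows to the vacuumed table" might look like):

```sql
BEGIN;
ALTER TABLE big_table RENAME TO big_table_unvacuumed;
CREATE TABLE big_table (LIKE big_table_unvacuumed INCLUDING ALL);
ALTER TABLE big_table_unvacuumed INHERIT big_table;
CREATE TABLE big_table_vacuumed (LIKE big_table INCLUDING ALL);
ALTER TABLE big_table_vacuumed INHERIT big_table;

-- keep the parent empty: redirect inserts to the vacuumed child
CREATE FUNCTION big_table_redirect() RETURNS trigger AS $$
BEGIN
    INSERT INTO big_table_vacuumed VALUES (NEW.*);
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER big_table_insert
    BEFORE INSERT ON big_table
    FOR EACH ROW EXECUTE PROCEDURE big_table_redirect();
COMMIT;
```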

Once the above is done, use a wCTE exactly like what was described to DELETE+INSERT from the _unvacuumed table into big_table_vacuumed. The advantage here is that instead of having to manually specify rows in the DELETE's IN() clause, you can use “WHERE id IN (SELECT id FROM big_table_unvacuumed ORDER BY id DESC LIMIT 1000)" and you automatically get the last 1K rows. And, as you said, use a second terminal window with vacuumdb running in a loop against the unvacuumed table. Rinse and repeat until there are 0 rows left in _unvacuumed.
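One batch of that move could look like this (a sketch, assuming an `id` primary key on big_table_unvacuumed, as the comment doesn't show the exact statement):

```sql
-- Move the "last" 1000 rows (by id) from the bloated table into the fresh child.
WITH moved AS (
    DELETE FROM big_table_unvacuumed
    WHERE id IN (SELECT id FROM big_table_unvacuumed
                 ORDER BY id DESC LIMIT 1000)
    RETURNING *
)
INSERT INTO big_table_vacuumed SELECT * FROM moved;
```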

And finally, once you’re done, UNINHERIT and swap big_table_vacuumed with big_table and nuke the residual. Not ideal, but it’s handy when you’re in a jam.

One difference between update and delete+insert is what happens to concurrent updates in read committed mode. If a concurrent update comes across a row already updated by another transaction, it will wait to see if that transaction commits, and then update the new row version (if it still matches the WHERE clause). If it comes across a deleted row, it just ignores it.

Read committed mode works the way people expect in the usual cases, but replacing rows by deleting and reinserting them is the kind of thing that requires predicate locks to get perfectly right.

Nice job, depesz!
By chance, I researched this approach just last weekend (22–23 June) and didn't see this article.
I have a 4TB table, including 3.5TB of toast table.
In this case, vacuum full is a really bad idea.
Unfortunately, the table has some triggers and foreign keys.
I also came to the conclusion that updating records is not appropriate; it is compounded by the fact that the table has fillfactor = 80.
My algorithm:
1. Select relpages from pg_class – it tells you the last page in the table file.
2. Insert clones of the records from the last pages into the table, with a special mark.
3. Delete the original records.
4. Remove the special mark by updating the clones.
5. Vacuum analyze the table – it will update the relpages value and truncate the empty pages.
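A hedged sketch of those steps (big_table(id, payload) and the is_clone mark column are illustrative; it also assumes no unique constraint forbids the temporary duplicates, and note that relpages is only an estimate maintained by vacuum/analyze):

```sql
-- 1. how many pages does the table file have? (last page = relpages - 1)
SELECT relpages FROM pg_class WHERE relname = 'big_table';

-- collect the ids of rows living on, say, the last 100 pages;
-- (ctid::text::point)[0] is a trick to extract the page number
CREATE TEMP TABLE tail_ids AS
    SELECT id FROM big_table
    WHERE (ctid::text::point)[0] >= (SELECT relpages - 100 FROM pg_class
                                     WHERE relname = 'big_table');

-- 2. insert marked clones (they land in free space earlier in the file)
INSERT INTO big_table (id, payload, is_clone)
    SELECT id, payload, true FROM big_table
    WHERE id IN (SELECT id FROM tail_ids);

-- 3. delete the originals
DELETE FROM big_table WHERE id IN (SELECT id FROM tail_ids) AND NOT is_clone;

-- 4. remove the special mark
UPDATE big_table SET is_clone = false WHERE id IN (SELECT id FROM tail_ids);

-- 5. let vacuum truncate the now-empty tail pages
VACUUM ANALYZE big_table;
```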

As a result, the toast table does not change its size.
Do you have any idea how to truncate the toast table without manipulating the pg_toast.pg_toast_X table directly? And how are pg_toast records associated with table records (I need to check data integrity)?

I just wonder how fast it would be in comparison to pgcompactor (https://code.google.com/p/pgtoolkit/)? It uses the same principle (updates) as your previous compacting tool, but improved and optimized. I managed to compact a 0.5TB database (on 6 SAS RAID10) with it in <18h.