68.2. TOAST

This section provides an overview of TOAST (The Oversized-Attribute Storage
Technique).

PostgreSQL uses a fixed page
size (commonly 8 kB), and does not allow tuples to span multiple
pages. Therefore, it is not possible to store very large field
values directly. To overcome this limitation, large field values
are compressed and/or broken up into multiple physical rows. This
happens transparently to the user, with only small impact on most
of the backend code. The technique is affectionately known as
TOAST (or “the best thing since sliced
bread”). The TOAST infrastructure is also used to improve
handling of large data values in-memory.

Only certain data types support TOAST — there is no need to impose the
overhead on data types that cannot produce large field values. To
support TOAST, a data type
must have a variable-length (varlena)
representation, in which, ordinarily, the first four-byte word of
any stored value contains the total length of the value in bytes
(including itself). TOAST does
not constrain the rest of the data type's representation. The
special representations collectively called TOASTed
values work by modifying or reinterpreting this initial
length word. Therefore, the C-level functions supporting a
TOAST-able data type must be
careful about how they handle potentially TOASTed input values: an input might not
actually consist of a four-byte length word and contents until
after it's been detoasted. (This is
normally done by invoking PG_DETOAST_DATUM before doing anything with an
input value, but in some cases more efficient approaches are
possible. See Section 38.12.1 for
more detail.)

TOAST usurps two bits of
the varlena length word (the high-order bits on big-endian
machines, the low-order bits on little-endian machines), thereby
limiting the logical size of any value of a TOAST-able data type to 1 GB (230
- 1 bytes). When both bits are zero, the value is an ordinary
un-TOASTed value of the data
type, and the remaining bits of the length word give the total
datum size (including length word) in bytes. When the
highest-order or lowest-order bit is set, the value has only a
single-byte header instead of the normal four-byte header, and
the remaining bits of that byte give the total datum size
(including length byte) in bytes. This alternative supports
space-efficient storage of values shorter than 127 bytes, while
still allowing the data type to grow to 1 GB at need. Values with
single-byte headers aren't aligned on any particular boundary,
whereas values with four-byte headers are aligned on at least a
four-byte boundary; this omission of alignment padding provides
additional space savings that is significant compared to short
values. As a special case, if the remaining bits of a single-byte
header are all zero (which would be impossible for a
self-inclusive length), the value is a pointer to out-of-line
data, with several possible alternatives as described below. The
type and size of such a TOAST pointer
are determined by a code stored in the second byte of the datum.
Lastly, when the highest-order or lowest-order bit is clear but
the adjacent bit is set, the content of the datum has been
compressed and must be decompressed before use. In this case the
remaining bits of the four-byte length word give the total size
of the compressed datum, not the original data. Note that
compression is also possible for out-of-line data but the varlena
header does not tell whether it has occurred — the content of the
TOAST pointer tells that,
instead.

As mentioned, there are multiple types of TOAST pointer datums. The oldest and most
common type is a pointer to out-of-line data stored in a
TOAST
table that is separate from, but associated with, the table
containing the TOAST pointer
datum itself. These on-disk pointer
datums are created by the TOAST management code (in access/heap/tuptoaster.c) when a tuple to be
stored on disk is too large to be stored as-is. Further details
appear in Section 68.2.1.
Alternatively, a TOAST pointer
datum can contain a pointer to out-of-line data that appears
elsewhere in memory. Such datums are necessarily short-lived, and
will never appear on-disk, but they are very useful for avoiding
copying and redundant processing of large data values. Further
details appear in Section 68.2.2.

The compression technique used for either in-line or
out-of-line compressed data is a fairly simple and very fast
member of the LZ family of compression techniques. See
src/common/pg_lzcompress.c for the
details.

68.2.1. Out-of-line, on-disk TOAST
storage

If any of the columns of a table are TOAST-able, the table will have an
associated TOAST table,
whose OID is stored in the table's pg_class.reltoastrelid entry. On-disk
TOASTed values are kept in
the TOAST table, as
described in more detail below.

Out-of-line values are divided (after compression if used)
into chunks of at most TOAST_MAX_CHUNK_SIZE bytes (by default this
value is chosen so that four chunk rows will fit on a page,
making it about 2000 bytes). Each chunk is stored as a separate
row in the TOAST table
belonging to the owning table. Every TOAST table has the columns chunk_id (an OID identifying the
particular TOASTed value),
chunk_seq (a sequence number
for the chunk within its value), and chunk_data (the actual data of the chunk).
A unique index on chunk_id and
chunk_seq provides fast
retrieval of the values. A pointer datum representing an
out-of-line on-disk TOASTed
value therefore needs to store the OID of the TOAST table in which to look and the OID of
the specific value (its chunk_id). For convenience, pointer datums
also store the logical datum size (original uncompressed data
length) and physical stored size (different if compression was
applied). Allowing for the varlena header bytes, the total size
of an on-disk TOAST pointer
datum is therefore 18 bytes regardless of the actual size of
the represented value.

The TOAST management code
is triggered only when a row value to be stored in a table is
wider than TOAST_TUPLE_THRESHOLD
bytes (normally 2 kB). The TOAST code will compress and/or move field
values out-of-line until the row value is shorter than
TOAST_TUPLE_TARGET bytes (also
normally 2 kB, adjustable) or no more gains can be had. During
an UPDATE operation, values of unchanged fields are normally
preserved as-is; so an UPDATE of a row with out-of-line values
incurs no TOAST costs if
none of the out-of-line values change.

The TOAST management code
recognizes four different strategies for storing
TOAST-able columns on
disk:

PLAIN prevents either
compression or out-of-line storage; furthermore it
disables use of single-byte headers for varlena types.
This is the only possible strategy for columns of
non-TOAST-able data
types.

EXTENDED allows both
compression and out-of-line storage. This is the default
for most TOAST-able
data types. Compression will be attempted first, then
out-of-line storage if the row is still too big.

EXTERNAL allows
out-of-line storage but not compression. Use of
EXTERNAL will make substring
operations on wide text and
bytea columns faster (at the
penalty of increased storage space) because these
operations are optimized to fetch only the required parts
of the out-of-line value when it is not compressed.

MAIN allows compression
but not out-of-line storage. (Actually, out-of-line
storage will still be performed for such columns, but
only as a last resort when there is no other way to make
the row small enough to fit on a page.)

Each TOAST-able data type
specifies a default strategy for columns of that data type, but
the strategy for a given table column can be altered with
ALTER TABLE ... SET
STORAGE.

This scheme has a number of advantages compared to a more
straightforward approach such as allowing row values to span
pages. Assuming that queries are usually qualified by
comparisons against relatively small key values, most of the
work of the executor will be done using the main row entry. The
big values of TOASTed
attributes will only be pulled out (if selected at all) at the
time the result set is sent to the client. Thus, the main table
is much smaller and more of its rows fit in the shared buffer
cache than would be the case without any out-of-line storage.
Sort sets shrink also, and sorts will more often be done
entirely in memory. A little test showed that a table
containing typical HTML pages and their URLs was stored in
about half of the raw data size including the TOAST table, and that the main table
contained only about 10% of the entire data (the URLs and some
small HTML pages). There was no run time difference compared to
an un-TOASTed comparison
table, in which all the HTML pages were cut down to 7 kB to
fit.

68.2.2. Out-of-line, in-memory
TOAST storage

TOAST pointers can point
to data that is not on disk, but is elsewhere in the memory of
the current server process. Such pointers obviously cannot be
long-lived, but they are nonetheless useful. There are
currently two sub-cases: pointers to indirect data and pointers to expanded data.

Indirect TOAST pointers
simply point at a non-indirect varlena value stored somewhere
in memory. This case was originally created merely as a proof
of concept, but it is currently used during logical decoding to
avoid possibly having to create physical tuples exceeding 1 GB
(as pulling all out-of-line field values into the tuple might
do). The case is of limited use since the creator of the
pointer datum is entirely responsible that the referenced data
survives for as long as the pointer could exist, and there is
no infrastructure to help with this.

Expanded TOAST pointers
are useful for complex data types whose on-disk representation
is not especially suited for computational purposes. As an
example, the standard varlena representation of a PostgreSQL array includes dimensionality
information, a nulls bitmap if there are any null elements,
then the values of all the elements in order. When the element
type itself is variable-length, the only way to find the
N'th element is to
scan through all the preceding elements. This representation is
appropriate for on-disk storage because of its compactness, but
for computations with the array it's much nicer to have an
“expanded” or “deconstructed”
representation in which all the element starting locations have
been identified. The TOAST
pointer mechanism supports this need by allowing a
pass-by-reference Datum to point to either a standard varlena
value (the on-disk representation) or a TOAST pointer that points to an expanded
representation somewhere in memory. The details of this
expanded representation are up to the data type, though it must
have a standard header and meet the other API requirements
given in src/include/utils/expandeddatum.h. C-level
functions working with the data type can choose to handle
either representation. Functions that do not know about the
expanded representation, but simply apply PG_DETOAST_DATUM to their inputs, will
automatically receive the traditional varlena representation;
so support for an expanded representation can be introduced
incrementally, one function at a time.

TOAST pointers to
expanded values are further broken down into read-write and read-only pointers. The pointed-to
representation is the same either way, but a function that
receives a read-write pointer is allowed to modify the
referenced value in-place, whereas one that receives a
read-only pointer must not; it must first create a copy if it
wants to make a modified version of the value. This distinction
and some associated conventions make it possible to avoid
unnecessary copying of expanded values during query
execution.

For all types of in-memory TOAST pointer, the TOAST management code ensures that no such
pointer datum can accidentally get stored on disk. In-memory
TOAST pointers are
automatically expanded to normal in-line varlena values before
storage — and then possibly converted to on-disk
TOAST pointers, if the
containing tuple would otherwise be too big.

Submit correction

If you see anything in the documentation that is not correct, does not match
your experience with the particular feature or requires further clarification,
please use
this form
to report a documentation issue.