Blog | Crunchy Data

New features and better performance get a lot of attention, but one of the relatively unsung improvements in PostGIS over the past ten years has been inclusion in standard software repositories, making installation of this fairly complex extension a "one click" affair.

Once you've got PostgreSQL/PostGIS installed though, how are upgrades handled? The key is having the right versions in place, at the right time, for the right scenario and knowing a little bit about how PostGIS works.

Common table expressions, aka CTEs, aka WITH queries, are not only the gateway to writing recursive SQL queries, but also help developers write maintainable SQL. WITH query clauses can help developers who are more at home in imperative languages feel comfortable writing SQL, and they reduce redundant code by letting a single common table expression be reused multiple times in a query.

A new patch, scheduled to be a part of the PostgreSQL 12 major release later in the year, introduces the ability, under certain conditions, to inline common table expressions within a query. This is a huge feature: many developers could suddenly see their existing queries speed up significantly, and they gain the ability to explicitly specify when to inline a CTE (i.e. the planner "substitutes" a reference to the CTE in the main query and can then optimize further) or, conversely, when to materialize it (i.e. place the CTE into memory but lose out on certain planning & execution optimizations).

But why is this a big deal? Before we look into the future, first let's understand how WITH queries currently work in PostgreSQL.
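As a small illustration of the reuse described above, here is a sketch of a WITH query whose CTE is referenced twice in the main query. It is only a sketch under stated assumptions: it runs against SQLite through Python's standard library as a stand-in for PostgreSQL, and the `orders` table, its columns, and its rows are hypothetical.

```python
import sqlite3

# Assumption: SQLite (in memory) stands in for PostgreSQL so the sketch is
# self-contained. The "orders" table and its contents are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer TEXT, amount INTEGER);
INSERT INTO orders VALUES ('ann', 10), ('ann', 30), ('bob', 5), ('cat', 50);
""")

# The CTE "totals" is defined once and referenced twice:
# once as the row source and once inside the subquery computing the average.
query = """
WITH totals AS (
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
)
SELECT customer, total
FROM totals
WHERE total > (SELECT AVG(total) FROM totals)
ORDER BY customer;
"""

rows = conn.execute(query).fetchall()
print(rows)  # customers whose total is above the average total
```

In PostgreSQL 12 and later, the same WITH clause can also be written as `WITH totals AS MATERIALIZED (...)` or `WITH totals AS NOT MATERIALIZED (...)` to state the preference explicitly. That syntax is specific to PostgreSQL and is not exercised by this sketch.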

I want to take a few minutes for a deep dive into the effect your data model has on storage density when using PostgreSQL. When this topic came up with a customer, I explained my thoughts on the matter, but I realized at the time that I had never done a reasonably careful apples-to-apples test to see just exactly what the effect is, at least for a model sample size of one. So here it is.

One of the toughest challenges facing database engineers today is ensuring their data is always accessible so they can meet the high-availability requirements for their applications.

While this problem may seem trivial in the world of applications where one can have many instances behind geographically distributed load balancers, in the database world where there is only one version of the data globally, the problem faces many obstacles.

PostgreSQL replication has advanced considerably in recent major releases, including continuous improvements to streaming replication and the addition of logical replication in PostgreSQL 10. While these capabilities represent important enhancements to PostgreSQL, enabling users to address a wide variety of clustered architectures, they do not (easily) address the use cases where the application requires access to an updatable database in two or more geographic regions - often referred to as an "active-active" cluster.

More specifically, an active-active cluster is one where the application can write to any instance in the cluster and the data will be written to all of the instances in the cluster, enabling each instance in the cluster to be used to:

Provide near zero downtime as the new instance is already in a read/write state; there is no need to reconfigure it.

Provide near zero downtime upgrades from one version to another.

Improve latency for users in geographically distributed clusters: an instance physically closer to the user reduces latency.

While there are a number of proprietary solutions that attempt to address active-active PostgreSQL requirements, this post and a series to follow provide users with potential reference architectures and configurations that enable active-active PostgreSQL configurations using entirely open source software.

This post is of course only one approach to deploying an active-active PostgreSQL cluster. There are other ways to deploy an active-active setup with PostgreSQL. I will cover some ways to do this in the future - stay tuned!

This latest release provides further feature enhancements designed to support users intending to deploy large-scale PostgreSQL clusters on Kubernetes, with enterprise high-availability and disaster recovery requirements.

When combined with the Crunchy PostgreSQL Container Suite, the PostgreSQL Operator provides an open source, Kubernetes-native PostgreSQL-as-a-Service capability.

An important part of running a production PostgreSQL database system (and for that matter, any database software) is to ensure you are prepared for disaster. There are many ways to go about preparing your system for disaster, but one of the simplest and most effective ways to do this is by taking periodic backups of your database clusters.

How does one typically go about setting up a periodic backup? If you’re running PostgreSQL on a Linux-based system, the solution is often to use cron and set up a crontab entry similar to this in your superuser account:

# take a daily base backup at 1am to a mount point on an external disk
# using pg_basebackup
0 1 * * * /usr/bin/env pg_basebackup -D /your/external/mount/

However, if you’re managing tens, if not hundreds or thousands, of PostgreSQL databases, this very quickly becomes an onerous task, and you will need some automation to help you scale your disaster recovery safely and efficiently.

Automating Periodic Backups

The Crunchy PostgreSQL Operator, an application for managing PostgreSQL databases in a Kubernetes-based environment, is designed for managing thousands of PostgreSQL databases from a single interface to help with challenges like the above. One of the key features of the PostgreSQL Operator is its use of Kubernetes Labels to apply commands across many PostgreSQL databases. Later in this article, we will see how we can take advantage of labels in order to set backup policies across many clusters.

For most major upgrades, using a utility such as pg_upgrade or a replication tool such as pglogical will be the best solution. However, if these options are not available, pg_dumpall can be used to perform a major upgrade. What follows is a guide on how you can safely upgrade your database to a newer version of PostgreSQL with pg_dumpall.

Many applications these days want us to know how close we are to things:

What are the three closest coffee shops to my current location?

Which is the nearest airport to the office?

What are the two closest subway stops to the restaurant?

and countless more examples.

Another way of asking these questions is to say “who are my nearest neighbors?” This maps to a classic algorithmic problem: efficiently finding the K-nearest neighbors (or K-NN), where K is a constant. For example, the first question would be a 3-NN problem as we are trying to find the 3 closest coffee shops.

(If you are interested in learning more about K-NN problems in general, I highly recommend looking at how you can solve this using n-dimensional Voronoi diagrams, a wonderful data structure developed in the field of computational geometry.)

How can we use PostgreSQL to help us quickly find our closest neighbors? Let’s explore.
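To pin down what a K-NN query asks for, here is a minimal brute-force sketch of the 3-NN coffee shop question. It is not how a database would answer the question efficiently: it simply measures the distance to every candidate and keeps the K smallest. The shop names, the coordinates, and the `k_nearest` helper are hypothetical, and the points are treated as planar (x, y) pairs rather than real geographic locations.

```python
import heapq
import math

def k_nearest(origin, places, k):
    """Return the k places closest to origin by straight-line distance.

    places is a list of (name, (x, y)) tuples. Every place is examined,
    so the work grows linearly with the number of places.
    """
    return heapq.nsmallest(k, places, key=lambda place: math.dist(origin, place[1]))

# Hypothetical coffee shops on a flat plane.
coffee_shops = [
    ("A", (0.0, 1.0)),
    ("B", (5.0, 5.0)),
    ("C", (1.0, 1.0)),
    ("D", (0.5, 0.0)),
    ("E", (-3.0, 0.0)),
]

nearest = k_nearest((0.0, 0.0), coffee_shops, 3)
names = [name for name, _ in nearest]
print(names)  # closest first
```

Scanning every row like this is exactly the cost that an index-assisted nearest-neighbor search is meant to avoid once the number of candidates grows large.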

The PostgreSQL 11 release is nearly here (maybe in the next couple of weeks?!), and while a lot of the focus will be on the improvements to the overall performance of the system (and rightly so!), it's important to notice some features that, when used appropriately, will provide noticeable performance improvements to your applications.

One example of such a feature is the introduction of "covering indexes" for B-tree indexes. A covering index allows a user to perform an index-only scan if the select list in the query matches the columns that are included in the index. You can specify the additional columns for the index using the "INCLUDE" keyword, e.g.

CREATE INDEX a_b_idx ON x (a, b) INCLUDE (c);

Theoretically, this can reduce the amount of I/O your query needs to use in order to retrieve information (traditionally, I/O is the biggest bottleneck on database systems). Additionally, the data types included in a covering index do not need to be B-tree indexable; you can add any data type to the INCLUDE part of a CREATE INDEX statement.

However, you still need to be careful how you deploy covering indexes: each column you add to the index still takes up space on disk, and there is still a cost for maintaining the index, for example, on row updates.

Understanding these trade-offs, you can still apply covering indexes in ways that significantly help your applications.
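To make the index-only scan idea concrete, here is a rough sketch that asks a query planner whether a query can be answered from the index alone. It rests on loud assumptions: it uses SQLite through Python's standard library as a stand-in, and SQLite has no INCLUDE clause, so column `c` is added as an ordinary key column of the index instead. In PostgreSQL 11, `INCLUDE (c)` would store `c` as non-key payload. The table `x`, its extra column `d`, and the rows are hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL. SQLite lacks INCLUDE, so "c"
# is a key column of the index here rather than non-key payload.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE x (a INTEGER, b INTEGER, c TEXT, d TEXT);
INSERT INTO x VALUES (1, 2, 'wanted', 'unused'), (3, 4, 'other', 'unused');
CREATE INDEX x_a_b_c_idx ON x (a, b, c);
""")

# Every column the query touches (a, b, c) is stored in the index,
# so the planner does not need to visit the table itself.
query = "SELECT c FROM x WHERE a = 1 AND b = 2"

# The last column of each EXPLAIN QUERY PLAN row is the readable plan step.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
uses_covering_index = any("COVERING INDEX" in step for step in plan)
result = conn.execute(query).fetchall()

print(plan)
print(uses_covering_index, result)
```

Selecting `d` as well would force a lookup in the table, because `d` is not stored in the index. That is the same trade-off described above: each extra column in the index saves table reads at the price of more disk space and more index maintenance.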