AWS Aurora’s SRE Golden Signals

AWS Aurora is an increasingly popular database engine option, essentially a high-performance upgrade to standard AWS RDS. Fortunately for us, it also exposes very useful metrics that make it easy to monitor.

As noted in the MySQL SRE Golden Signals article, getting all the good Golden Signals requires a connection to the database. That is annoying and, for RDS, means running an agent or some code on a VM somewhere.

Fortunately, Aurora provides what we need through AWS CloudWatch, so once you can pull metrics from CloudWatch, you are all set.
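As a minimal sketch of pulling these metrics, the Python below builds keyword arguments for boto3's CloudWatch `get_metric_statistics` call, one request per metric. It assumes instance-level metrics in the `AWS/RDS` namespace keyed by the `DBInstanceIdentifier` dimension. The instance name, the 5-minute window, and the 60-second period are placeholders, and the signal-to-metric mapping simply restates the metrics discussed in this article. The block does not import boto3, so the requests can be inspected without AWS credentials.

```python
from datetime import datetime, timedelta, timezone

# Golden Signal -> Aurora CloudWatch metric names discussed in this article.
GOLDEN_SIGNAL_METRICS = {
    "request_rate": ["Queries", "SelectThroughput", "DMLThroughput"],
    "error_rate": ["LoginFailures", "BlockedTransactions"],
    "latency": ["SelectLatency", "DMLLatency"],
    "utilization": ["CPUUtilization", "ReadIOPS"],
}


def metric_request(instance_id, metric_name, minutes=5, period=60, now=None):
    """Build kwargs for boto3's cloudwatch.get_metric_statistics().

    Assumes instance-level Aurora metrics in the AWS/RDS namespace,
    identified by the DBInstanceIdentifier dimension.
    """
    end = now if now is not None else datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/RDS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,  # seconds per datapoint
        "Statistics": ["Average"],
    }


def golden_signal_requests(instance_id, **kwargs):
    """One request per metric, grouped by Golden Signal."""
    return {
        signal: [metric_request(instance_id, m, **kwargs) for m in metrics]
        for signal, metrics in GOLDEN_SIGNAL_METRICS.items()
    }
```

With boto3 installed and credentials configured, each request can be passed as `boto3.client("cloudwatch").get_metric_statistics(**req)`, and the averages come back in the response's `Datapoints` list.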


Mapping our signals to Aurora, we see:

Request Rate — Queries per second, which CloudWatch reports as Queries. If you need to break out read vs. write queries, there are both SelectThroughput and DMLThroughput (for inserts, updates, and deletes).
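If you do break out reads and writes, a minimal Python sketch of turning the two throughput metrics into a read/write mix might look like this. It assumes you already have both values as per-second averages for the same period (for example, from CloudWatch); a shift in the mix can help explain a change in the overall request rate.

```python
def read_write_split(select_per_sec, dml_per_sec):
    """Return (read_fraction, write_fraction) from SelectThroughput and DMLThroughput.

    Both inputs are per-second averages for the same period. They need not
    add up to the Queries metric, which may also count other statements.
    """
    total = select_per_sec + dml_per_sec
    if total <= 0:
        return 0.0, 0.0  # idle period: avoid dividing by zero
    return select_per_sec / total, dml_per_sec / total
```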

Error Rate — Aurora has useful metrics for some types of ‘errors’, though real SQL errors still require the Performance Schema (see below).

For login failures, you can use the LoginFailures metric, which counts users unable to log in because the server reached max connections, plus password failures (which can signal a hack attempt).

Another useful metric is BlockedTransactions, which I believe counts transactions blocked by locks, so any big rise in it suggests you have locking issues.
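Because LoginFailures and BlockedTransactions are most useful as a signal of change, one simple approach is to compare the latest value against a recent baseline. The Python sketch below uses the median of a trailing window; the factor of 3 and the floor of 5 are hypothetical defaults to tune for your own traffic, not values from AWS.

```python
from statistics import median


def is_spike(history, current, factor=3.0, min_count=5):
    """Flag a big rise in a per-period count such as LoginFailures or
    BlockedTransactions.

    history:   recent values for the same metric (the baseline window)
    current:   latest value
    factor:    how many times the baseline median counts as a "big rise"
    min_count: absolute floor so that going from 0 to 1 is not a spike
    """
    baseline = median(history) if history else 0.0
    return current >= min_count and current > factor * baseline
```

The median keeps one earlier burst in the window from inflating the baseline, which a plain average would not.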

If you turn on the Performance Schema, you can get a global error rate that includes SQL, syntax, and almost all other errors returned by MySQL. This is a counter, so you need to apply delta processing. The query is:
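Because the Performance Schema error total only ever grows, what you alert on is its rate of change between polls. Whatever query you use to read the counter, a minimal Python sketch of the delta step could look like this. It assumes you poll the counter periodically and record (timestamp, value) pairs, and it treats any decrease as a counter reset.

```python
def counter_rates(samples):
    """Convert a cumulative counter into per-second rates (delta processing).

    samples: list of (unix_timestamp_seconds, counter_value) in time order,
             e.g. periodic reads of the Performance Schema error total.
    A drop in the counter is treated as a reset (e.g. a server restart or
    a truncated summary table), and that interval is skipped.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 <= t0 or v1 < v0:
            continue  # bad timestamp order or counter reset: no valid delta
        rates.append((v1 - v0) / (t1 - t0))
    return rates
```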

Latency — Aurora provides this directly in CloudWatch via two metrics: SelectLatency and DMLLatency. The former is probably the more important one, as it’s usually where you’ll see app performance issues first, so if you can only alert on one, use that.
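If SelectLatency is the one alert you set up, here is a minimal Python sketch that builds keyword arguments for boto3's CloudWatch `put_metric_alarm` call. It assumes the instance-level `DBInstanceIdentifier` dimension. The alarm name pattern, the 3-of-5 one-minute evaluation window, and the example threshold are all hypothetical choices, and the SNS topic ARN is optional.

```python
def select_latency_alarm(instance_id, threshold_ms, sns_topic_arn=None):
    """Build kwargs for boto3's cloudwatch.put_metric_alarm() on SelectLatency.

    Fires when average SelectLatency exceeds threshold_ms in 3 of the last
    5 one-minute periods. The name pattern and periods are placeholders.
    """
    alarm = {
        "AlarmName": f"{instance_id}-select-latency-high",
        "Namespace": "AWS/RDS",
        "MetricName": "SelectLatency",  # reported in milliseconds
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 3,
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }
    if sns_topic_arn:
        alarm["AlarmActions"] = [sns_topic_arn]
    return alarm
```

Passing the result to `boto3.client("cloudwatch").put_metric_alarm(**alarm)` would create or update the alarm. Requiring 3 of 5 breaching datapoints is meant to ride out single-minute blips while still catching sustained slowdowns.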

Utilization — There are many ways Aurora can run out of capacity, but it’s easiest to track the underlying CPU % and I/O rates, measured by CPUUtilization and ReadIOPS. WriteIOPS is also available but is usually less indicative of problems, whereas reads jump with higher load or worse SQL.
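A minimal Python sketch of combining the two utilization metrics: CPU is checked against a fixed ceiling, while ReadIOPS is compared to its own recent baseline, since a jump in reads is the more telling sign. The 80% ceiling and the 2x factor are hypothetical defaults, not AWS recommendations.

```python
from statistics import median


def utilization_alerts(cpu_pct, read_iops, read_iops_history,
                       cpu_limit=80.0, iops_factor=2.0):
    """Return human-readable reasons the instance looks short on capacity.

    cpu_pct:           latest CPUUtilization (percent)
    read_iops:         latest ReadIOPS
    read_iops_history: recent ReadIOPS values forming the baseline
    cpu_limit, iops_factor: hypothetical thresholds, tune for your workload
    """
    reasons = []
    if cpu_pct >= cpu_limit:
        reasons.append(f"CPUUtilization {cpu_pct:.0f}% >= {cpu_limit:.0f}%")
    if read_iops_history:
        baseline = median(read_iops_history)
        if baseline > 0 and read_iops > iops_factor * baseline:
            reasons.append(
                f"ReadIOPS {read_iops:.0f} is over {iops_factor:g}x "
                f"the baseline of {baseline:.0f}"
            )
    return reasons
```

An empty list means neither check fired.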

As you can see, Aurora is far easier to monitor than plain MySQL or even RDS, as it provides direct metrics for most of the things we care about.

PostgreSQL Aurora

All of the above is focused on MySQL, since that’s where our experience is, though judging from the Aurora metrics, most or all of these also apply to PostgreSQL.