How We Reduced Our Amazon Redshift Cost by 28%

September 18th 2018

by
Stefan Gromoll

Reading time:
7 Min

Scale down, if you dare

Amazon Redshift makes it easy to scale. Need more computing power or disk space? Simply add a few nodes with a click or two in the AWS Console and after a couple of hours of Read-Only mode your data warehouse will be ready to crunch your valuable and growing dataset faster than ever.

This ability to scale with minimal effort is also great when it comes to overcoming expected (or unexpected) short-lived spikes in cluster load. Need to power through that re-processing of a particularly large dataset? Add a few nodes. Need more disk space so you can do a deep copy of your largest table without running out of disk space? Clickity-click, done.

But while its just as easy to scale a cluster down as it is to scale a cluster up, too often the decision to remove nodes is a harder one to make since it can result in an overloaded cluster. Who needs frantic support requests when queries slow to a crawl or your cluster runs out of disk space? It’s too easy to throw money at the problem and take the attitude, don’t fix what ain’t broke.

And to make things more time-consuming, changing your cluster size requires re-tuning (or at least reviewing) your WLM configuration, since the total memory and concurrency available for your queries will change. (We’ve covered tuning WLMs in another blog post)

As a result, many Redshift customers run with over-provisioned clusters simply because they worry that removing nodes will cause their cluster to go down in flames, resulting in late hours making data pipeline changes or waiting for the cluster to scale back up.

So too often we see Amazon Redshift users spend valuable dollars from their budgets on resources they may not need.

How node count effects cluster performance

Before we look at how to evaluate whether nodes can be removed from your cluster, we should first look at how changing the number of nodes in a cluster affects its performance. Intuitively we know that adding nodes should do something like “increase cluster performance.” But what exactly does that mean, and what should you measure to see if the number of nodes you have in your cluster is the right number?

In the context of your queries, adding nodes to a Redshift cluster does multiple things:

It gives your cluster more storage space. For example, adding a dc2.large node will give you an additional 160gb of storage (which will be utilized depending on your tables’ DISTKEY setup). This additional disk space is used not just for storing your data, but also as temporary storage when running disk-based queries (e.g. large joins).

It gives you more memory to allocate to your queries. This means you can either increase the amount of memory allocated per query, or increase the concurrency of your queries (i.e. increase the WLM slot count). If you are seeing a large % of disk-based queries, you may be able to optimized your WLMs after adding nodes to try to to move some of those queries into memory.

It gives your queries more CPU. This means that you can increase your query concurrency in your WLM if your queries are CPU-bound (though more often than not concurrency is limited by the amount of memory you have available to allocate per slot rather than available CPU).

It gives you more IO for getting data into and out of your cluster. Redshift is very efficient at copying data to and from S3 because it parallelizes transfers among the slices (one virtual CPU, share of memory, and disk) in your cluster. So adding nodes may make your COPYs and UNLOADs faster.

Measure twice, cut once

So how do you optimize your Redshift spend while reducing the risk of bringing your pipeline to a screeching halt by accidentally under-powering your cluster? Measure, measure, measure. (Ok, that’s measuring three times, but measurement really is that important).

Since we now understand how adding a node impacts cluster performance, we can look at the specific metrics you should measure when considering whether you can remove nodes from your Redshift cluster. These are:

Current and historic disk utilization

WLM concurrency and queue wait-times

% disk-based queries

Copy and Unload times

In the following section we’ll walk through an example analysis of these metrics for one of our own Redshift clusters to see if we can remove some nodes to save money.

A case study

The cluster we’re going to analyze has seven dc2.large nodes, and we want to see if we can remove some nodes from it.

1. Disk Utilization

Our first step is to see if we need the nodes simply to store our data. Since Amazon Redshift’s disk, memory, and CPU all scale together (in units of nodes), we can’t remove a node if we need that node just for data storage. However, in that case we should consider other solutions to reduce disk utilization so that we can remove a node. For example, changing compression encodings or pruning our tables – for more detail, see our top 14 performance tuning techniques.

We start by looking at the historical disk utilization of the cluster. You can view this in the AWS Redshift dashboard under PercentageDiskSpaceUsed, or in your Intermix dashboard under Storage Analysis. In this case, we’re looking at the last day and a half of disk utilization, which we know represents the current typical workload for this cluster (you may want to look at a longer period of time, depending on what you consider a representative time period is):

We can see that our cluster has used between 30% and 58% of its disk space during this period. Since our cluster has seven dc2.large nodes, the total cluster storage is 1.12 Tb (160 GB * 7). So we’re using between 336 Gb (30%) and 650 Gb (58%) of storage at any given time.

This means that if we want to keep our disk utilization no higher than 70-80% (a good rule of thumb to leave headroom for query execution), we can potentially reduce our cluster size from seven to five, resulting in a total storage of 5 * 160 Gb = 800 Gb. Our peak disk utilization of 650 Gb will be approximately 81%, which is acceptable since the peak is short-lived.

Note that these calculations are all approximate, since your cluster may use a different amount of disk for disk-based queries after the resize, so it is important to look at the characteristics of your disk-based queries too (see below). However, it gives a good starting point for estimating the minimum number of nodes for your cluster.

So now that we know we may only need five nodes to support our typical data storage needs for the cluster, lets move on to our WLM concurrency.

2. WLM Concurrency

The next step is to look at the query concurrency in your cluster to see if the cluster is operating without significant queue wait time, i.e. that queries aren’t sitting around waiting for cluster resources to become available. In your Intermix.io dashboard, you can find this under Throughput Analysis:

This example shows that our four WLM queues are operating with very little queue wait time (a peak queue wait time of around 0.15 sec, which is well below what is acceptable for the workload on this cluster). So this means that as long as we can keep sufficient memory in each queue after the resize, we should be ok. Which brings us to…

3. % disk-based queries

Next we evaluate whether the cluster has a large percentage of disk-based queries. To do that, we look at the Memory Analysis page for our cluster in the Intermix dashboard:

The left charts show the total amount of memory consumed by the queries running in each WLM queue at any given time, and the right charts show the number of disk-based queries. By looking at the breakdown of memory usage per queue, we can make the following observations:

No queues have more than 1-2% disk-based queries. Typically we try to keep % disk-based queries < 10%, so we have plenty of headroom.

The non disk-based queries in queues 1 and 3 typically use 30-80 mb of memory each, so we have some headroom in the amount of memory allocated per slot (87mb for Queue 1 and 137mb for queue 3). This means that while we expect our % of disk-based queries to increase after removing two nodes, reducing the total memory by 28% (2 out of 7 nodes) shouldn’t be a change that brings the cluster to its knees (i.e. we’re not running right at the limit of our memory per slot).

Most of the disk-based queries in queue 3 are due to a few very large (8 gb) queries. Changing the total memory of the cluster by reducing nodes won’t affect this.

4. COPY and UNLOAD times

The final metric to look at is the COPY and UNLOAD times, since reducing nodes will reduce the number of slices available for parallel data transfers into and out of the cluster. What we want to look for here is that our COPYs and UNLOADs are happening quickly enough such that reducing the available parallelism by ~30% isn’t likely to have an impact. In our case, we look at all COPYs and UNLOADs for the cluster, and compare the average time to our internal requirement for the workflows running on this cluster. For this cluster, we need COPYs and UNLOADs to finish in less than 5 minutes on average:

Since COPYs and UNLOADs are both finishing in around 6 seconds, removing 2 of the 7 nodes shouldn’t cause a performance problem for transfers.

5. Pulling the trigger

After studying the metrics described above, we concluded that we canremove two nodes from our cluster without impacting our workloads. And lo and behold, we did the resize, retuned our WLMs, and re-reviewed our Storage, Throughput, and Memory metrics to verify that all is well–and all was.

We reduced the cost of operating this cluster by 28% with no negative impact on our pipeline performance!

When it comes to making infrastructure changes that could impact mission-critical data pipelines, the more information you have up-front the more secure you’ll feel. And the more secure you feel, the quicker you can iterate.