3 Steps for Fixing Slow Looker Dashboards with Amazon Redshift

July 10th 2018

by
Lars Kamp

Reading time:
6 Min

Looker is a powerful tool for self-service analytics. A lot of of companies use Looker on top of Amazon Redshift for business intelligence. It helps companies derive value from their data by making it it easy to create custom reports and dashboards.

“Slow Looker dashboards” is one of the most frequent issues we hear with Amazon Redshift. Looker users tell us that some queries that should take seconds to execute takes minutes. Dashboards seem to “hang”.

The problem is probably a mismatch between your Looker workloads and your Amazon Redshift configuration. In this post, I’ll explain the causes of slow dashboards, and how to fine-tune Amazon Redshift to get blazing fast Looker dashboards.

The Problem: Slow Looker Dashboards

A lot of times, analytics stacks grow out of an experiment. Somebody spins up an Amazon Redshift cluster, builds a few data pipelines, and then connects a Looker dashboard to it. The data is popular, so you set more people up with dashboards. And at some point, the problems start.

Complains can range from slow dashboards to long execution times for persistent derived tables (PDTs). In some cases, these issues can even appear at the very start of the journey. Consider this post on the Looker Discourse. The complaint is that “first-run query performance is terrible”.

Image 1: Support Request on Looker Discourse

The key to solving bottlenecks lies in balancing your Looker workloads with your Redshift set-up. So let’s first understand how Amazon Redshift processes queries. Then we’ll look closer at how Looker generates workloads.

Amazon Redshift Workload Management and Query Queues

A key feature in Amazon Redshift is the workload management (WLM) console. Redshift operates in a queuing model. In the WLM console you can set up different query queues, and then assign a specific group of queries to each queue.

For example, you can assign data loads to one and your ad-hoc queries to another. By separating your workloads, you ensure that they don’t block each other. You can also assign the right amount of concurrency aka “slot count” to each queue.

The default configuration for Redshift is one queue with a concurrency of 5. It’s easy to not notice the WLM and the queues when getting started with Redshift. But as your query volumes grows, and you run more than 5 concurrent queries, your queries get stuck in the queue. They’re waiting for other queries to finish. When that happens, you’re experiencing slow dashboards.

Understanding LookML and Persistent Derived Tables

There are two components of the Looker platform – LookML and persistent derived tables (“PDT”) – that make it easy for a company to explore its data.

But we’ll see how they can also generate high query volume with heavy workloads that can slow down a Redshift cluster.

LookML – Abstracting Query Structure from Content

LookML is a data modeling language that separates query structure from content. The query structure (e.g. how to join tables) is independent of the query content (e.g. what columns to access, which functions to compute). A LookML project represents a specific collection of models, views and dashboards. The Looker app uses a LookML model to construct SQL queries and run them against Redshift.

The benefit of separating structure from content is that business users can run queries without writing SQL. That abstraction makes a huge difference. Analysts with SQL skills define the data structure once, in one place (a LookML project). Business users then leverage that data structure to focus on the content they need. No need from them to write complex SQL statements. Looker uses the LookML project to generate ad-hoc queries on the fly. Image 2 describes the process behind LookML.

Persistent Derived Tables

Some Looks create complex queries that need to create temporary tables, e.g. to store an intermediate result of a query. These tables are ephemeral, and the queries to create the table run every time a user requests the data. It’s key for derived tables to perform so they don’t put excessive strain on a cluster.

In some cases, where a query takes a long time to run, creating a so-called PDT (“persistent derived table”) is the better option. Looker writes PDTs into a scratch schema in Redshift, and allows to refresh the PDT on a set schedule. When the user requests the data from the PDT, it has been already created. So compared to temporary tables, PDTs reduce query time and database load.

There’s a natural progression from single queries to PDTs when doing LookML modeling. When you start, you connect all tables into a LookML model to get basic analytics. To get new metrics or roll-ups, and to iterate quickly, you start using derived tables. Then you set it up to persist and leverage PDTs to manage the performance implications.

The Impact of LookML and PDTs on Query Volume

The separation of structure from content via LookML can have dramatic implications for query volume. The SQL structure of one productive analyst can be re-used by countless users.

A Simple Math Example

Consider a simplified scenario with a 1-node Amazon Redshift cluster, 5 business users and a single LookML project. Each user has 10 dashboards with 20 Looks (i.e. a specific chart). Behind each Look is a single query. With each refresh, they will trigger a total of 5 (users) * 10 (dashboards) * 20 (looks) = 1,000 queries.

With a 1-node Amazon Redshift cluster and a default WLM set-up, you will process 5 queries at a time. You’ll need 1,000 / 5 = 200 cycles to process all queries. While 5 queries process, all other queries have to wait in the queue. Image 3 shows a screenshot from the intermix.io dashboards that shows what queue wait time can look like.

Image 3: Queue wait time for Looker Queries in intermix.io

Let’s assume each query takes 15 seconds to run. For all queries to run, we’re looking a total of 200 * 15 = 3,000 seconds (50 minutes). In other words, your last 15-second query will finish running after 50 minutes.

Even if you add a node now, i.e. you double the amount of queries you can process, you’re only cutting that total wait time in half. That’s still 25 minutes.

Now let’s also add PDTs into the mix. Our PDTs will generate more workloads, often with complex, memory-intensive and long-running queries. The PDTs then compete with the already-slow ad-hoc queries for resources.

A remedy can be to throttle the number of per-user queries. Or to reduce the row limit for queries or allow fewer data points. But the whole point of using Looker is to derive meaningful conclusions from huge amounts of data. Imposing query and row limits, or using fewer data points doesn’t make sense.

Download the Top 14 Performance Tuning Techniques for Amazon Redshift

One part of the answer is, of course, to add more nodes to your cluster. But it’s easy to overspend, so the other part is to configure your WLM the right way, and identify workloads in Looker that may be unnecessary, such as high refresh rates. You need to configure your WLM in a way that Looker queries and PDTs have enough concurrency and memory.

The 3 steps:

Optimize your Amazon Redshift WLM for your Looker workloads

Optimize your Looker workloads

Optimize your Amazon Redshift node count

By following these 3 steps, you’ll also be able to optimize your query speeds, your node count and along with it your Redshift spend.

create one queue per workload type and define the appropriate slot-count and memory %-age for each queue

The same logic applies for your Looker queries. Have your Looker queries run in a queue that’s separate from your loads and transforms. That will allow you to define the right concurrency and memory configuration for that queue. Enough concurrency means each Looker query will run. Enough memory means you minimize the volume of disk-based queries.

In your intermix.io dashboard, you can see the high watermark / peak concurrency for your Looker queries. You’ll also see how much memory they consume, telling you what memory percentage you need to assign to each slot.

By using the right setting, you can balance your Redshift settings with your Looker workloads. This step alone will give you much faster dashboards.

Step 2: Optimize Your Looker Workloads

What is a “redundant Looker workload”? It’s a query that’s running but doesn’t need to. For example, if users are refreshing their dashboards more frequently than they need to. By reducing that refresh rate, your Redshift cluster will have to process less queries. That drives down concurrency.

Image 4: Identifying High-Volume Looker Users in intermix.io

With our app tracing feature, you can see which one of your Looker users are driving most of the query volume, down to the single Look. See Image 5 with customer feedback via Slack during our private beta for app tracing.

Image 5: Finding high volume Looker users

The final step then is to see if you need to add more nodes, or if there’s an opportunity to reduce your node count.

Step 3: Optimize Your Amazon Redshift Node Count

Once you’ve squeezed all the juice of your WLM, it’s time to adjust your node count. If you’re still encountering concurrency issues or disk-based queries, it may be time to add more nodes. In most cases though we see how there’s an opportunity to reduce node count and save on your Redshift spend.

Consider the case of our customer Remind, a messaging app for schools. By configuring their WLM they managed to reduce their Amazon Redshift spend by 25%.

That’s it! There are a few more tweaks you can do that will improve performance. Examples are setting your dist / sort keys for your PDTs, or moving some PDTs into your ELT process. But the 3 steps in this post will give you the biggest immediate return on your Looker investment

So if you’re ready to scale, get fast dashboards, and handle more data and users with Looker, sign-up for a free trial. We’ll help you streamline your WLM queues. And as your company grows, you can be confident that your Looker dashboards will always be lightning fast.