How we handle data

The Data Science & Infrastructure team at Plaid has grown significantly over the past few months into a team whose mission is to empower Plaid with a data-first culture. This post is a look at how we rebuilt internal analytics around rollup tables and materialized views at Plaid.

DSI: It’s a lifestyle

After building a scalable monitoring pipeline with Kinesis, Prometheus, & Grafana and beefing up our ETL efforts with Airflow, Plaid was in the midst of a transition to Periscope Data as a business intelligence tool. After testing and analyzing, we had decided to use Periscope for tracking metrics around our core product usage, go-to-market strategy, and internal operations for customer support and project management.

Snapshot of a Periscope Dashboard

At Plaid, we take a lot of pride in being a data driven company, and as such, the DSI team took on the responsibility for getting the data into our AWS Redshift data warehouse which powers the charts connected to Periscope. We decided to also own the query runtime performance of the SQL statements being written in Periscope to ensure our data collection efforts were maximally useful to the company. The tool was being rapidly adopted internally by both a collection of power users who were proficient in SQL, and by less experienced folks who had just started to get their feet wet.

The original use-case for our Redshift cluster wasn’t centered around an organization-wide analytics deployment, so initial query performance was fairly volatile: the tables hadn’t been setup with sort and distribution keys matching query patterns in Periscope, which are important table configuration settings for controlling data organization on-disk, and have a huge impact on performance. We also hadn’t broadly invested into Redshift settings like workload management (WLM) queues and the data stored in the cluster was a comprehensive dumping ground of data, not a schema tailored for analytics.

With “data” in our team name, we couldn’t go further in this post without establishing some key metrics. Fortunately, Periscope offers a great set of meta-tables related to the usage of the tool, for example tables showing which charts were being run by different users and how long the query ran from Periscope’s point of view.

Initial discovery

Looking into the data we saw that the p90 runtime (the sparkline in the top right corner in the image above) was fairly volatile, ranging from high single digits to tens of seconds week to week. More so, this first view looked at all the queries being run, but we wanted to value queries that were important to the success of the business. We set up some additional constraints:

We only considered weekday data points, and only after the entire day’s worth of data was available. We didn’t count weekends because the cluster usage pattern differed too much versus weekdays: since we were tracking our statistics daily, when query volume dropped on the weekends it created visual noise that detracted from analyzing normal weekday patterns.
We only counted queries being run by a user — not queries that Periscope was running in the background to keep charts up-to-date.
We excluded queries being run by the DSI team members themselves, since we were a large, noisy set of data points and didn’t have the same distribution of runtimes as many of the other users.

User-run queries turned out to be quite contentious! The runtimes look much slower than the general population of queries. For our first foray into critically analyzing our cluster’s performance, we targeted two main categories of potential problems:

Redshift cluster settings and data layout: Only some tables had their on-disk data distribution defined through these two keys, but others had no sort key and no purposeful distribution style (like sort and distribution key settings, for example.)
Outlier user behavior: we expected some users to be running larger queries than others. We ran segmentation analysis by users, dashboards, and tables, to look for readily available patterns where performance was drifting upwards.

Solving for X

Off the bat we knew the first line-item was an issue in several circumstances; some tables had infinite rows and did not have a sort key set and as such, were highly likely to be less efficient than they should be. After manually reading through the SQL in several charts and interviewing the more prolific query authors, we settled on deploying timestamps as the sort key everywhere, since often the initial query step was to hone in on a specific time-slice of data.

Similarly, we modified distribution keys to work with our join and aggregation patterns. There are many great blog posts on this kind of work, and the AWS Redshift Documentation has lots of great pointers as well. While this was a useful endeavor and had some impact on query speed, it was relatively small and not going to be the overarching solution to our run-time problems.

Output from stl_alert_event_log highlights some table-specific issues

We conducted additional experiments using recommendations from the AWS support team, using both the Redshift built-in performance alert infrastructure like stl_alert_event_log and other experiments exploring schema changes like splitting tables into different monthly tables like logs_2018_01 , logs_2018_02, logs_2018_03, and union all-ing them back together. These solutions weren’t the ideal fit as they lacked context around our use-cases. Adding more cluster discipline by running vacuum and analyze on each table on a regular basis, and setting up better work-load management queues were also small wins, but we found big wins in the form of optimizing the company’s holistic approach to running queries.

In order to get a better understanding of the patterns specific to our company, we would first cross join all the queries against all the tables to inspect the underlying sql for matching names. This was made easy in part because we forbade usage of the public schema and table names tended to be lengthy enough to avoid false-positives.

-- Match each SQL query with any table used in the query
-- by searching the text for the schema_name.table_name of the table
select
id
, table_name
from
charts
cross join [tracked_tables_selection]
where
charts.sql like '%' || table_name || '%'

Discoveries

This led us to our first critical discovery: 95% of the slow queries came from 5% of the tables. Our Pareto distribution was quite steep — we had compounding factors as most interesting tables were the ones that also had the most data points. Logs from the core application behind Plaid were all being added to a single, large table, and users were writing many similar filters and aggregation over the dataset to analyze different time-periods, customer behavior, and error types.

Once we understood the problem, the technical solution was fairly straight-forward: pre-compute the common elements of the queries by creating rollup tables, materialized views, and pre-filtered & pre-segmented tables, and change user queries to run against these derivative tables. We already used Airflow for a good amount of data ETL management, so it was an easy choice to start materializing views and rollups. The associated DAGs are simple to implement; just have Airflow run the SQL query every 15 minutes or so, and then swap out the old table for the new one.

Creating infrastructure for derivative tables let us take granular data and produce something more meaningful to analyze. For example, we can take a raw logging table that has a message per HTTP request, group the data by the type of the request, the client, and the resulting status information, and bucket the data into hourly counts.

The harder part was getting folks internally to migrate onto new structures once we identified the translation pattern. We needed to re-write queries that looked like this:

select
[current_time_ms:hour]
, count(1)
from
logs_raw.apiv1_production
where
[current_time_ms=14days]
group by
1
order by
1

to a similar query that looks like this:

select
_timestamp
, sum(count)
from
logs_hourly.apiv1_production
where
[_timestamp=14days]
group by
1
order by
1

They’re quite similar, and with a deterministic pattern: change the schema and table name, migrate to the _timestamp moniker we had centralized around, and change count(1) to sum(count) to take the aggregate of our hourly rollup aggregation. While we had many different kinds of derivative tables, the technical translation was usually an easy task.

The harder part was the operational deployment of these changes — we had grown to hundreds of dashboards with thousands of charts being used regularly for internal analytics, and it was challenging to both roll-out a change that was far-reaching enough that we could see movement in our metrics, but non-disruptive to the point where it would upend an individual’s workflow during migration.

To combat this, we setup a dashboard that would allow us to track the individual dashboards and charts contributing the slowest runtimes, and then cross-referenced against the distribution of run-times by users to allow us to more narrowly align with individual users internally — making changes on a user-by-user basis instead of a table-by-table basis facilitated greater transparency and communication of the procedural deployment.

Diving into where slow queries were hiding

This worked better not just through cross-team collaboration, it also allowed us to watch the system grow and evolve around our changes. As the tool became more accessible with not only faster query speeds but additional data sources, more teams started using it to track their KPIs and sometimes new authors would generate additional slow-down that we could then work through together.

Eventually, we got to a place where we were comfortable with respect to power-user query runtimes, and began shifting our focus towards involving folks outside of DSI in making and managing these derivative structures, to see if the system could become self-sustaining from an infrastructure perspective. The system had largely stabilized and we had removed nearly all of the “unbearably long” queries that were prevalent before this effort started:

The total migration: 10x the number of queries, 1/10th the query runtime

To anyone else faced with the question of deploying an analytics warehouse at a small company, WLM queues, vacuum analyze, on-disk data layout are all things you’ll need to look into. However, rather than trying to meet a performance bar with behind-the-scenes optimizations, we recommend first understanding your users’ query patterns and use cases, making sure the data is available to them in the most tightly-packed format possible through rollup tables, and reducing query duplication with materialized views and pre-segmented tables.

While we’re always looking for ways to continue improving our processes, we feel good about the initial progress of our query performance and are constantly embarking on new ways to make Plaid a data-first company. If Redshift performance, Spark cluster management, or real-time analytics through the ELK stack is right up your alley, come join us!

Special thanks to Lars Kamp from intermix.io, a great tool for Redshift optimization, for brainstorming Redshift performance with me back in March. And thanks to Angela Zhang for all the great feedback on this post.