redshift

Although Amazon Redshift is a fantastic data warehousing product, it lacks several of the features available in other database systems. One such feature is materialized views, which are just what they sound like... views that have been pre-calculated and physically stored on disk for faster querying, and which are updated on request.

No, not that kind of View.

Materialized views are very useful any time you want to have the results of complex queries readily and quickly available, and you don't mind if the data is a few hours stale. For instance, suppose you have an executive dashboard that displays your monthly gross margin for your highly successful app, which has millions of users. The query that does those calculations runs for 20 minutes (way longer than an exec is going to want to wait), but it only needs to be updated once a month. Materialized views are the perfect solution. Calculate once, cache the data, and reference the cache on-demand. Refresh when needed.

The "Redshift View Materializer", now available on GitHub, is a simple Python script that creates tables containing the results of arbitrary SQL queries on-demand. Simply set the script to run as a cron-job whenever you want your tables re-created, and you'll end up with a reasonably close approximation of materialized views. The script is in the public domain, but I'd love if you'd contribute if you make any improvements, bug fixes, or additions!

Last week, I looked into options for increasing the performance of an Amazon Redshift cluster that was currently using 10 large dense compute nodes (dc1.large). While investigating, I noticed that a cluster of 32 dc1.large nodes (the maximum for that node type) had the same number of CPUs, the same amount of storage, and a comparable amount of RAM to a cluster of 2 dc1.8xlarge nodes (the minimum for that node type), while offering significant cost savings for anything less than a 3-year term. This got me wondering... if you want a cluster that's around that size, which is a better bet for performance?

We'll get the answer out of the way early for the tl;dr crowd.

The results of testing out three diverse queries on various-sized Redshift clusters.

The cluster of 32 dc1.large nodes provided substantially (13%, on average) faster performance for all but the shortest query in my test batch. Given the superior performance and lower cost, it was the obvious choice for a cluster of this size. Whenever more performance is called for, it will make sense to switch to a cluster of 3 dc1.8xlarge nodes...

My company has a subscription-based business model, which means we spend a lot of time analyzing customer churn. We wanted to include Kaplan-Meier survival curves in some of our executive dashboards, but neither our database (Redshift) nor any of our commonly used dashboarding tools (Tableau, Periscope, etc.) provided the necessary functionality. We could, of course, have pulled data out of the warehouse, analyzed it in R or Python, and pushed it back up, but that's pretty complicated. So we went looking for a better solution.

Based on all available evidence, the survival curve for aliens singing disco music hits 0 at about 43 seconds (N = 1).

As you likely guessed from the title of this post, that better solution involved writing our own code for calculating a Kaplan-Meier estimator in SQL. In this post, I'll be walking through our strategy step-by-step, including the SQL code for calculating the estimators and making that code reusable in Periscope. Let's do this!