With Amazon Redshift, you can implement any data model that's standard throughout the industry. Whether your data model is third normal form (3NF), star, snowflake, denormalized flat tables, or a combination of these, Amazon Redshift's unique table properties let your complex analytical workloads perform well over multipetabyte data sets.

In practice, I find that the best way to improve query performance by orders of magnitude is by tuning Amazon Redshift tables to better meet your workload requirements. This five-part blog series will guide you through applying distribution styles, sort keys, and compression encodings and configuring tables for data durability and recovery purposes. I’ll offer concrete guidance on how to properly work with each property for your use case.

Prerequisites

If you’re working with an existing Amazon Redshift workload, the Amazon Redshift system tables can help you determine the ideal configurations. Querying these tables for the complete dataset requires cluster access as a privileged superuser. You can determine whether your user is privileged by checking the usesuper column in the result of the following query:
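As a minimal sketch, the pg_user catalog table exposes this flag for the current session user:

```sql
-- Check whether the current session user is a superuser
SELECT current_user, usesuper
FROM pg_user
WHERE usename = current_user;
```

A value of true in the usesuper column indicates that your user has superuser privileges.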

In Amazon Redshift, a table rebuild is required when changing most table or column properties. To reduce the time spent rebuilding tables, identify all of the necessary changes up front, so that only a single rebuild is necessary. Once you’ve identified changes, you can query one of our amazon-redshift-utils view definitions (v_generate_tbl_ddl) to generate the existing DDL, for further modification to implement your identified changes.
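Assuming you’ve installed the v_generate_tbl_ddl view from the amazon-redshift-utils repository into an admin schema, retrieving a table’s existing DDL might look like the following sketch (the schema and table names here are placeholders):

```sql
-- Generate the existing DDL for a single table
SELECT ddl
FROM admin.v_generate_tbl_ddl
WHERE schemaname = 'public'     -- placeholder schema name
  AND tablename  = 'my_table'   -- placeholder table name
ORDER BY seq;
```

The seq column orders the DDL fragments so that the statement reassembles correctly.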

I’ve also built on the system view SVV_TABLE_INFO with a new view, named v_extended_table_info, which offers extended output that makes schema and workload reviews much more efficient. I’ll refer to the result set returned by querying this view throughout the series, so I recommend creating the view in the Amazon Redshift cluster database you’re optimizing.

For the sake of brevity throughout these topics, I’ll refer to tables by their object ID (OID). You can get this OID in one of several ways:
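For example, one way to look up a table’s OID is to query the pg_class catalog directly (the schema and table names below are placeholders):

```sql
-- Look up the OID for a table by schema and name
SELECT c.oid
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public'    -- placeholder schema name
  AND c.relname = 'my_table'; -- placeholder table name
```

The table_id column in SVV_TABLE_INFO exposes the same identifier.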

Prioritization

This series walks you through a number of processes that you can implement on a table-by-table basis. It’s not unusual for clusters that serve multiple disparate workloads to have thousands of tables. Because your time is finite, you’ll want to prioritize optimizations against the tables that are most significant to the workload, to deliver a meaningful improvement to the overall cluster performance.

If you’re a direct end user of the Amazon Redshift cluster, or if you have well-established communication with end users, then it might already be obvious where you should start optimizing. Perhaps end users are reporting cluster slowness for specific reports, which would highlight tables that need optimization.

If you lack intrinsic knowledge of the environment you’re planning to optimize, the scenario might not be as clear. For example, suppose one of the following is true:

You’re an external consultant, engaged to optimize an unknown workload for a new client.

You’re an Amazon Redshift subject matter expert within your circles, and you’re often approached for guidance regarding Amazon Redshift resources that you didn’t design or implement.

You’ve inherited operational ownership of an existing Amazon Redshift cluster and are unfamiliar with the workloads or issues.

Regardless of your particular scenario, it pays to begin the optimization by determining where your time is best spent.

I’ve found that scan frequency and table size are the two metrics most relevant to estimating table significance. The following SQL code helps identify a list of tables relevant to each given optimization scenario, based on characteristics of the recent historical workload. Each of these result sets is ordered by scan frequency, with the most frequently scanned tables first.

Scenario: “There are no specific reports of slowness, but I want to ensure I’m getting the most out of my cluster by performing a review on all tables.”

-- Returns table information for all scanned tables
SELECT * FROM admin.v_extended_table_info
WHERE table_id IN (
    SELECT DISTINCT tbl FROM stl_scan WHERE type = 2
)
ORDER BY SPLIT_PART("scans:rr:filt:sel:del", ':', 1)::int DESC,
         size DESC;

Scenario: “The queries running in transaction with XID=23200 are slow.”

-- Returns table information for all tables scanned within xid 23200
SELECT * FROM admin.v_extended_table_info
WHERE table_id IN (
    SELECT DISTINCT tbl FROM stl_scan
    WHERE type = 2
    AND query IN (SELECT query FROM stl_query WHERE xid = 23200)
)
ORDER BY SPLIT_PART("scans:rr:filt:sel:del", ':', 1)::int DESC,
         size DESC;

-- Returns table information for all tables scanned by "etl_user"
-- between 02:00 and 04:00 on 2016-09-09
SELECT * FROM admin.v_extended_table_info
WHERE table_id IN (
    SELECT DISTINCT tbl FROM stl_scan
    WHERE type = 2
    AND query IN (
        SELECT q.query FROM stl_query q
        JOIN pg_user u ON u.usesysid = q.userid
        WHERE u.usename = 'etl_user'
        AND q.starttime BETWEEN '2016-09-09 02:00' AND '2016-09-09 04:00')
)
ORDER BY SPLIT_PART("scans:rr:filt:sel:del", ':', 1)::int DESC,
         size DESC;

Scenario: “Our reporting workload on tables in the ‘sales’ schema is slow.”

-- Returns table information for all tables scanned by queries
-- from "reporting_user" that scanned tables in the "sales" schema
SELECT * FROM admin.v_extended_table_info
WHERE table_id IN (
    SELECT DISTINCT tbl FROM stl_scan
    WHERE type = 2 AND query IN (
        SELECT DISTINCT s.query FROM stl_scan s
        JOIN pg_user u ON u.usesysid = s.userid
        WHERE s.type = 2 AND u.usename = 'reporting_user' AND s.tbl IN (
            SELECT c.oid FROM pg_class c
            JOIN pg_namespace n ON n.oid = c.relnamespace
            WHERE n.nspname = 'sales'
        )
    )
)
ORDER BY SPLIT_PART("scans:rr:filt:sel:del", ':', 1)::int DESC,
         size DESC;

Amazon Redshift Engineering’s Advanced Table Design Playbook

About the author

Zach Christopherson is a Palo Alto based Senior Database Engineer at AWS. He assists Amazon Redshift users from all industries in fine-tuning their workloads for optimal performance. As a member of the Amazon Redshift service team, he also influences and contributes to the development of new and existing service features. In his spare time, he enjoys trying new restaurants with his wife, Mary, and caring for his newborn daughter, Sophia.