This post will work through how to understand the root cause of these errors and how to fix them.

Background

Amazon Redshift implements a scheme called serializable isolation. It allows many transactions to write to a single table at the same time while preserving correctness, which is how Redshift achieves high concurrency for write operations.

Here’s how it works:

Take two transactions, T1 and T2, both of which write to TABLE1. Thus whenever T1 or T2 runs, it changes TABLE1. Let's run these transactions in two serial scenarios: in scenario 1, T1 runs to completion and then T2 runs; in scenario 2, T2 runs first and then T1.

If TABLE1 looks the same after scenario 1 as it does after scenario 2, then Redshift can safely run T1 and T2 in parallel. No problem!

But in rare cases, parallel execution of T1 and T2 can yield a result that neither serial order could produce. This would violate the database's data integrity, so Redshift prevents it by aborting one of the transactions (and rolling it back).

Here’s an example. Consider an empty table. Transaction T1 SELECTs all the rows in the table and then INSERTs a row. Transaction T2 INSERTs a row into the table and then SELECTs all the rows in the table.

Scenario 1

T1 executes to completion, followed by T2. T1 will SELECT no rows from the empty table and INSERT a row, and then T2 will INSERT a row and SELECT 2 rows. So the SELECTed output would be 0 rows for T1 and 2 rows for T2.

Scenario 2

Reverse the order, so T2 goes first. T2 will INSERT a row and then SELECT all the rows (i.e. 1) in the table, and then T1 will SELECT all rows and then INSERT a row. So the SELECTed output of the transactions will be 1 row for T2 and 1 row for T1.

The Violation

Let’s look at what might happen when execution is parallel:

T2 could INSERT first,

T1 could then SELECT and INSERT, and

then T2 could SELECT.

This would result in T1 SELECTing one row but T2 SELECTing two rows. We cannot get this result through any serial execution of T1 and T2. Redshift aborts T1, since T1 performed the WRITE that triggered the violation.
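The two serial scenarios and the violating interleaving can be sketched with a small in-memory model (plain Python, not Redshift; each SELECT is modeled as simply returning the table's current row count):

```python
# Minimal model of the T1/T2 example: a shared "table" is a list,
# INSERT appends a row, and SELECT records the current row count.

def run(schedule):
    table = []    # the shared table, initially empty
    selects = {}  # SELECT result observed by each transaction
    for txn, op in schedule:
        if op == "INSERT":
            table.append(txn)
        else:  # SELECT
            selects[txn] = len(table)
    return selects

# T1: SELECT then INSERT.  T2: INSERT then SELECT.
t1 = [("T1", "SELECT"), ("T1", "INSERT")]
t2 = [("T2", "INSERT"), ("T2", "SELECT")]

scenario1 = run(t1 + t2)  # T1 first: T1 sees 0 rows, T2 sees 2
scenario2 = run(t2 + t1)  # T2 first: T2 sees 1 row,  T1 sees 1

# The interleaving from the post: T2 INSERTs, T1 runs fully, T2 SELECTs.
interleaved = run([t2[0]] + t1 + [t2[1]])  # T1 sees 1 row, T2 sees 2

# The interleaved outcome matches neither serial order, so no serial
# schedule could have produced it: a serializable isolation violation.
assert interleaved not in (scenario1, scenario2)
```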

Database “Snapshots”

How does Redshift detect this? It takes a snapshot of the database after any of the following statements:

SELECT

COPY

DELETE

INSERT

UPDATE

TRUNCATE

ALTER TABLE

CREATE TABLE

DROP TABLE

TRUNCATE TABLE

Monitoring for violations begins immediately after the snapshot. You can assume it’s happening all the time.

Debugging a Real-World Scenario

Let’s look at a real-world scenario, and how we could debug it. Here is an example that we recently came across in one of our clusters.

Operations on temp tables are a gray area. Temp tables cannot trigger a violation error, since they are per-session. Still, they can be useful for debugging, because they do start serializable isolation monitoring (due to the snapshot).

Here’s what we learned:

Transaction 40805600 reads from table 13187208

Transaction 40805609 reads from and writes to 13187208

Transaction 40805622 reads and joins tables 13187208 to 13187250. (this is only evident in the omitted query text)

We can see how this cycle of transactions triggered the error. 40805600 read at the beginning of the cycle. Then 40805609 read and wrote. Finally, 40805600 read and wrote again (via the delete). And 40805622, due to its join, introduced a dependency between the writes to the two tables.
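A cycle like this can be made concrete as a dependency graph: an edge from transaction A to transaction B means an operation of A preceded and conflicted with one of B, and a cycle means no serial order can explain the execution. The edges below are illustrative assumptions modeled on the dependencies described above, not data pulled from the cluster:

```python
# Hypothetical dependency edges for the three transactions in the example.
deps = {
    40805600: [40805609],  # 40805600's read preceded 40805609's write
    40805609: [40805622],  # 40805609's write fed 40805622's join
    40805622: [40805600],  # the join ties back to 40805600's later write
}

def has_cycle(graph):
    """Depth-first search; a node revisited on the current path is a cycle."""
    seen, path = set(), set()
    def visit(node):
        if node in path:
            return True
        if node in seen:
            return False
        seen.add(node)
        path.add(node)
        if any(visit(n) for n in graph.get(node, [])):
            return True
        path.discard(node)
        return False
    return any(visit(n) for n in graph)

# A cycle in the graph means the schedule is not serializable.
assert has_cycle(deps)
```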

Fixing Serialization Violation Errors

So far, we’ve diagnosed and understood the cause of our serialization violation. How can we avoid this problem? There are three approaches:

Move SELECT statements out of the offending transactions.

When possible, move all SELECTs out of the transaction. This will resolve the conflict. But it may not be possible when your WRITE operation depends on the result of a SELECT.

Pessimistic locking.

You can prevent the conflict by locking the table using the LOCK statement. In our case, we could lock 13187208 at the start of 40805600 or 40805609. This comes at the cost of concurrency and performance. Locking a table can affect transactions outside of this cycle if they need to read that table.
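The effect of pessimistic locking can be illustrated with SQLite (an assumption for the sketch; Redshift's actual mechanism is the LOCK statement, but SQLite's BEGIN IMMEDIATE similarly takes the write lock up front, so a second writer must wait for the first to commit):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# First session: grab the write lock at the start of the transaction.
c1 = sqlite3.connect(path, isolation_level=None)  # autocommit; we manage txns
c1.execute("CREATE TABLE t (x INTEGER)")
c1.execute("BEGIN IMMEDIATE")  # pessimistic: lock before doing any work
c1.execute("INSERT INTO t VALUES (1)")

# Second session refuses to wait (timeout=0), so instead of conflicting
# it fails fast while the first transaction holds the lock.
c2 = sqlite3.connect(path, timeout=0, isolation_level=None)
try:
    c2.execute("BEGIN IMMEDIATE")
    blocked = False
except sqlite3.OperationalError:
    blocked = True  # "database is locked": the writer was held off

c1.execute("COMMIT")
```

This is exactly the trade-off described above: the second session makes no progress at all until the first commits, which is the cost in concurrency you pay for ruling out the conflict.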

Reschedule transactions.

Change the query scheduler to run transactions at different times. Transaction 40805609 is a COPY job and 40805600 and 40805622 are transformations. We could avoid the conflict completely by ensuring they don’t run at the same time.

Going beyond troubleshooting

We hope this post helped you find and fix serializable isolation violation errors in Redshift. Going further, we suggest our “Top 14 Performance Tuning Techniques for Amazon Redshift”, where we talk in depth about how to improve query performance and the challenges of scaling your workloads.