Optimizing transactions for SQL Data Warehouse

In this article

This article explains how to optimize the performance of your transactional code while minimizing risk for long rollbacks.

Transactions and logging

Transactions are an important component of a relational database engine. SQL Data Warehouse uses transactions during data modification. These transactions can be explicit or implicit. Single INSERT, UPDATE, and DELETE statements are all examples of implicit transactions. Explicit transactions use BEGIN TRAN, COMMIT TRAN, or ROLLBACK TRAN. Explicit transactions are typically used when multiple modification statements need to be tied together in a single atomic unit.

Azure SQL Data Warehouse commits changes to the database using transaction logs. Each distribution has its own transaction log. Transaction log writes are automatic. There is no configuration required. However, whilst this process guarantees the write it does introduce an overhead in the system. You can minimize this impact by writing transactionally efficient code. Transactionally efficient code broadly falls into two categories.

Adopt a partition switching pattern for large modifications to a given partition

Minimal vs. full logging

Unlike fully logged operations, which use the transaction log to keep track of every row change, minimally logged operations keep track of extent allocations and meta-data changes only. Therefore, minimal logging involves logging only the information that is required to roll back the transaction after a failure, or for an explicit request (ROLLBACK TRAN). As much less information is tracked in the transaction log, a minimally logged operation performs better than a similarly sized fully logged operation. Furthermore, because fewer writes go the transaction log, a much smaller amount of log data is generated and so is more I/O efficient.

The transaction safety limits only apply to fully logged operations.

Note

Minimally logged operations can participate in explicit transactions. As all changes in allocation structures are tracked, it is possible to roll back minimally logged operations.

Internal data movement operations (such as BROADCAST and SHUFFLE) are not affected by the transaction safety limit.

Minimal logging with bulk load

CTAS and INSERT...SELECT are both bulk load operations. However, both are influenced by the target table definition and depend on the load scenario. The following table explains when bulk operations are fully or minimally logged:

Primary Index

Load Scenario

Logging Mode

Heap

Any

Minimal

Clustered Index

Empty target table

Minimal

Clustered Index

Loaded rows do not overlap with existing pages in target

Minimal

Clustered Index

Loaded rows overlap with existing pages in target

Full

Clustered Columnstore Index

Batch size >= 102,400 per partition aligned distribution

Minimal

Clustered Columnstore Index

Batch size < 102,400 per partition aligned distribution

Full

It is worth noting that any writes to update secondary or non-clustered indexes will always be fully logged operations.

Important

SQL Data Warehouse has 60 distributions. Therefore, assuming all rows are evenly distributed and landing in a single partition, your batch will need to contain 6,144,000 rows or larger to be minimally logged when writing to a Clustered Columnstore Index. If the table is partitioned and the rows being inserted span partition boundaries, then you will need 6,144,000 rows per partition boundary assuming even data distribution. Each partition in each distribution must independently exceed the 102,400 row threshold for the insert to be minimally logged into the distribution.

Loading data into a non-empty table with a clustered index can often contain a mixture of fully logged and minimally logged rows. A clustered index is a balanced tree (b-tree) of pages. If the page being written to already contains rows from another transaction, then these writes will be fully logged. However, if the page is empty then the write to that page will be minimally logged.

Optimizing deletes

DELETE is a fully logged operation. If you need to delete a large amount of data in a table or a partition, it often makes more sense to SELECT the data you wish to keep, which can be run as a minimally logged operation. To select the data, create a new table with CTAS. Once created, use RENAME to swap out your old table with the newly created table.

Optimizing updates

UPDATE is a fully logged operation. If you need to update a large number of rows in a table or a partition, it can often be far more efficient to use a minimally logged operation such as CTAS to do so.

In the example below a full table update has been converted to a CTAS so that minimal logging is possible.

In this case, we are retrospectively adding a discount amount to the sales in the table:

Optimizing with partition switching

If faced with large-scale modifications inside a table partition, then a partition switching pattern makes sense. If the data modification is significant and spans multiple partitions, then iterating over the partitions achieves the same result.

The steps to perform a partition switch are as follows:

Create an empty out partition

Perform the 'update' as a CTAS

Switch out the existing data to the out table

Switch in the new data

Clean up the data

However, to help identify the partitions to switch, create the following helper procedure.

Pause and scaling guidance

Azure SQL Data Warehouse lets you pause, resume, and scale your data warehouse on demand. When you pause or scale your SQL Data Warehouse, it is important to understand that any in-flight transactions are terminated immediately; causing any open transactions to be rolled back. If your workload had issued a long running and incomplete data modification prior to the pause or scale operation, then this work will need to be undone. This undoing might impact the time it takes to pause or scale your Azure SQL Data Warehouse database.

Important

Both UPDATE and DELETE are fully logged operations and so these undo/redo operations can take significantly longer than equivalent minimally logged operations.

The best scenario is to let in flight data modification transactions complete prior to pausing or scaling SQL Data Warehouse. However, this scenario might not always be practical. To mitigate the risk of a long rollback, consider one of the following options: