Clustered Index Key – Natural or Surrogate?

Background

This article will follow a practical approach to choose the right candidate for clustered index. It’ll investigate through the best practices recommended for clustered index, with the help of an experiment.

Primary Key and Clustered Index

Primary Key and Clustered Index are different from each other. Clustered Index should not be confused with Primary Key. Primary Key can be created with / without Clustered / Nonclustered Index.

Description of terminologies used

Natural Clustered Key: Column or set of columns forming the clustered index, naturally part of table having some business meaning associated with it.

Clustered Index best practices

It should be as narrow as possible in terms of the number of bytes it stores.

It should be unique to avoid SQL Server to add Uniquefier to duplicate key values.

It should be static, ideally, never updated like Identity column.

It should be ever-increasing, like Identity column to avoid fragmentation and improve the write performance.

Why should we always have a Clustered Index?

Clustered Index out-performs NonClustered Index, since it stores the data for every row. Data page is stored in the leaf node of the Clustered Index.

Data is sorted based on clustered index key values and then stored.

The only time the data is stored in sorted order when the table contains a clustered index. Sort operation is most costly operator in SQL Server.

Since data is stored in sorted order, hence sorting on the clustered key column avoids the sort operator, and makes it best choice for sorting.

It helps the Database engine to lock a particular row, instead of a Page or a Table. Row level locking can be only achieved if table has Clustered Index. SQL Server applies lock on Row, Page or Table to ensure consistency.

What should be our Clustered Index?

By looking at the best practices, it’s very hard to find a Natural Key that can satisfy all the four recommendations.

Prima facie, it seems Surrogate Key, such as an Identity column seems to be the only best candidate for Primary Key.

Wait, if Surrogate Key is the best candidate then how to deal with following overheads?

How to minimize Sorting overhead?

Clustered Index column should be referred in Sort operation in the same sort order as defined in the index. Surrogate Key for e.g. Identity column is hardly the sorting requirement.

How to minimize Lock escalation – Page or Table lock?

All the DELETE and UPDATE should happen on Clustered Index Key column. There is hardly a requirement to delete / update based on Surrogate Key.

Alternate approach could be – take all the clustered key column values in a temp table and then using join or sub-query use it in delete / update statement. This is an additional overhead.

What we’ll achieve through this experiment?

We’ll compare the following parameters. It’ll be then left to us to decide what best suits to us, depending upon the requirement and workload.

Execution Time

Avg. Read Execution Time

Avg. Write Execution Time

Index size and read/write count

Index size (KB)

num_of_writes

num_of_reads

Lock/Latch count/wait count/wait time

row_lock_count

page_lock_count

page_lock_wait_count

page_lock_wait_in_ms

index_lock_promotion_attempt_count

page_latch_wait_count

page_latch_wait_in_ms

tree_page_latch_wait_count

tree_page_latch_wait_in_ms

Index usage stats

user_seeks

user_scans

user_updates

Workload considered for this test

IoT (Internet of Things) and IIot (Industrial Internet of Things) domains are leading domains. These domains deals with huge volume of data. Data volume is huge in both read as well write, since it deals with machines.

A machine has multiple components and every components has multiple sensors and every sensor sends multiple messages per seconds/milliseconds.

Timestamp is the key attribute for the calculation of the KPI’s to show the real-time analytics.

We’ll have a very simple example with one table, that holds the Coffee Vending Machine’s Sensor Data. Every second, 1000’s of records of multiple machines gets written into this table. The scenario here pertains to an application which gets refreshed every 10 seconds and reads past few minutes of data to render analytics over the UI.

Enable all the jobs and disable it after 2 minutes. Once all the jobs are disabled run the stats select queries.

Findings

Execution#4 with Natural Clustered Key won with GOLD. It stood 1st in the Ranking in our experiment.

Execution#3 with Surrogate Clustered Key (with nonclustered covering index) was runner up with SILVER. It stood 2nd in the Ranking in our experiment. It took the additional storage space which in-turn is an overhead to Disk IO.

Execution#5 without any Index stood BRONZE. It stood 3rd in the Ranking in our experiment.

Execution#1 and Execution#2 both lost the game in our experiment. Both of them stood 4th in the Ranking in our experiment.

Conclusion

There is no panacea solution. It’s always advisable to evaluate every recommendation / best practice and its trade-offs holistically, before actually applying it.

Sometimes it’s wise decision not to have any index (including clustered index), than having additional overhead of unused / poorly performing index.