Why use a GUID for a clustering key?

Conventional wisdom in the database world is that nobody should ever use a GUID for a clustering key. And the people who do are clearly awful people who hate kittens and everything good in the world. I’ve probably even gotten up in somebody’s business about this before. The problem with common knowledge is that it may not always be right. After all, common knowledge used to be that the moon made you crazy.

Think like an architect

Before discounting GUIDs as a clustering key, think about what your application requires.

Software and database architects take into account an application’s requirements as they go through the design process. It’s necessary to think about how an application needs to identify data as it’s generated – you might have to generate values for the clustering key in the application tier long before an object graph is persisted to the database. Or you might want to make sure that you have more than 2,147,483,647 possible keys (you could double that if you start at -2,147,483,648).

I’ve got your BIGINT right here!

“But what about BIGINT,” you shout. With modern use cases (like the Internet of Things) it’s possible to generate tremendous volumes of data. Long ago, I worked at a utility company. Every meter could generate readings at whatever frequency the business wanted. More frequent readings allow better long-term trending, forecasting, and pricing, but at a cost. With storage now being stupendously cheap (I have around 6 TB of raw SSD capacity in my desktop), it’s possible to store huge amounts of data.

Let’s think through the requirements:

- A single meter could generate 1 reading per minute, which is 1,440 readings per day.
- Over 1 year, that’s 525,600 readings per power meter.
- For a 32-bit integer (using the full range), that’s 8,171 meter years of readings. We could exhaust that in a few years with any reasonable number of homes using our utility.
- For a 64-bit integer, that’s 35,096,545,041,304 meter years of readings.
- There are 3 million households in New York City.
- Let’s say we get half of them on our service.
- That’s still more than 23 million years of readings.
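The capacity arithmetic above is easy to check for yourself. A quick sketch (the 32-bit figure assumes you use the full range, starting at -2,147,483,648 as mentioned earlier):

```python
# Back-of-the-envelope key-space math for one-reading-per-minute meters.
READINGS_PER_DAY = 24 * 60                    # 1,440
READINGS_PER_YEAR = READINGS_PER_DAY * 365    # 525,600

int32_keys = 2**32     # full signed range, -2,147,483,648 to 2,147,483,647
int64_keys = 2**64

int32_meter_years = int32_keys // READINGS_PER_YEAR   # 8,171
int64_meter_years = int64_keys // READINGS_PER_YEAR   # 35,096,545,041,304

meters = 3_000_000 // 2    # half of NYC's ~3 million households
years_of_capacity = int64_meter_years // meters       # ~23 million years

print(int32_meter_years, int64_meter_years, years_of_capacity)
```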

So, what would we need GUIDs for?

We might need GUIDs if pre-processing occurs in the application to associate multiple data points with one another (foreign keys). Or if we anticipate having even more data coming into the system. With increases in metering and monitoring technology, it’s possible that we could collect additional sensor data to go along with all of those meter readings (think HVAC and lighting information combined with weather station data). GUIDs can also help avoid last-page insert contention in write-heavy workloads.
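The first point above is worth making concrete: with GUIDs, the application tier can mint keys itself and wire up an entire object graph before anything touches the database. A minimal sketch, using Python’s uuid4 as a stand-in for SQL Server’s NEWID:

```python
import uuid

# Assign keys in the application tier, before either row is persisted.
# No database round trip (and no waiting on an identity column) is needed
# to know the foreign key value.
reading_id = uuid.uuid4()

reading = {"id": reading_id, "kwh": 1.21}
sensor_event = {
    "id": uuid.uuid4(),
    "reading_id": reading_id,   # FK is known up front
}

assert sensor_event["reading_id"] == reading["id"]
```

With integer identity keys, you’d typically have to insert the parent row first and read the generated key back before you could build the child rows.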

But the randoms!

GUIDs are random, at least as generated by SQL Server’s NEWID function. And, as we all know, randomness is bad because it causes fragmentation in our data structures. Here’s the deal – unless you are only appending data to the end of a table with no indexes, all of your inserts will cause fragmentation.
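You can see the “random” part for yourself. A sketch, again using Python’s uuid4 as a stand-in for NEWID: identity-style integer keys arrive already in key order, while random GUIDs almost never do, which is exactly what produces mid-index inserts:

```python
import uuid

# Keys listed in insert order.
seq_keys = list(range(100))                      # identity-style integers
rand_keys = [uuid.uuid4() for _ in range(100)]   # NEWID-style random keys

print(seq_keys == sorted(seq_keys))     # True: every insert is an append
print(rand_keys == sorted(rand_keys))   # all but certainly False
```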

Sure, GUIDs may cause fragmentation in the clustered index on your base table. Can you conclusively prove, outside of a contrived example, that the primary performance problem in your application is fragmentation caused by GUIDs? And that this isn’t just free space on disk? Remember – if you have indexes, you’re going to have out-of-order inserts into those indexes. Are you positive the poor performance isn’t just your awful code, or bad indexing, or maybe just out-of-date stats?

Before blaming the randomness of GUIDs, do your homework and prove where the bottlenecks exist in your application. Fix those. Find the next bottleneck. Rinse. Repeat.

What’s the takeaway?

It’s important to understand the reasons that you’re using one datatype over another. Understand the application’s patterns before you make a decision – database decisions last forever. OK, maybe they don’t last forever, but bad database decisions become difficult to change once you have any significant data volume present. Do your homework, understand how your choices affect you, and don’t be afraid to go against conventional wisdom when you know what’s right for your application.

Normally I love everything you write, but this is way off the mark. I’ve proven this a number of times. Just a couple of examples. 1. I did a benchmark years ago where I compared the performance of INT and GUID clustered indexes. The GUID fully fragmented after 15 minutes of heavy OLTP load and ground to a halt. The INT never showed noticeable performance degradation. As well, we did a contract with a company that was redoing the app the Dallas police use to query from their cars, and for the jails. They didn’t take our advice and they used clustered GUIDs. The entire police network app crashed completely in less than half a day. They had to go back and change them all to INT, and things were much, much better. So yeah, while some people can have issues with their apps, clustered GUIDs are going to make things far worse.

2. I was importing Perfmon data into a table. The table used GUIDs. Importing the data has to be done into that original table, but I in turn move it to another table keyed on INT. Building the index on the GUID table took about 45 minutes; on the INT table it took about 7 minutes. Reindexes went much faster too, and the exact same joins ran far faster against the INT tables than against the GUIDs.

So no: using GUIDs is OK if you have to, and sometimes you do, but they shouldn’t be clustered.

Also “Hey, this thing happened once and it was bad” isn’t really a valid argument. Once, someone went to South Africa and they got eaten by a lion. Truthfact. Should we all avoid South Africa? Probably not.

First of all… I’m not disagreeing with you. But… I’m just not able to wrap my head around what you’re saying. What I know is: if a clustering key is random, it would be slower to write to the disk, right? Not to mention a higher likelihood of page splits. So how is this okay in a highly transactional database? I really, really want to understand. (Asking for a friend 😉)

Writes are writes – writing an 8KB block doesn’t get slower when it’s random. The time to find the right location will change, though. But SQL Server only writes data pages during a checkpoint. During that checkpoint, SQL Server will gather up pages that will be located next to each other on disk and then write as many pages sequentially as possible. Of course, this also assumes that the storage subsystem is going to behave the way it should, which is a topic for a different blog post.

Page splits… my least favorite boogeyman. Having indexes increases your likelihood of page splits. It’s true – indexes are random compared to the insert order in the table, so data is usually being inserted into the middle of the index. You can’t avoid page splits; they always happen. Instead, we should design systems that meet the needs of our applications and the load being placed on the system.
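That “inserted into the middle of the index” point holds for any index whose key order differs from insert order. A toy sketch (not SQL Server’s actual page mechanics): treat an index as a sorted list, and count how many inserts land somewhere other than the end, since those are the ones that can force a split:

```python
import random

def mid_inserts(keys):
    """Count inserts that land before the current maximum key,
    i.e. that would go into the middle of a sorted index."""
    index, mid = [], 0
    for k in keys:
        if index and k < index[-1]:   # not a pure append
            mid += 1
        index.append(k)
        index.sort()                  # keep the toy index in key order
    return mid

print(mid_inserts(range(1000)))       # 0: sequential keys only ever append

shuffled = list(range(1000))
random.shuffle(shuffled)
print(mid_inserts(shuffled))          # nearly every insert hits the middle
```

An insert order that matches one index’s key order will still be “random” with respect to every other index on the table, which is why adding indexes guarantees mid-index inserts somewhere.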