“You should use integers for the key rather than strings. The joins will be faster.”

It sounds logical. Surely integers are easier to compare than large, complex strings. The question is: is it true?

This is going to take lots and lots of rows to get any significant results, so I'm going to do two sets of tests. The first compares query execution speed between string and integer data types on the join column, while keeping the size of the join column the same in both tables. This will show whether there's a difference caused by the data type alone.

The second test will have both the data type and the size of the join columns differ, while the total size of the table row is kept the same. This addresses the 'string joins are slower because they are larger' argument.

Test 1: Same key size, no indexes

The two tables have the same size join column – a bigint in one and a char(8) in the other.
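The actual repo scripts aren't reproduced here, but tables along these lines would fit the description. The table and column names, and the padding column, are my own illustration, not the author's script:

```sql
-- Hypothetical sketch of the Test 1 tables: the join columns are the
-- same size (8 bytes each); only the data type differs.
CREATE TABLE dbo.PrimaryTable_Int (
    ID BIGINT NOT NULL PRIMARY KEY,
    SomeColumn CHAR(100) NOT NULL  -- padding to give the rows realistic width
);

CREATE TABLE dbo.SecondaryTable_Int (
    ID INT IDENTITY NOT NULL PRIMARY KEY,
    ForeignID BIGINT NOT NULL,     -- joins to PrimaryTable_Int.ID
    SomeColumn CHAR(100) NOT NULL
);

CREATE TABLE dbo.PrimaryTable_String (
    ID CHAR(8) NOT NULL PRIMARY KEY,  -- same 8 bytes as the bigint above
    SomeColumn CHAR(100) NOT NULL
);

CREATE TABLE dbo.SecondaryTable_String (
    ID INT IDENTITY NOT NULL PRIMARY KEY,
    ForeignID CHAR(8) NOT NULL,    -- joins to PrimaryTable_String.ID
    SomeColumn CHAR(100) NOT NULL
);
```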

First up, a check of what Statistics IO and Statistics Time show for the two queries, and whether there's any difference in execution plan. (Full repo code is available for download; link at the end of the post.)
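The measurement setup would look something like this, assuming a pair of test tables with a `ForeignID` join column (the table and column names here are my own illustration, not the repo script):

```sql
-- Turn on the per-query IO and CPU/elapsed-time reporting,
-- then run the two joins being compared.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT p.SomeColumn, s.SomeColumn
FROM dbo.PrimaryTable_Int p
INNER JOIN dbo.SecondaryTable_Int s ON p.ID = s.ForeignID;

SELECT p.SomeColumn, s.SomeColumn
FROM dbo.PrimaryTable_String p
INNER JOIN dbo.SecondaryTable_String s ON p.ID = s.ForeignID;
```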

The execution plan's the same, but that shouldn't really be a surprise. With no nonclustered indexes there have to be table scans (or clustered index scans) and, with the resultsets not ordered by the join key, a hash join is about the only join that could be used efficiently here.

The CPU time is the interesting thing: 35% more CPU time for the string join. To check that the difference is consistent and not a one-off, I'm going to run the same test 10 times each, use Profiler to catch the durations and CPU times, and aggregate the results.
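If the Profiler trace is saved to a table, aggregating the ten runs per query is a simple GROUP BY. This is a sketch only: the trace table name is hypothetical, and the post doesn't show how the author aggregated:

```sql
-- Hypothetical aggregation over a Profiler trace saved to a table.
-- The start of TextData distinguishes the int-join batch from the
-- string-join batch; EventClass 12 is SQL:BatchCompleted.
SELECT
    LEFT(CAST(TextData AS VARCHAR(100)), 50) AS Query,
    COUNT(*)             AS Executions,
    AVG(CPU)             AS AvgCPU_ms,
    AVG(Duration) / 1000 AS AvgDuration_ms  -- trace Duration is in microseconds
FROM dbo.TraceResults
WHERE EventClass = 12
GROUP BY LEFT(CAST(TextData AS VARCHAR(100)), 50);
```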

That's a notable difference in the average CPU usage: on average, 31% greater CPU usage for the string join than for the integer join.

Maybe an index will fix things…

Test 2: Same key size, indexes on join column

Same tables, just with a nonclustered index added on the foreign key column.
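Something like the following would add those indexes, again assuming illustrative table and column names rather than the actual repo script:

```sql
-- Nonclustered indexes on the foreign key (join) columns,
-- enabling index scans and a merge join on the ordered keys.
CREATE NONCLUSTERED INDEX idx_SecondaryInt_ForeignID
    ON dbo.SecondaryTable_Int (ForeignID);

CREATE NONCLUSTERED INDEX idx_SecondaryString_ForeignID
    ON dbo.SecondaryTable_String (ForeignID);
```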

The one scan has changed to an index scan and the join type is now merge join (as the indexes provide the join order), but the plan still has the same form (as would be expected) and there’s still a fairly substantial difference in the CPU times.

On average, a 50% increase in CPU time. That's pretty extreme.

Test 3: Same row size, no indexes

For this test, the join columns are now different sizes. (This is the second script in the attached repo code, for anyone using that.) I'm using an int in the one table and a char(24) in the other. This is probably a little more realistic: if strings are being used as keys and join columns, there's a very good chance they'll be longer than an int would be.
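To keep the total row size the same while the keys differ (4 bytes vs 24), the padding column has to absorb the 20-byte difference. A hypothetical sketch, with names of my own invention:

```sql
-- Hypothetical Test 3 tables: the join columns differ in size,
-- so the padding column is widened on the int side to keep the
-- total row size equal between the two tables.
CREATE TABLE dbo.PrimaryTable_Int2 (
    ID INT NOT NULL PRIMARY KEY,      -- 4-byte key
    SomeColumn CHAR(120) NOT NULL     -- 20 bytes wider, balancing the narrower key
);

CREATE TABLE dbo.PrimaryTable_String2 (
    ID CHAR(24) NOT NULL PRIMARY KEY, -- 24-byte key
    SomeColumn CHAR(100) NOT NULL
);
```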

Straight into the testing: the query's the same form, and the names of the tables are the only things that changed.

And we're back to the same plan as in the first test – clustered index scans and a hash join, for the same reasons. I seem to have messed up somewhere in trying to keep the tables the same size; still, a 3-read difference on the lookup table is not really significant.

Just as in all the previous tests, the average CPU usage on the string join is markedly higher. This time it’s nearly 100% greater than for the int joins.

Test 4: Same row size, index on join column

In this test I'm breaking my own test rules a bit. While the tables' row sizes are the same, the index row sizes are not. Still, I feel it's fair enough, as it reflects what would be done on a real system (no one pads out indexes for no reason).

And now we're at over a 100% increase in average CPU time. The differing row sizes (with correspondingly differing page counts) will be contributing to that, but only contributing, not causing it, since we saw similar increases in the earlier cases where the row sizes matched.

Conclusion

Is the use of integer data types better for join columns than strings? It certainly does appear so, and not insignificantly either. Bear in mind, though, that what I was doing here was a bit extreme: 2.5 million rows in a query with no filters applied. This shouldn't be something that ever gets done in a real system. So it's not a case that YMMV; it's a case that it almost certainly will. You probably will see different results, but it is definitely something worth testing when planning data types for a database.

While this may not be the final nail in the coffin for natural keys, it is worth keeping in mind when choosing between natural and artificial keys for a system, especially one likely to process large numbers of rows, such as a decision support/data warehouse system. Test carefully with expected data volumes and expected loads if you're considering natural keys, and decide based on the results of those tests.

20 Comments

Very good and useful analysis, I enjoyed reading it. It would be interesting to see the effect of int vs string joins on queries that return a smaller set of rows but that are executed much more often. Thanks.

One thing to consider is that regardless of the comparative CPU cost of joining 4-byte-ints vs. 20-plus-byte strings, a global truism is to keep columns as short as possible, *especially* those involved in the clustered index or other indexed columns. Even if the joins themselves were no more expensive in terms of CPU cycles, you’re just going to get fewer rows per page with wider columns (which = more reads to satisfy the same request). Never forget that the clustered index columns are repeated in every row of the nonclustered indexes…

Very interesting article. I’ve always ‘known’ ints to be faster than strings for joins, but never cared enough to really test it. Nice to have some supporting evidence. I’ve always subscribed to using IDENTITY surrogate keys and unique constraints to enforce any natural keys. I think the much more interesting metric would be the difference between using ints and guids (uniqueidentifier) as keys.

I’ve had several requests for additional tests (strings where the differences are at the beginning vs at the end, effects of collation), so I will revisit this in the future and do some additional experiments.

Great post, and good food for thought. I do occasionally use char(4) natural keys instead of ints as FKs, and have never wondered about the join issue. May want to think about collation next time I do that! Having said that, I’m actually impressed with how low the string manipulation overhead is, assuming it’s due to the case-insensitive compare.

With the size of data I was playing with, the majority of the elapsed time is going to be in displaying the data, and since the data volumes are the same the display times are going to be close.

Yes, the server has CPU to use, but not to waste. I want the CPU running as efficiently as possible so that I can put more useful work onto one server. Sure, 800ms of CPU is nothing, unless it’s 800ms extra on a query that runs 30 times a minute.

I specifically crafted these tables so that the IO would be the same, so that I could focus on the CPU differences. In reality that's not going to be the case: there will be a difference in IO between a large table with a 4-byte int and a large table with a 50-byte varchar, especially if the PK is also the cluster.

I achieved different results when comparing in a situation where a company uses IDENTITY INT PKs, but also has a Code varchar(10) unique column and a Name varchar(50) column. I used TABLESAMPLE (5 PERCENT) to grab a sizable chunk of data from one of our regularly used tables. Then I ran two test scenarios: one where I converted the Code varchar(10) to char(10), and the other leaving it as varchar. In each of the lookup tables (128 rows), I added the primary key clustered. Then, in the data tables, I added an index on the Code with included columns for the PK and a couple of dates. I then ran REBUILD on all the indexes I had just made, for safety, and executed each query ten times.

But in both examples the char/varchar query ran >= 20 ms faster. I noticed in the execution plans that the string joins used a merge join and the int join used a nested loop.

I have two theories for the CPU difference. One is that the index map for the string result set is constructed more efficiently than for integers, especially if the int result set varies in range, like (1,2,3,4,5,10999,12093) from unit tests.
My other theory is that perhaps the string comparisons make better use of multi-core machines?

But in either case I always favor using real primary keys instead of INT IDENTITY(1,1). If your data becomes corrupted or you want to set up manual replication in the future, using a real primary key is much more helpful from my perspective.

Still, it was good to see you putting this out there. I wonder if I could reproduce the same results by expanding to varchar(24).