
I have a table which links several fairly long text fields (~100 chars) to a single text field: { A, B, C, D } => E. The dependent value can change, so I query data daily and record A, B, C, D, EffectiveDate, E. My design right now has a clustered PK on { A, B, C, D, EffectiveDate }, in that order. I had thought that I could then query the most recent value easily:
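A self-join of roughly this shape (a sketch: the original query didn't survive, and the table name `ValueHistory` is my assumption based on the columns described above):

```sql
-- Latest E per {A, B, C, D}: join each combination back to its max EffectiveDate.
SELECT h.A, h.B, h.C, h.D, h.EffectiveDate, h.E
FROM ValueHistory AS h
JOIN (
    SELECT A, B, C, D, MAX(EffectiveDate) AS MaxDate
    FROM ValueHistory
    GROUP BY A, B, C, D
) AS latest
    ON  latest.A = h.A
    AND latest.B = h.B
    AND latest.C = h.C
    AND latest.D = h.D
    AND latest.MaxDate = h.EffectiveDate;
```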

Since the self-join is on the first fields in the clustered key, I expected good performance, with a merge join. However, it's taking about two minutes with less than a million rows in the table. This is part of a near-real-time update, so I really need it to take 30 seconds, tops.

Is there a better indexing strategy for this scenario? I have a couple of ideas which I have not tried, because of some combination of (A) I'm lazy and (B) I'd rather get the right answer from the wise people on Stack Exchange, so I'll know better in the future. I realize that the "best" solution will depend on exactly the data I have, but I suspect there's a good general solution I'm missing; I just need to adjust my clustering somehow.

I could replace the wide PK with a synthetic integer key but keep the wide key for clustering; my understanding, though, is that this would only reduce the size of any additional nonclustered indexes, of which this table has none.
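For reference, that layout would look roughly like this (a sketch; the table name, column types, and widths are assumed, sized so the four key columns total 600 characters as described below):

```sql
-- Narrow surrogate PK (nonclustered); the wide natural key stays clustered.
CREATE TABLE ValueHistory (
    ValueHistoryID int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_ValueHistory PRIMARY KEY NONCLUSTERED,
    A varchar(150) NOT NULL,
    B varchar(150) NOT NULL,
    C varchar(150) NOT NULL,
    D varchar(150) NOT NULL,
    EffectiveDate date NOT NULL,
    E varchar(150) NOT NULL
);

CREATE UNIQUE CLUSTERED INDEX CIX_ValueHistory
    ON ValueHistory (A, B, C, D, EffectiveDate);
```

Since the clustered key is unchanged, the physical row order (and therefore the join above) behaves the same; only rows in nonclustered indexes would shrink, which is why this doesn't help here.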

I could stop recording data every day and instead record only changes. Values for E change every week or two on average, so I'm definitely bloating the data roughly tenfold.
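The change-only variant could be a conditional insert during the daily load, something like this sketch (the `Staging` table holding today's values, the `ValueHistory` name, and the date handling are all assumptions):

```sql
-- Insert a new row only when E differs from the latest recorded value
-- (or when the combination has never been seen before).
INSERT INTO ValueHistory (A, B, C, D, EffectiveDate, E)
SELECT s.A, s.B, s.C, s.D, CAST(GETDATE() AS date), s.E
FROM Staging AS s
OUTER APPLY (
    SELECT TOP (1) h.E
    FROM ValueHistory AS h
    WHERE h.A = s.A AND h.B = s.B AND h.C = s.C AND h.D = s.D
    ORDER BY h.EffectiveDate DESC
) AS cur
WHERE cur.E IS NULL OR cur.E <> s.E;
```

The `TOP (1) ... ORDER BY EffectiveDate DESC` seek should be cheap per row against the existing clustered key, since it reads only the last entry for each combination.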

I could create a synthetic key to collapse these text values to integers, or maybe use a hash to join first and then resolve collisions with the exact values.

I could shift to a type 4 SCD and do all the heavy lifting during overnight ETL.

Finally, you could use HASHBYTES to collapse the columns into a hashed computed column, and add that hash as the first column of the index; it's highly selective because it's almost unique.
A surrogate key may also be useful.
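A sketch of that computed-column approach (names assumed; the column is marked PERSISTED so it can be indexed, and the `'|'` separators guard against collisions from shifted column boundaries):

```sql
-- SHA1 hash of the four key columns as a narrow, near-unique computed column.
ALTER TABLE ValueHistory
    ADD KeyHash AS HASHBYTES('SHA1', A + '|' + B + '|' + C + '|' + D) PERSISTED;

-- 20-byte leading column instead of ~600 characters; EffectiveDate keeps
-- each combination's history ordered within the index.
CREATE INDEX IX_ValueHistory_KeyHash
    ON ValueHistory (KeyHash, EffectiveDate);
```

Queries joining on `KeyHash` should still compare A, B, C, and D as residual predicates, since distinct combinations can in principle hash to the same value.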

I understand why selectivity is important for indices in general, but in this case I'm using all four fields; the system can, in principle, scan through the table once and just find the latest EffectiveDate for each combo of {A, B, C, D} - and those rows are contiguous. Thanks for the link, I will check that out!
– Jon of All Trades, Jan 26 '12 at 20:35
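That single-pass "latest per combo" scan can also be requested more directly with a window function, which the optimizer can often satisfy with one ordered scan of the clustered index (a sketch, with the same assumed table name as above):

```sql
-- Number each combination's rows newest-first; keep only the newest.
SELECT A, B, C, D, EffectiveDate, E
FROM (
    SELECT A, B, C, D, EffectiveDate, E,
           ROW_NUMBER() OVER (PARTITION BY A, B, C, D
                              ORDER BY EffectiveDate DESC) AS rn
    FROM ValueHistory
) AS ranked
WHERE rn = 1;
```

Because the PARTITION BY and ORDER BY columns match the clustered key order exactly, no sort should be needed.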

Re: Smith's link: very interesting discussion! That could be a contributing problem, as the four VARCHARs have a total declared size of 600, but the average actual character count is 48. Unfortunately, I do have a handful of longer values, and the data could get wider still in the future, so I hesitate to cut it down.
– Jon of All Trades, Jan 26 '12 at 20:41