Database Administrators Stack Exchange is a question and answer site for database professionals who wish to improve their database skills and learn from others in the community. It's 100% free, no registration required.

I have fairly large fact table (2 billion records, approx 120 GB). This table is not partitioned and the queries are very slow to respond. I am planning to partition the table and indexes.
The table has an identity column which is the primary key and has a clustered index on it. There are other non-clustered indexes on it but I won't go in the details much here. The column, I am trying to partition is not part of primary key but is not null and this is creating a slight dilemma for me. I have two options.

I add this column as part of primary key i.e. composite primary key. Since the first column is identity, the combination would always be unique which means I don't have to worry about the applications accessing the table. The clustered index will automatically be partition aligned and other indexes can also be partition aligned.

The seconds option is to remove the clustered index on the identity column and make it unique non clustered. This index cannot be partitioned aligned since partition key is not part of it and hence would have to sit on one drive. Then create a clustered index on the partition key column which can be partition aligned and so all the other non clustered indexes.

Our DBA is in favour of second option since he doesn't want to change primary key. I am worried about the performance hit in option 2 since the index is not partition aligned.

I would appreciate any feedback plus any other method you would have used in such situation.

2 Answers
2

I hate to state the obvious but I'd say test both scenarios and run your production queries against all 3 scenarios (scenario 1 being the current non partitioned one). The reason I state this is because I don't know what your code is querying. Do they actually have a benefit of being sorted in the base table by the identity column as opposed to the other column? For example, are your queries actually looking for the row ID often? If you're not sure, you might be surprised by getting some performance benefits. The general idea of using ID as the clustered key was for range scans, but in our case, we scan by customerID and Date, so it worked out perfectly for us, and perhaps you. Check out this article by Kim Tripp:
http://www.sqlskills.com/blogs/kimberly/post/The-Clustered-Index-Debate-Continues.aspx

"What is often cited as the “reason” for IDENTITY PRIMARY KEY clustered index definitions is its monotonic nature, thus minimizing page splits. However, I argue that this is the only “reason” for defining the clustered index as such, and is the poorest reason in the list. Page Splits are managed by proper FILLFACTOR not increasing INSERTS. Range Scans are the most important “reason” when evaluating clustered index key definitions and IDENTITies do not solve this problem.Moreover, although clustering the IDENTITY surrogate key will minimize page splits and logical fragmentation due to its monotonic nature, it will not reduce EXTENT FRAGMENTATION, which can cause just as problematic query performance as page splitting.
In short, the argument runs shallow
"

Typically you want your clustered index to be as narrow as possible, unique, and non nullable as it get's carried into all other indexes. I have a table with roughly 10 billion rows and we partition off the datetime column, which has worked out great.

hey thanks for that. I will go through the link. There is a cube on top of the fact table. This cube will be also partitioned by same partition key I use in the table. No other application accesses the table directly AFA I know. So I am hoping that SELECT * FROM ..WHERE PartionKey BETWEEN X AND Y type of queries would be faster. However, if we go by option 2, the SQL Server will enforce the primary key using non clustered index which is on a single drive. This might slow down the INSERT operation. Any thought?
–
SeanNov 9 '12 at 16:39

1

Hi Sean, My thought was why even have the identity column as part of your clustered index to begin with? Do you know if your app really needs that? From what you described it seems like you don't need it but I could be wrong. How about removing IDENTITY from the clustered key?
–
Ali RazeghiNov 9 '12 at 16:56

The date or date key can be an EXCELLENT choice in clustering and partitioning for a fact table in a lot of cases.

If you are often scanning by a date range, including date in your criteria this can work well. By partitioning on the date and aligning your clustered index by changing it (yeah that is a bit of work for the DBA but not much and worth it.. I actually like Identity Clustered Indexes and surrogate keys in a lot of tables, but for fact tables I tend to go with a date value.. especially when date is included in most of your queries to the fact table, and in a lot of cases it is... Lot of range scans by date, for instance) you can really take advantage of partition elimination during work. Say you normally query by month, partitioning by month and aligning can potentially eliminate all of the partitions except for the month you are querying. Now you are scanning along that one partition instead of the big table as a whole.

You really need to test as SQL-Learner says.. But date is monotonically increasing in inserts typically.. Would be aligned if you changed the clustering key (really no reason to not change clustering key.. Do you ever even use that identity for ANYTHING on your fact table?

If this were a dimension, then I would be having a different discussion. In that case, I would ask which key you relate your dimension to your fact table by. It would either be a natural key or a surrogate key and in that case, the identity column would be a fine clustered key for the dimensions. The fact table is a little different. In most warehouse applications, the fact table isn't usually being joined to other tables by the fact table's key, but it is being joined to dimensions by their keys. The date clustered index can be a good approach to at least try in warehouse fact tables.