First a bit of background. I recently started a job with a somewhat complicated setup for a 50 TB database. There is a 3 node failover cluster with two named SQL Server 2012 Enterprise instances with a linked server established from the first to the second instance. There are three very large tables in this database, and one is super-extra-large, and accounts for 40 of those TB. Of course I'm having issues with that very same large table, for "Probe" data.

The data has been sharded by year and month into separate databases, i.e. 2013JUL, 2013JUN, 2013MAY... 2012JAN and there is a main database MAINDB where the large tables have been removed, and instead a distributed view created which does a UNION ALL across the individual database Probe tables, including those on the linked server (data older than 2012 is on this second instance). This approach was taken in order to be as transparent as possible to the leading application. There is a check constraint on each individual probe table to ensure the Probe view in MAINDB allows inserts.

The Probe table is also partitioned in weeks on the clustering key. The clustering key is a datetime2 column named "reportStamp" in order to support date range based queries, and the primary key is a composite with reportStamp and a bigint Id column.

There is a single non-clustered index on ProbeData table with the following definition:

We are having performance issues on a query that is returning the most recent probe parameter value (column paramValue) within the last 7 days. To date the query has been invoked ad hoc from an SSIS package, and my goal is to wrap this in a stored proc.

We should know vehicleId, parameterId, and a reportStamp range, and should be able to fully leverage this index as we are attempting to return paramValue.

What we are seeing is sometimes very good performance and takes less than a minute, and other times it takes many minutes, and I have been told as much as a day. The few times I have "caught it in the act" I see the logical and physical read counts in the several millions, so it sorta feels like this query is tipping. I believe the above index should be covering, and my understanding is a covering non-clustered index won't tip. Maybe there is a nuance where paramValue needs to be an included column and is being ignored even though it's a key column? Maybe a parameter sniffing issue?

There is a great deal of contention from the leading application always doing inserts, thus the NOLOCK hint. We don't have the disk iron for me to be comfortable enabling snapshot isolation. In my limited testing I have observed too much TempDB contention.

I have tried adding OPTION (RECOMPILE) and observe two very different query plans. Also resorted to the FORCESEEK hint. I've tried to attach the XML to this post. Plan1.sqlpan is very complex, and Plan2.sqlplan appears relative simple.

As stated before the requirement is to go back 7 days, and I have tried in stages to start with 1 day, then 2, then 3 and observe performance and plans.

Sometimes I also get a missing index very similar to the above existing index, but with different key order and paramValue is an included column.

If the optimizer is not leveraging the non-clustered index, that would explain the disparity in performance. I haven't yet spun up profiler or gotten brave enough with extended events yet to see if this is the case. I think the approach before the Probe table was sharded and the distributed view came into the picture was to use explicit INDEX query hints, which has been now been negated by the view.

Any insights or pointers into how I should troubleshoot this critter are greatly appreciated!

I would LOVE to help you with this, but it goes WAY beyond the primary purpose of a forum post - that being short, targeted, straight-forward assistance. What you have isn't any of those.

My best advice is to get a really sharp performance tuning professional on board for a review. There are some good ones on this forum and elsewhere. Perhaps someone else will decide to jump in on their free time and try to help you out with this one. Best of luck!

I appreciate the response, and yes I don't expect someone to volunteer for a remote DBA session and solve my problem for free. I just tried to flesh out a full background context and asked for some general pointers on approaches to take to solve this problem.

I was hoping for insights on some general approaches like "I would watch DMV 'abc' or you really need to boot up a profiler trace and watch 'xyz'." I know I've seen helpful general hints like this on this forum before. I'm trying to understand reasons why the optimizer will choose a completely different plan or disregard an index. Yes I know the stock answer is always "it depends", but that list of things it depends on is not infinite. Isn't the point of this forum to encourage learning and foster understanding?

I've dug a bit deeper, and I think what is happening is that sometimes the generated query plan undergoes partition elimination at compile time versus at runtime. When at runtime, I have verified that the number of executions is zero for tables with check constraints outside my range.

I still don't quite understand why the non-clustered index is sometimes ignored when it should be covering, but I have had the most luck with the FORCESEEK query hint.

jallmond (7/16/2013)UPDATE: For anyone that may encounter a similar issue.

I've dug a bit deeper, and I think what is happening is that sometimes the generated query plan undergoes partition elimination at compile time versus at runtime. When at runtime, I have verified that the number of executions is zero for tables with check constraints outside my range.

I still don't quite understand why the non-clustered index is sometimes ignored when it should be covering, but I have had the most luck with the FORCESEEK query hint.

Thanks for posting the follow-up. Lots of posters don't realize that this makes the forum MUCH more valuable as people searching can find problems AND solutions!

I think another of the root causes is that DPVs just never got any real loving from the product team and haven't had much attention in many years I don't think.

I don't know that this will help much, but, I'd look at the properties on the SELECT operator. Is the optimizer completing optimization or is it timing out? If the latter, you may just be seeing plan instability. In that case, I think you're already on the right track with a few query hints to move the optimizer in the right direction. If the former... I'm not sure what's going on.

In difference to Kevin, I think that that there is some hope that we may be able to help you in this forum, and save you the cost from hiring a consultant. But no promises, we'll see.

In any case, the difference between the two plans is that in the good plan, the optimizer did partition elimination just fine, and only queried the database for July. In the bad plan, it queries all tables in the partitioned view.

So why would the optimizer do this? I have some theories, but I first need to see the view definition, and the definition of all tables, including their indexes, in the view. (Please post the table scripts as an attachment.)