Conor vs. Traceflags

I got a question from a reader asking me to explain various cases for the use of one of our traceflags. The flag in question doesn’t matter per se (the only official documentation on the flag, 2301, is at http://support.microsoft.com/kb/920093). I am going to talk a bit about some of the things we do when developing new features and why such a traceflag might exist and might be off by default.

Optimizers are hard to build. Well, after awhile you can figure out how to optimize well for any single kind of workload. Building optimizers that work for many kinds of workloads is hard. Often there are problems in the way data is modeled, the compilation time, the memory use of the plans, or SOMETHING that causes problems. We’re pretty much *always* working on some set of problems that we’d like to find a way to improve the overall plan quality for our customers. Some of those ideas may be good. Some of those ideas may ultimately seem like good ideas but are not good enough or complete enough to go into the full product.

We have lots of traceflags. Some are documented and supported. Some are undocumented or unsupported. 2301 is documented but is not on by default (and likely never will be). It can help some workloads that do DSS/analytics but it hurts compilation time. It deals mostly with how we do cardinality estimation, which is the process where we determine how many rows we think each operator in our algebra representation of your query will return. When cardinality estimates are pretty close, the optimizer does a pretty good job in almost every case. When it is wrong, things are less consistently good (meaning that sometimes we will still pick a pretty good plan but other times we do so less reliably). Good cardinality estimates are important, but the cost to get them can be extremely high for some scenarios.

Some other blog posts on this subject have discussed having the optimizer “keep going” as a good thing, perhaps in the same manner as getting “more exercise” is a “good thing”. In some sense, the optimizer might be able to find a better plan in some cases for some queries when it searches more. However, this is usually not a global guarantee. In other words, searching more is only valuable if the cardinality and costing models are perfect. In practice, these will have inaccuracies. When we talk about plan quality internally, we actually will evaluate a workload and measure whether it is better/worse overall, not for individual queries. That’s because there _will be_ queries that are slower. If there are large numbers of these or large outliers, then that kind of workload may be worse off with “more searching” because really it’s a probability density function, not “more exercise”.

Generally, motivated users of our software can go try various traceflags for their workload and see if it helps them. I don’t want to go into depth on this because I don’t have your workload and giving you guidance isn’t really my goal here. The point is that we did a traceflag in this case that isn’t on-by-default and won’t be. We do those things sometimes so that we can try different workload classes and see how they behave. Testing out ideas is key to learning, and often a few key traceflags, even if they don’t show up in the on-by-default feature set, actually are quite helpful to us because they can be useful to bake ideas for future features, to help us better classify workloads, and to generally get clarity on a space where we needed more clarity in evaluating our models and features against what our customers were doing with the product.

I don’t recommend you use traceflags unless you are having specific problems in that area – if you are a DSS workload, this is something you can go try. If you are an OLTP workload, don’t try. If you don’t know what kind of workload you are, please don’t turn it on . It’s not a full feature, and even in DSS workloads it might not really be good for your workload.

(So I guess this is a long-winded way of saying “sorry, I’m not going to talk about that TF” )

I would like to add that there should be a setting which I could use to instruct optimizer to get the best plan even if takes say 5 minutes (even though overall query might run in just a matter of seconds). I could use this feature in the dev envrionement to find what is the best plan for my query.

When the setting is off and the optimizer is not picking the best plan. I might use some kind of hint to get the best plan.