Wednesday, April 25, 2012

Working with databases under load sooner or later leads to timeouts. Sometimes they occur for a good reason, sometimes not. Either way, you have to cope. The reason for the timeout is very rarely the wrong timeout period value.

Good reasons for timeouts: You have a long-running task, such as taking a backup, restoring some important data, or any other kind of task that simply will take time.

Bad reasons: You have a poorly written (or poorly optimized, as it often happens) query, or you are placing too much load on the server(s).

As mentioned, either way you have to cope, when it happens. But in the case of the bad reasons, you really should try and remove the reason for the timeout.

I won't dive into the universe of sql query optimizing here.

When a timeout occurs, the reason is that the server is busy for some reason, and you basically have to wait, and check again later, if it's free. It's like when a person is busy, if you keep poking them to ask if they're done yet, chances are it will just delay them. And when one timeout happens, more will follow. Many times I've seen timeouts occuring on one part of the server, related to badly written queries in a totally different part of the server.

I see sometimes people suggest to increase the timeout period. But in my opinion, unless you have a ridiculously low timeout value, this is not a solution, simply a way of brushing it under the carpet, or treating the sypmtoms instead of the disease.

If you're queuing up your queries, or processing them async, and you increase your timeout you will have more queries in the pipeline when the timeout occurs, and you'll have a bigger mess to clean up.

You have to know your domain, and set your timeouts accordingly. Shorter timeout will at some point lead to more timeouts occuring, but they will occur sooner and you will spend less time cleaning up. If your query never should take more than 5 seconds to execute, and your timeout is set to 30 seconds, that's 25 seconds spent in vain. Which may sound picky, but it all adds up...

You should also know your system well enough to be able to prioritize the different queries - if something is not important, leave it for later (on a low priority queue, for example) or even just log it and move on. In fact, if logging it and moving on is sufficient, explore the option of removing the query altogether. It sounds like it may not be that important.

If it is important, and you really need to carry out the query (which is usually the case) then you have a few options. In both of them I'd suggest catching SqlExceptions and then try to determine if they were caused by a timeout (or a deadlock - this approach also works for them). If that is the case, let the thread sleep for a short while (to allow whatever is causing the issue to resolve), then do one of two things.

With a thread processing a queue, you could peek to get the current object, and try and carry out the operation on it. Depending on the result, you could dequeue the element - or just return, and the next time the method is invoked, it will peek the same object from the queue and try to perform the same operation again.

If you're processing data directly, like calling a stored proc or performing a LINQ-to-SQL operation, you can try and call the method recursively. Yes, it doesn't feel as good as the other approach, however it's easy to implement in a smaller solution without refactoring to the queue processing solution. (One thing to be aware of here, is that if you call the method recursively enough times, you will end up with a StackOverflowException. You can avoid this by checking the current call stack, using System.Diagnostics.StackTrace.)

In the end, if you do have a lot of problems with timeouts, and (or) you really need to be sure that your SQL work is carried out, no matter what - you need to look to Microsoft Message Queueing, or MSMQ, which is Microsoft's suggested solution. Leave the critical SQL stuff to the MSMQ handler and let your application simply be a MSMQ messenger.