Discussions on software engineering

My TransientFaultHandling utility classes for DocumentDB

Keluro uses DocumentDB extensively for data persistence. However, its extensive scaling capabilities come at a price: your queries or commands may exceed the amount of request units you are granted. In that case you will receive a 429 “Request rate too large” error, surfaced as a DocumentClientException if you use the .NET SDK. It is then your responsibility to implement retry policies to avoid such failures, waiting the proper amount of time before retrying.

Edit: see the comment below. The v1.8.0 release of the .NET SDK introduces configuration settings for these retry policies.

Microsoft provides some samples on how to handle this 429 “Request rate too large” error, but they only cover commands, such as inserting or deleting a document; there is no sample on how to implement retry policies for common queries. A NuGet package, “Microsoft.Azure.Documents.Client.TransientFaultHandling”, is also available, and even though integrating it is as quick as an eye blink, it has no logging capabilities. In my case it did not really solve my exceeded-RU problem, I even doubt that I got it working, and the code is not open source. So I decided to integrate the ideas from the samples into my own utility classes on top of the DocumentDB .NET SDK.

The idea is similar to the “TransientFaultHandling” package: wrap the DocumentClient inside another class exposed only through an interface. By all accounts, abstracting the DocumentClient behind an interface is a good thing for testability purposes. In our case this interface is named IDocumentClientWrapped.
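As an illustration, such a wrapper could look like the following sketch. Only the name IDocumentClientWrapped comes from this post; the members shown are assumptions about what a minimal version would expose.

```csharp
using System;
using System.Threading.Tasks;

// Sketch of the wrapper interface. Only the name IDocumentClientWrapped
// comes from the post; the members below are illustrative assumptions.
public interface IDocumentClientWrapped : IDisposable
{
    // Counterpart of DocumentClient.CreateDocumentQuery<T>, returning a
    // retry-aware queryable instead of a raw IQueryable<T>.
    IRetryQueryable<T> CreateDocumentQuery<T>(string collectionLink);

    // Command-style operations, executed under the same retry policy.
    Task CreateDocumentAsync(string collectionLink, object document);
    Task DeleteDocumentAsync(string documentLink);
}
```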

Instead of returning an IQueryable<T> instance as DocumentClient would do, we return an IRetryQueryable<T>. This latter type, whose definition will follow, is also a wrapper on the IQueryable<T> instance returned by the DocumentDB client. However, this interface explicitly retries when the enumeration fails because of a 429 “Request rate too large” exception raised by the database engine, DocumentDB in our case.

In this interface we only expose the extension methods that are actually supported by the “real” IQueryable<T> instance returned by DocumentDB: Select, SelectMany, Where, etc. For example, at the time of writing, GroupBy is not supported: you would get a runtime exception if you used it directly on the IQueryable<T> instance returned by DocumentClient.
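The shape of this interface could be sketched as follows. The type name and the idea of re-exposing only the supported operators come from the post; the exact signatures are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;

// Sketch of the retry-aware queryable. Only the operators actually
// supported by the DocumentDB LINQ provider are re-exposed
// (no GroupBy, for instance).
public interface IRetryQueryable<T>
{
    IRetryQueryable<TResult> Select<TResult>(Expression<Func<T, TResult>> selector);
    IRetryQueryable<TResult> SelectMany<TResult>(Expression<Func<T, IEnumerable<TResult>>> selector);
    IRetryQueryable<T> Where(Expression<Func<T, bool>> predicate);

    // Enumeration retries when DocumentDB answers 429 "Request rate too large".
    IEnumerable<T> AsRetryEnumerable();
}
```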

If you call AsEnumerable() before applying the Where constraint, the filter is performed “in memory” by your .NET application server. It means that all the data has been fetched from DocumentDB to your app server. If MyType contains a lot of data, and/or if the Where constraint filters out a lot of documents, you will probably have a bottleneck.
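The difference can be seen in this hypothetical example, where `client`, `collectionLink` and `MyType` are placeholders:

```csharp
using System.Linq;

// Server-side: the predicate is translated into a DocumentDB query,
// so only matching documents travel over the wire.
var cheap = client.CreateDocumentQuery<MyType>(collectionLink)
    .Where(d => d.Price < 10)
    .AsEnumerable();

// In-memory: AsEnumerable() ends the translatable part of the query,
// so every document is fetched to the app server before Where runs locally.
var costly = client.CreateDocumentQuery<MyType>(collectionLink)
    .AsEnumerable()
    .Where(d => d.Price < 10);
```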

Let us get back to our problem. Now that we have seen that adding a retry policy to a query only means calling AsRetryEnumerable() instead of AsEnumerable(), let us jump to the implementation of those classes.
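As a hedged usage sketch (again, `client`, `collectionLink` and `MyType` are placeholders), the only visible change compared to a plain query is the final call:

```csharp
using System.Linq;

// Enumeration retries on 429 instead of throwing.
var books = client.CreateDocumentQuery<MyType>(collectionLink)
    .Where(d => d.Category == "books")
    .AsRetryEnumerable()
    .ToList();
```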

The idea is to use an IEnumerator that “retries” and two utility methods: ExecuteWithRetry and ExecuteWithRetryAsync. The former is for basic mono-threaded calls while the latter is for async/await contexts. Most of this code is verbose because it is only wrapping implementation. I hope it will be helpful for others.
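A minimal sketch of these two helpers could look like this; the method names come from the post, the bodies are illustrative, assuming the policy honors the RetryAfter hint carried by DocumentClientException:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Documents; // DocumentClientException

public static class RetryExecutor
{
    private const int MaxAttempts = 10; // illustrative limit

    public static async Task<T> ExecuteWithRetryAsync<T>(Func<Task<T>> operation)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (DocumentClientException e)
                when ((int?)e.StatusCode == 429 && attempt < MaxAttempts)
            {
                // Wait the amount of time the server asked for, then retry.
                await Task.Delay(e.RetryAfter);
            }
        }
    }

    public static T ExecuteWithRetry<T>(Func<T> operation)
    {
        // Synchronous counterpart for mono-threaded call sites.
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (DocumentClientException e)
                when ((int?)e.StatusCode == 429 && attempt < MaxAttempts)
            {
                Thread.Sleep(e.RetryAfter);
            }
        }
    }
}
```

The retrying enumerator can then simply wrap each MoveNext() call of the underlying query enumerator in ExecuteWithRetry.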

Glad to hear that you use DocumentDB as your NoSQL data store. I went through your post and wanted to mention that we have built-in support for a basic retry mechanism for 429s for all requests (queries as well as commands) starting with the 1.8.0 release of the DocumentDB .NET SDK.

This is done by exposing a RetryOptions property on the ConnectionPolicy instance that gets passed to the DocumentClient constructor. By default, all requests are retried 9 times (so that you have 10 attempts for each request), and the policy uses the retryAfter response header to determine how long to wait between attempts. There is a maximum wait time of 30 seconds for each request, after which it will throw. Both these values (MaxRetryAttemptsOnThrottledRequests and MaxRetryWaitTimeInSeconds) can be overridden on the RetryOptions instance.
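A minimal configuration sketch of what is described above; the account endpoint and key are placeholders, and the values shown are simply the stated defaults:

```csharp
using System;
using Microsoft.Azure.Documents.Client;

// Configuring the built-in retry policy available since .NET SDK 1.8.0.
var connectionPolicy = new ConnectionPolicy();
connectionPolicy.RetryOptions.MaxRetryAttemptsOnThrottledRequests = 9;
connectionPolicy.RetryOptions.MaxRetryWaitTimeInSeconds = 30;

var client = new DocumentClient(
    new Uri("https://myaccount.documents.azure.com:443/"),
    "<authorization key>",
    connectionPolicy);
```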

Please give it a try and let us know if you have any feedback to further improve the retry mechanism.

We will look into the tracing issue that you mentioned to see if there are better ways to expose the retry related information for each request.

Hi Rajesh,
thank you very much for your insights. Indeed, my .NET SDK package has an update waiting: version 1.8.0. I will definitely have a look at this! Maybe you should mark the NuGet package “Microsoft.Azure.Documents.Client.TransientFaultHandling” as deprecated.