Failure mode analysis

Failure mode analysis (FMA) is a process for building resiliency into a system, by identifying possible failure points in the system. The FMA should be part of the architecture and design phases, so that you can build failure recovery into the system from the beginning.

Here is the general process to conduct an FMA:

Identify all of the components in the system. Include external dependencies, such as identity providers, third-party services, and so on.

For each component, identify potential failures that could occur. A single component may have more than one failure mode. For example, you should consider read failures and write failures separately, because the impact and possible mitigations will be different.

Rate each failure mode according to its overall risk. Consider these factors:

What is the likelihood of the failure? Is it relatively common? Extremely rare? You don't need exact numbers; the purpose is to help rank the priority.

What is the impact on the application, in terms of availability, data loss, monetary cost, and business disruption?

For each failure mode, determine how the application will respond and recover. Consider tradeoffs in cost and application complexity.

As a starting point for your FMA process, this article contains a catalog of potential failure modes and their mitigations. The catalog is organized by technology or Azure service, plus a general category for application-level design. The catalog is not exhaustive, but covers many of the core Azure services.

App Service

App Service app shuts down.

Detection. Possible causes:

Expected shutdown

An operator shuts down the application; for example, using the Azure portal.

The app was unloaded because it was idle. (Only if the Always On setting is disabled.)

Unexpected shutdown

The app crashes.

An App Service VM instance becomes unavailable.

Logging in the Application_End method will catch an app domain shutdown (a soft process crash), and is the only way to detect application domain shutdowns.

Recovery

If the shutdown was expected, use the application's shutdown event to shut down gracefully. For example, in ASP.NET, use the Application_End method (a sketch follows this list).

If the application was unloaded while idle, it is automatically restarted on the next request. However, you will incur the "cold start" cost.
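
As an illustration, here is a minimal Global.asax.cs sketch that hooks the shutdown event and records why the app domain stopped. The Trace call stands in for whatever logging the application actually uses.

```csharp
// Global.asax.cs: log the reason for an app domain shutdown.
using System.Web;
using System.Web.Hosting;

public class Global : HttpApplication
{
    protected void Application_End()
    {
        // HostingEnvironment.ShutdownReason reports why the app domain is
        // stopping (idle timeout, configuration change, and so on).
        ApplicationShutdownReason reason = HostingEnvironment.ShutdownReason;
        System.Diagnostics.Trace.TraceInformation(
            "Application shutting down. Reason: {0}", reason);

        // Flush buffers, release connections, and finish in-flight work here.
    }
}
```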

Azure Search

Writing data to Azure Search fails.

The Search .NET SDK automatically retries after transient failures. Any exceptions thrown by the client SDK should be treated as non-transient errors.

The default retry policy uses exponential back-off. To use a different retry policy, call SetRetryPolicy on the SearchIndexClient or SearchServiceClient class. For more information, see Automatic Retries.

Reading data from Azure Search fails.

The Search .NET SDK automatically retries after transient failures. Any exceptions thrown by the client SDK should be treated as non-transient errors.

The default retry policy uses exponential back-off. To use a different retry policy, call SetRetryPolicy on the SearchIndexClient or SearchServiceClient class. For more information, see Automatic Retries.
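As an illustration, a sketch of swapping in a custom retry policy with the classic Microsoft.Azure.Search SDK might look like the following. The service name, index name, API key, and back-off values are placeholder assumptions.

```csharp
// Sketch: overriding the default retry policy on the classic Azure Search
// .NET SDK. Service name, index name, API key, and back-off values are
// placeholders.
using System;
using Microsoft.Azure.Search;
using Microsoft.Rest.TransientFaultHandling;

var client = new SearchIndexClient(
    "my-search-service", "my-index", new SearchCredentials("<api-key>"));

// Retry up to 5 times on transient HTTP errors, with exponential back-off
// between 1 and 20 seconds.
client.SetRetryPolicy(
    new RetryPolicy<HttpStatusCodeErrorDetectionStrategy>(
        new ExponentialBackoffRetryStrategy(
            5, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(20),
            TimeSpan.FromSeconds(2))));
```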

Cassandra

Reading or writing to a node fails.

Detection. Catch the exception. For .NET clients, this will typically be System.Web.HttpException. Other clients may have other exception types. For more information, see Cassandra error handling done right.

Cosmos DB

Reading data fails.

The SDK automatically retries failed attempts. To set the number of retries and the maximum wait time, configure ConnectionPolicy.RetryOptions. Exceptions that the client raises are either beyond the retry policy or are not transient errors.

If Cosmos DB throttles the client, it returns an HTTP 429 error. Check the status code in the DocumentClientException. If you are getting error 429 consistently, consider increasing the throughput value of the collection.

If you are using the MongoDB API, the service returns error code 16500 when throttling.

Replicate the Cosmos DB database across two or more regions. All replicas are readable. Using the client SDKs, specify the PreferredLocations parameter. This is an ordered list of Azure regions. All reads will be sent to the first available region in the list. If the request fails, the client will try the other regions in the list, in order. For more information, see How to setup Azure Cosmos DB global distribution using the SQL API.
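
As a sketch, configuring both the retry options and the preferred read regions on the DocumentClient might look like the following; the endpoint, key, regions, and retry values are placeholders.

```csharp
// Sketch: configuring retries and multi-region reads on the Cosmos DB
// .NET SDK (Microsoft.Azure.Documents.Client).
using System;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

var connectionPolicy = new ConnectionPolicy();

// Retry throttled (HTTP 429) requests up to 9 times, waiting at most
// 30 seconds overall before surfacing the error to the caller.
connectionPolicy.RetryOptions.MaxRetryAttemptsOnThrottledRequests = 9;
connectionPolicy.RetryOptions.MaxRetryWaitTimeInSeconds = 30;

// Ordered list of read regions; the SDK sends reads to the first
// available region in the list.
connectionPolicy.PreferredLocations.Add(LocationNames.WestUS);
connectionPolicy.PreferredLocations.Add(LocationNames.EastUS);

var client = new DocumentClient(
    new Uri("https://my-account.documents.azure.com:443/"),
    "<account-key>",
    connectionPolicy);
```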

Diagnostics. Log all errors on the client side.

Writing data fails.

The SDK automatically retries failed attempts. To set the number of retries and the maximum wait time, configure ConnectionPolicy.RetryOptions. Exceptions that the client raises are either beyond the retry policy or are not transient errors.

If Cosmos DB throttles the client, it returns an HTTP 429 error. Check the status code in the DocumentClientException. If you are getting error 429 consistently, consider increasing the throughput value of the collection.

Replicate the Cosmos DB database across two or more regions. If the primary region fails, another region will be promoted to write. You can also trigger a failover manually. The SDK does automatic discovery and routing, so application code continues to work after a failover. During the failover period (typically minutes), write operations will have higher latency, as the SDK finds the new write region.
For more information, see How to setup Azure Cosmos DB global distribution using the SQL API.

As a fallback, persist the document to a backup queue, and process the queue later.
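
A minimal sketch of that fallback, assuming a hypothetical EnqueueForRetryAsync helper that persists the document to a backup queue:

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

public static class DocumentWriter
{
    public static async Task WriteWithFallbackAsync(
        DocumentClient client, Uri collectionUri, object document)
    {
        try
        {
            await client.CreateDocumentAsync(collectionUri, document);
        }
        catch (DocumentClientException ex)
            when (ex.StatusCode == (HttpStatusCode)429)   // throttled
        {
            // The SDK's automatic retries were exhausted while throttled.
            // Persist the document to a backup queue for later processing.
            await EnqueueForRetryAsync(document);
        }
    }

    // Hypothetical helper: persist the document to a backup queue
    // (for example, a Storage or Service Bus queue) for later replay.
    static Task EnqueueForRetryAsync(object document) => Task.CompletedTask;
}
```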

Service Bus

Messages that cannot be delivered to any receiver are placed in a dead-letter queue. Use this queue to see which messages could not be received. There is no automatic cleanup of the dead-letter queue. Messages remain there until you explicitly retrieve them. See Overview of Service Bus dead-letter queues.

Writing a message to a Service Bus queue fails.

Detection. Catch exceptions from the client SDK. The base class for Service Bus exceptions is MessagingException. If the error is transient, the IsTransient property is true.

The Service Bus client automatically retries after transient errors. By default, it uses exponential back-off. After the maximum retry count or maximum timeout period, the client throws an exception. For more information, see Service Bus retry guidelines.

If the queue quota is exceeded, the client throws QuotaExceededException. The exception message gives more details. Drain some messages from the queue before retrying, and consider using the Circuit Breaker pattern to avoid continued retries while the quota is exceeded. Also, make sure the BrokeredMessage.TimeToLive property is not set too high.
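
Putting these together, a send operation with the classic Microsoft.ServiceBus.Messaging SDK might distinguish quota failures from other non-transient errors as in this sketch; the logging is a placeholder.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

public static class OrderSender
{
    public static async Task SendOrderAsync(QueueClient queueClient, string payload)
    {
        try
        {
            await queueClient.SendAsync(new BrokeredMessage(payload));
        }
        catch (QuotaExceededException ex)
        {
            // The queue is full. Back off (for example, trip a circuit
            // breaker) until receivers have drained messages; retrying
            // immediately will keep failing.
            Console.WriteLine("Quota exceeded: {0}", ex.Message);
            throw;
        }
        catch (MessagingException ex) when (!ex.IsTransient)
        {
            // Non-transient error: the client's built-in retries do not
            // apply, so log and surface the failure.
            Console.WriteLine("Non-transient send failure: {0}", ex.Message);
            throw;
        }
    }
}
```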

Within a region, resiliency can be improved by using partitioned queues or topics. A non-partitioned queue or topic is assigned to one messaging store. If this messaging store is unavailable, all operations on that queue or topic will fail. A partitioned queue or topic is partitioned across multiple messaging stores.

For additional resiliency, create two Service Bus namespaces in different regions, and replicate the messages. You can use either active replication or passive replication.

Active replication: The client sends every message to both queues. The receiver listens on both queues. Tag messages with a unique identifier, so the client can discard duplicate messages.

Passive replication: The client sends the message to one queue. If there is an error, the client falls back to the other queue. The receiver listens on both queues. This approach reduces the number of duplicate messages that are sent. However, the receiver must still handle duplicate messages.
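
A minimal sketch of passive replication, assuming two pre-configured QueueClient instances pointing at queues in different regions:

```csharp
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

public static class ReplicatedSender
{
    public static async Task SendWithFallbackAsync(
        QueueClient primaryClient,
        QueueClient secondaryClient,
        string payload,
        string messageId)
    {
        // Reuse the same MessageId on both sends so the receiver can
        // discard duplicates if the message ends up in both queues.
        try
        {
            await primaryClient.SendAsync(
                new BrokeredMessage(payload) { MessageId = messageId });
        }
        catch (MessagingException)
        {
            // Primary send failed: fall back to the secondary region.
            await secondaryClient.SendAsync(
                new BrokeredMessage(payload) { MessageId = messageId });
        }
    }
}
```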

Duplicate message.

Detection. Examine the MessageId and DeliveryCount properties of the message.

Recovery

If possible, design your message processing operations to be idempotent. Otherwise, store message IDs of messages that are already processed, and check the ID before processing a message.

Enable duplicate detection by creating the queue with RequiresDuplicateDetection set to true (a sketch follows these notes). With this setting, Service Bus automatically deletes any message that is sent with the same MessageId as a previous message. Note the following:

This setting prevents duplicate messages from being put into the queue. It doesn't prevent a receiver from processing the same message more than once.

Duplicate detection has a time window. If a duplicate is sent beyond this window, it won't be detected.
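
For illustration, creating such a queue with the classic NamespaceManager might look like this; the connection string, queue name, and detection window are placeholder assumptions.

```csharp
using System;
using Microsoft.ServiceBus;
using Microsoft.ServiceBus.Messaging;

var namespaceManager =
    NamespaceManager.CreateFromConnectionString("<connection-string>");

var queueDescription = new QueueDescription("orders")
{
    RequiresDuplicateDetection = true,
    // A message whose MessageId was already seen within this window
    // is silently dropped by Service Bus.
    DuplicateDetectionHistoryTimeWindow = TimeSpan.FromMinutes(10)
};

if (!namespaceManager.QueueExists("orders"))
{
    namespaceManager.CreateQueue(queueDescription);
}
```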

Diagnostics. Log duplicated messages.

The application cannot process a particular message from the queue.

Detection. Application specific. For example, the message contains invalid data, or the business logic fails for some reason.

Recovery

There are two failure modes to consider.

The receiver detects the failure. In this case, move the message to the dead-letter queue. Later, run a separate process to examine the messages in the dead-letter queue.

The receiver fails in the middle of processing the message, for example, due to an unhandled exception. To handle this case, use PeekLock mode. In this mode, if the lock expires, the message becomes available to other receivers. If the message exceeds the maximum delivery count or the time-to-live, it is automatically moved to the dead-letter queue. A sketch of this pattern follows.
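
A sketch of PeekLock processing with the classic SDK, where ProcessOrder is a hypothetical business-logic method:

```csharp
using System;
using Microsoft.ServiceBus.Messaging;

public static class OrderReceiver
{
    public static void Start()
    {
        var queueClient = QueueClient.CreateFromConnectionString(
            "<connection-string>", "orders", ReceiveMode.PeekLock);

        queueClient.OnMessage(message =>
        {
            try
            {
                ProcessOrder(message);
                message.Complete();   // success: remove the message
            }
            catch (InvalidOperationException)
            {
                // The message can never be processed (for example, invalid
                // data): move it to the dead-letter queue for inspection.
                message.DeadLetter("ProcessingError", "Invalid order payload");
            }
            catch (Exception)
            {
                // Transient failure: release the lock so another attempt
                // can be made. After the maximum delivery count, Service
                // Bus dead-letters the message automatically.
                message.Abandon();
            }
        });
    }

    // Hypothetical business logic; throws when the message is invalid.
    static void ProcessOrder(BrokeredMessage message) { }
}
```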

Virtual Machine

VM instance becomes unavailable or unhealthy.

Recovery. For each application tier, put multiple VM instances into the same availability set, and place a load balancer in front of the VMs. If the health probe fails, the load balancer stops sending new connections to the unhealthy instance.