Understanding and resolving failures in Windows Store apps

As Windows Store app builders, we’re sure you’re aware that apps can experience a variety of failures: hangs, crashes, unavailable services and resources, and any number of unexpected operating conditions. These failures impact your customers’ experience, and if left unaddressed they can even cause customers to stop using your app or move to a competitor’s product. What you might not be aware of is that Microsoft makes failure data available to help you build a more stable, reliable app that your customers will love. This post gives you a brief introduction to Windows Error Reporting (WER) telemetry and how it is surfaced in the Store dashboard.

In addition, we’ll share some of the “real world best practices” guidance that has emerged from WER telemetry, including insights from Windows Store apps, particularly where many apps are failing due to the same coding and design issues. We hope that this content will inform you, as a Windows Store app developer, of some of the unexpected failures that might occur post-release, and will help you build better, more reliable apps.

Windows Error Reporting

Since Windows XP, Microsoft operating systems have provided a failure reporting mechanism called Windows Error Reporting (WER). When a user opts in to allow their system to send telemetry data back to Microsoft, WER sends back crash and hang reports, including those from Windows Store apps. A dedicated telemetry system processes the error reports, collects crash and hang dumps, builds failure curves and finds the highest-hitting issues. The MSDN article How WER Collects and Classifies Error Reports offers a high-level view. Additionally, a paper from Microsoft Research titled Debugging in the (Very) Large: Ten Years of Implementation and Experience provides very in-depth information on this process.

The telemetry from WER is used not only within Microsoft; it is also shared with many partners, including IHVs (hardware vendors), ISVs (software vendors), and OEMs (system manufacturers), and it is used for direct outreach for the most severe failures. The portal at http://sysdev.microsoft.com provides WER telemetry to ISVs, IHVs and OEMs in the form of metrics and failure curves that show what the top issues are, and as hang and crash dumps that can be downloaded for debugging.

As might be expected, the broad range of WER telemetry gives Microsoft a unique perspective on the failures affecting apps. While the Store dashboard displays failures affecting published apps for a given developer account, Microsoft can see failures across all apps, including failures that are affecting many apps simultaneously. These typically show up in two forms: failures in platform and framework binaries (such as Windows.UI.Xaml.dll), and crashes of specific types that occur across a large variety of apps.

Failures that affect multiple apps (we’ll call them “multi-app failures” for brevity) may not cause the most crashes for any specific app, but can be significant issues when viewed across all apps. Multi-app failures discovered through the WER telemetry reveal situations where caution is needed in app design. WER telemetry can be considered an augmentation to in-house testing, as it reveals problems that occur in a very diverse population with diverse environments.

The NT Debugging blog has also posted a technically deep article on how to debug Windows Store crash dumps to get the error code and call stack of the issue. Please read Debugging a Windows 8.1 Store App Crash Dump for details on investigating the most common form of Windows Store app crashes.

Guidance on avoiding these failures represents best practice recommendations based on actual customer experience. Implementing the two recommendations below (try/catch and null pointers) will add considerable stability to any app. As with any code change, these changes must be made with care so that the effect is expected and the outcome is desirable: swallowing an exception without any indication may result in a confused user (“I tapped the button but nothing happened?!”).

Best Practice – try/catch blocks

One thing we frequently see through WER telemetry is apps that don’t handle exceptions appropriately via try/catch blocks. Many apps assume that a network call will succeed or that some XML will parse. In the real world, these (normally successful) operations will fail at times, for unpredictable reasons. Given the size and diversity of the Windows ecosystem, it’s safest to assume that the edge cases in your code will be executed. To have a stable app, you cannot trust the network connectivity, the integrity of the data, or the access level of the user.

Mitigation is easy. Add a try/catch block around any code that isn’t guaranteed to succeed. You can choose to do many things with the exception. Hide it, log it, or inform the user. Continue on or gracefully fail. Consume it, throw it again, or throw it as an inner exception of a new exception you make. If it is to be thrown again, you should make sure that it will be caught elsewhere.
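As a rough illustration of two of these choices, here is a minimal sketch in TypeScript (the same pattern applies in any Store app language). The names Settings, loadSettingsOrDefault, loadSettingsStrict and wrapError are hypothetical and exist only for this example; JSON parsing stands in for any operation that is not guaranteed to succeed.

```typescript
// Hypothetical types and names for illustration only.
interface Settings {
  theme: string;
}

interface WrappedError extends Error {
  inner: unknown; // the original failure, kept for diagnostics
}

const DEFAULT_SETTINGS: Settings = { theme: "light" };

// Build a new error that carries the original one as its "inner" error.
function wrapError(message: string, inner: unknown): WrappedError {
  const error = new Error(message) as WrappedError;
  error.inner = inner;
  return error;
}

// Choice 1: consume the failure, log it, and continue with a fallback.
function loadSettingsOrDefault(raw: string, log: (message: string) => void): Settings {
  try {
    const parsed = JSON.parse(raw) as Partial<Settings>;
    if (typeof parsed.theme !== "string") {
      throw new Error("missing 'theme'");
    }
    return { theme: parsed.theme };
  } catch (err) {
    log("Settings could not be read, using defaults: " + String(err));
    return DEFAULT_SETTINGS;
  }
}

// Choice 2: wrap and rethrow. Some caller higher up must catch this.
function loadSettingsStrict(raw: string): Settings {
  try {
    return JSON.parse(raw) as Settings;
  } catch (err) {
    throw wrapError("Settings are corrupt", err);
  }
}
```

Which choice fits depends on whether the caller can do anything useful with the failure; if not, the logged fallback is usually the less surprising option for the user.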

Best Practice – null pointers

Check variables for null before using them as a pointer. Don’t assume that a function is being passed a valid object, and don’t assume functions you call return a valid object. Null values can be a follow-on effect of handling exceptions correctly. For example, a function might normally return an object, but if an error occurs it may return null instead of allowing an exception to be thrown.

A simple if-statement can be used to avoid a function call that would result in an ArgumentNullException, and an object reference that would result in a NullReferenceException.
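A minimal TypeScript sketch of both guards follows. Profile, findProfile and greeting are hypothetical names invented for this example, and the lookup is assumed to return null rather than throw when nothing is found.

```typescript
// Hypothetical data type for illustration only.
interface Profile {
  id: string;
  displayName: string;
}

// Returns null when no profile matches, instead of throwing.
function findProfile(profiles: Profile[], id: string): Profile | null {
  for (const profile of profiles) {
    if (profile.id === id) {
      return profile;
    }
  }
  return null;
}

function greeting(profiles: Profile[], userId: string | null): string {
  // Guard 1: do not call the lookup with a null argument.
  if (userId === null) {
    return "Hello, guest";
  }
  const profile = findProfile(profiles, userId);
  // Guard 2: do not dereference a result that may be null.
  if (profile === null) {
    return "Hello, guest";
  }
  return "Hello, " + profile.displayName;
}
```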

Applying these best practices helps avoid most problems and should be done across your entire codebase. Some specific issues warrant extra attention, though, because they are prevalent or not widely known. Below we go over the incorrect handling of navigation state (suspension), HTTP requests, and XML parsing.

Windows.UI.Xaml.Controls.Frame.GetNavigationState

One of the components where we’ve identified a significant number of crashes is Windows.UI.Xaml.dll!Windows.UI.Xaml.Controls.Frame.GetNavigationState, which serializes the app’s frame navigation history (commonly at suspend). As documented on MSDN, serialization only supports four basic data types (string, char, numeric and GUID). In some cases, apps attempt to store more complex data, and this generates an exception when attempting to serialize the navigation state. Because this exception originates within a Microsoft platform component and not app code, the failures are not surfaced on the Windows Store dashboard and many developers are unaware that their apps are behaving incorrectly and are causing these crashes.

In the failing call stacks, the failure is generated while the app is preparing to be suspended. Note that the failure occurs when retrieving the accumulated navigation history; there is no opportunity to convert to supported types after retrieval.

The failure can be reproduced by navigating to a page with a parameter of an unsupported type. This adds the page and the invalid parameter to the navigation back stack, and the app will fail when Frame.GetNavigationState is called.
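One way to stay within the supported types is to navigate with a simple key and keep the complex object elsewhere. The TypeScript sketch below is only a toy model of that idea and does not call the real Frame API: Article, articleCache, toNavigationParameter, fromNavigationParameter and isSerializableParameter are all hypothetical, and strings and numbers stand in for the basic types that serialization accepts.

```typescript
// Toy model only: the real check happens inside Frame.GetNavigationState.
type NavigationParameter = string | number;

// Hypothetical complex object that should NOT be passed as a parameter.
interface Article {
  id: string;
  title: string;
  tags: string[];
}

// Hypothetical app-side store for objects too complex to serialize.
const articleCache: { [key: string]: Article | undefined } = {};

// Remember the object under a string key and navigate with the key instead.
function toNavigationParameter(article: Article): NavigationParameter {
  articleCache[article.id] = article;
  return article.id;
}

// On the destination page, resolve the key. An in-memory cache is empty after
// the app has been terminated, so the result must be checked for null.
function fromNavigationParameter(parameter: NavigationParameter): Article | null {
  const cached = articleCache[String(parameter)];
  return cached === undefined ? null : cached;
}

// Model of what serialization accepts: basic values pass, objects do not.
function isSerializableParameter(value: unknown): boolean {
  return typeof value === "string" || typeof value === "number";
}
```

The design choice here is that only the key ever reaches the navigation history, so the history stays serializable no matter how the underlying object evolves.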

HTTP requests

Another example of a common failure scenario is when apps fail while making async HTTP calls via System.Net.Http.dll. The top-level error description returned is “An error occurred while sending the request”, but our analysis indicates that there is a wide variety of underlying causes. This is essentially a high-level “something went wrong” error that occurs when using asynchronous HTTP calls. These failures may be caused by transient network or DNS conditions; remote name resolution failure is one of the most common causes. This can be seen in crashes that occur while attempting to reach common URLs.

This is a case where your app does not cause the error, but still needs to be ready to handle unanticipated error conditions. Async network calls can be expected to fail in any number of ways and should always be wrapped in try/catch blocks. Your apps should protect themselves against transient DNS name resolution failures, network congestion, slow response times, server (and service) unavailability, and abrupt connection loss. These are rarely tested and caught during the development phase, as most developers will not have access to test environments that exhaustively mimic networking conditions and the corresponding user behavior. With defensive exception handling in place, your app can handle transient network conditions gracefully and wait and retry, or notify the user and let them decide whether to retry, navigate to another activity in the app, or exit.
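The “wait and retry, then tell the user” approach might look roughly like the TypeScript sketch below. It is a sketch under stated assumptions, not a definitive implementation: fetchWithRetry and FetchResult are hypothetical names, the send and sleep functions are injected so that no particular HTTP client or timer API is assumed, and the attempt count and delays are arbitrary.

```typescript
// Result type so that callers must handle the failure case explicitly.
type FetchResult =
  | { ok: true; body: string }
  | { ok: false; attempts: number; lastError: string };

// `send` performs the request; `sleep` waits for the given milliseconds.
// Both are supplied by the caller (hypothetical signatures).
async function fetchWithRetry(
  send: (url: string) => Promise<string>,
  url: string,
  sleep: (ms: number) => Promise<void>,
  maxAttempts: number = 3,
  baseDelayMs: number = 500
): Promise<FetchResult> {
  let lastError = "no attempt made";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { ok: true, body: await send(url) };
    } catch (err) {
      lastError = String(err);
      if (attempt < maxAttempts) {
        // Exponential backoff: the delay doubles after every failed attempt.
        await sleep(baseDelayMs * Math.pow(2, attempt - 1));
      }
    }
  }
  // Out of attempts: report the failure instead of throwing, so the caller
  // can notify the user and offer to retry or do something else.
  return { ok: false, attempts: maxAttempts, lastError: lastError };
}
```

Because the failure comes back as a value, the calling page can decide whether to show a message, offer a retry button, or fall back to cached content.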

Note that Windows 8.1 offers the new Windows.Web.Http namespace that allows you to modularize your HTTP networking code, including filters you write to manage network communication and error handling. This may allow you to separate your app and business logic from your network error handling logic, for easier network failure management. Refer to the HttpClient sample for examples of using filters.
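To illustrate the general idea of layering error handling around a request, here is a small TypeScript sketch. It is not the Windows.Web.Http filter interface: Send, Filter, retryOnce, friendlyErrors and withFilters are hypothetical names that only model how network policy can be kept apart from app logic.

```typescript
// Hypothetical shapes for illustration; this is not the Windows.Web.Http API.
type Send = (url: string) => Promise<string>;
type Filter = (next: Send) => Send;

// Filter: retry a failed request one more time before giving up.
const retryOnce: Filter = (next) => async (url) => {
  try {
    return await next(url);
  } catch {
    return await next(url);
  }
};

// Filter: replace low-level failures with one app-level error message.
const friendlyErrors: Filter = (next) => async (url) => {
  try {
    return await next(url);
  } catch (err) {
    throw new Error("Could not reach " + url + ": " + String(err));
  }
};

// Wrap the base sender in each filter; the last filter listed is outermost.
function withFilters(base: Send, filters: Filter[]): Send {
  let send = base;
  for (const filter of filters) {
    send = filter(send);
  }
  return send;
}
```

App code then calls the composed sender and never sees the retry or error-mapping details, which is the separation the paragraph above describes.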

XML parsing

When analyzing crash data specific to Windows Store apps, we’ve seen a high prevalence of CLR exception codes related to XML parsing (exceptions of type System.Xml.XmlException). Apps are encountering XML parsing failures that were not seen during app development or during certification testing. These failures are being caused by many factors, from badly formed ads and RSS feeds to corrupted XML to invalid dynamic content. At the time of this writing, System.Xml.XmlException is the second most common cause of multi-app crashes in Windows 8.1 after System.NullReferenceException.

What is extremely important to note here is that WER telemetry frequently shows XML failures in content that is outside the control of your app. Apps that use RSS feeds or ads are particularly affected, as RSS feeds and ad services occasionally deliver content that is auto-generated or malformed, causing app crashes during parsing. Try/catch blocks around the XML parsing code would permit your app to handle unexpected content gracefully. In the same manner as the networking failures mentioned above, defensive coding is especially important when consuming content that does not come from a source you control.

Parsing errors are the most common cause of XML failures. These appear in the WER telemetry in a variety of ways, due to the prevalence of XML parsing code in various forms. Reasons for failures can include unexpected tokens, invalid root data, unclosed elements, and failed XML character decodes. For reference, you can review this list of XML parsing exception codes.
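As a rough sketch of wrapping parsing in try/catch, the TypeScript below treats a feed as untrusted input. The assertWellFormed function is a toy stand-in for a real XML parser (it only understands plain start, end, and self-closing tags), and FeedLoad and loadFeed are hypothetical names for this example.

```typescript
// Toy well-formedness check standing in for a real XML parser.
function assertWellFormed(xml: string): void {
  if (xml.trim() === "") {
    throw new Error("Root element is missing");
  }
  const open: string[] = [];
  const tag = /<(\/?)([A-Za-z][\w.:-]*)[^<>]*?(\/?)>/g;
  let match: RegExpExecArray | null;
  while ((match = tag.exec(xml)) !== null) {
    const isEndTag = match[1] === "/";
    const name = match[2];
    const isSelfClosing = match[3] === "/";
    if (isSelfClosing) {
      continue;
    }
    if (isEndTag) {
      if (open.pop() !== name) {
        throw new Error("Unexpected end tag </" + name + ">");
      }
    } else {
      open.push(name);
    }
  }
  if (open.length > 0) {
    throw new Error("Unclosed element <" + open[open.length - 1] + ">");
  }
}

interface FeedLoad {
  items: number;
  warning: string | null;
}

// A malformed feed becomes an empty result with a warning instead of an
// unhandled exception that takes the app down.
function loadFeed(xml: string): FeedLoad {
  try {
    assertWellFormed(xml);
    const items = xml.match(/<item[\s>]/g);
    return { items: items === null ? 0 : items.length, warning: null };
  } catch (err) {
    return { items: 0, warning: "Feed skipped: " + String(err) };
  }
}
```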

The recommendation when using XML streams is to envision all of the things that can go wrong with streams – particularly remote streams – and practice defensive coding. For example, the underlying connection may be slow, or the network connection might terminate, or a shared FileStream object might be subject to unexpected stream position bugs.

Reading from a FileStream is one place where a failure of this kind can be reproduced.
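The stream position problem can be modeled without any file I/O. The TypeScript sketch below is a toy model, not FileStream itself: TextStream, parseRootName and parseFromStart are hypothetical names, and the parser is assumed to report empty input as a missing root element.

```typescript
// Toy in-memory stream with a read position, standing in for a file stream.
class TextStream {
  position: number = 0;
  private readonly content: string;

  constructor(content: string) {
    this.content = content;
  }

  // Returns everything from the current position and moves to the end.
  readToEnd(): string {
    const rest = this.content.substring(this.position);
    this.position = this.content.length;
    return rest;
  }
}

// Hypothetical parser entry point: empty input is reported as a missing root.
function parseRootName(xml: string): string {
  const match = /<([A-Za-z][\w.:-]*)/.exec(xml);
  if (match === null) {
    throw new Error("Root element is missing");
  }
  return match[1];
}

// Rewind before parsing, so an earlier reader that left the shared stream at
// its end cannot cause the parser to see empty input.
function parseFromStart(stream: TextStream): string {
  stream.position = 0;
  return parseRootName(stream.readToEnd());
}
```

If a shared stream has already been read once, calling parseRootName on a second readToEnd() sees an empty string and throws, while parseFromStart still succeeds.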

The Windows Error Reporting telemetry contains an amazing wealth of information on how things can go wrong, and – in a backwards sort of way – gives a good roadmap on how to build a robust, reliable Windows Store app. We hope that providing this information is useful to you, and helps trigger your thinking and understanding about how your apps behave in the wild. We look forward to presenting more insights in future posts.
