Default feed endpoint issues 7 February 2013 - Post mortem

On February 7, 2013, we've experienced a partial downtime. The default endpoint for consuming NuGet feeds was slow to respond (+/- 2 minutes for a simple request) and caused issues with a lot of our users and their development teams and processes. We share your pain: we use MyGet for developing MyGet and we hate when things like this happen as much as you do. Our sincerest apologies for the inconvenience caused!

In this post, we'll have a look at what happened and the reason this happened. But before we dive in: we've had 3 hours of partial downtime on the default feed endpoint which most users have configured. We want to apologize by extending all paid subscriptions (Starter, Professional and Enterprise) with one additional month. No action is required on your end, your subscription has already been extended.

What happened?

At around 3:15 PM (CET), our monitoring alerted us about latency on the default feed endpoint going up. After a quick investigation and some additional time, latency went back to normal. One hour later, around 4:15 PM (CET), we saw the same happening again, as well as a number of support e-mails coming in.

/F/<your_feed_name>/api/v1 - the NuGet v1 API endpoint for consuming and pushing packages (still in use by Orchard CMS and some others)

</ul>

In this case, the default endpoint we offer our users was experiencing issues, the other endpoints were not. We decided to reply all support requests with the advice to switch to the v2 feed endpoint to make sure you could continue work. This requires some reconfiguration, but we figured it's better than having very slow access and timeouts connecting to your feeds.

We checked logs, diagnostics, monitoring reports, read and re-read our code, but could not pinpoint any reason for the default feed endpoint having such high latency. At 5:00 PM (CET) we decided to fail-over to our secondary datacenter. Being a separate environment, we wanted to see if the issue would haoppen there as well. It didn't! So we flipped the DR switch and took our primary datacenter out of the loop to be able to investigate the servers without causing additional issues for our users. This fail-over solved the issue for most of our users, who were able to work with MyGet's default feed endpoint from around 6:00 PM (CET), mind some DNS propagation for some users.

Digging through more logs, we discovered the issue that caused the default feed endpoint to experience this latency (we'll elaborate in a minute). At 8:00 PM (CET), we rolled out a hotfix to our secondary datacenter to make sure the issue would not resurface there either. We worked on a final fix for the next few hours and at 11:30 PM (CET), we flipped the switch back to our primary datacenter.

Why was the default feed endpoint so slow?

As you know, MyGet supports adding upstream feeds and allows to proxy them. This is a very useful feature as one feed can essentially bundle several others, so a developer team only has to configure one and still get packages from multiple.

MyGet treats all upstream package sources as external feeds, even if an upstream feed is a MyGet feed. The reason for that is security contexts can be diferent and we want to have all security aspects applied at any time. This does mean that a MyGet feed having an upstream MyGet feed as a package source is effectively using the default feed endpoint of this second feed, going oer our internal network through the load balancer. Next to security, this also allows us to fan out queries going to multiple package sources over multiple machines.

One of our users configured upstream package sources that were referencing themselves: feed A referencing feed B referencing feed A. We prohibit self-referencing feeds, but the scenario this user tried to achieve was unsupported and resulted in a reference loop on the default feed endpoint. The default feed endpoint was bringing itself down by fanning out queries to itself.

Previous incidents teached us to partition as many things as possible, and luckily we partition all alternative feed endpoints. This allowed us to direct support requests towards the other feed endpoints so users could keep using our service.

Solution

As a quick fix, we disabled the faulting package source so our service could be resumed in a stable fashion. Once that was done, we worked on a permanent solution to this issue. Do we want to block self-referencing feeds? Absolutely, as that makes no sense. Do we want to block adding other MyGet feeds as upstream package sources? No.

We want to keep supporting having upstream package sources that can be MyGet feeds, so that complex hierarchies of feeds and privileges can be configured. Quite a number of feeds do this today, having one feed proxying a series of upstream feeds.

How do we prevent the same problem from happening again? We are now actively tracking refering feeds and detecting loops in these references. Doing this, we keep supporting all scenarios that were possible before while protecting ourselves from shooting ourselves in the foot. Feed A can reference feed B, which can reference feed A. Users can make use of feed A as the entry pont, as well as feed B, and get all expected packages on the feed. We will make sure reference loops are removed from the feed hierarchy.

When a MyGet feed references a non-MyGet feed, we will also be adding an HTTP header to all upstream requests, X-NuGet-Feedchain, which contains the full list of refering feeds. Other NuGet feed implementations can make use of this header to detect reference loops on their end as well.

While we replied to all support requests in under 5 minutes with a workaround to use one of the other endpoints, we will be working on extending our status page at http://status.myget.org, showing details about all endpoints. We also want to be able to post status messages and tips & tricks like switching to the secondary feed endpoints on there.

Hardware fails, software has bugs and people make mistakes. We strive to mitigate as many of these factors and maintain a high quality service with fast, good support. We'll keep doing that to provide you with the best experience possible.