Our host prgmr had a networking outage starting around 6:45 Central time (GMT-5) this morning.
A switch rebooted and, after 30 minutes, entered a reboot loop, taking their hosting entirely offline.
They dispatched an employee to the data center to configure and install a backup switch.

When that was completed we came back online with everyone else.
Unfortunately, our Let’s Encrypt cert was due to renew during the networking outage.
So that didn’t happen and we came back up with an expired security certificate.
@alynpost reissued the cert and we bounced nginx to get it working.

Between the networking outage and the certificate, we were offline for about three hours.
No data was lost, and our VPS did not go down.
I don’t believe any part of this outage was malicious or has any security concerns.

Our hosting is donated by @alynpost, owner of prgmr and longtime lobster.
During the networking outage I asked not to receive any updates unless all their customers were back up and something extra was wrong with Lobsters.
I didn’t want to distract from their paying customers on what must be a very busy morning.
We have no plans to change anything about our hosting.

And as long as I’m writing an announce post many people will see, yesterday I fixed the bug that broke replies via mailing list mode.

I haven’t received any reports of users running NixOS, but typically folks would only reach out to me if they were having a problem. You can certainly boot up a live rescue image and run an install over the serial console. Depending on the distribution this either ‘just works’ or requires being told the console is on the serial port.

NixOS ISOs from their website do not enable the serial console by default, but building a custom ISO that does is easy enough. I did so a few days ago on Debian, using nix to create a NixOS installer for my APU2:
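For reference, a sketch of what that build looks like. This assumes nix is installed and NIX_PATH points at a nixpkgs checkout; the module path and kernel parameter are the standard ones, but adjust the baud rate for your board:

```shell
# Minimal NixOS installer ISO with the serial console enabled (sketch).
cat > iso.nix <<'EOF'
{ config, pkgs, ... }:
{
  imports = [ <nixpkgs/nixos/modules/installer/cd-dvd/installation-cd-minimal.nix> ];
  # The APU2 has no VGA output, so put the console on the first serial port.
  boot.kernelParams = [ "console=ttyS0,115200n8" ];
}
EOF

nix-build '<nixpkgs/nixos>' -A config.system.build.isoImage -I nixos-config=iso.nix
# The finished image lands under ./result/iso/
```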

I actually tried a few months ago, but gave up because I thought I’d figured out it was impossible. Seeing the link that @alynpost just posted, though, I might give it another go when I have some free time.

Some years ago, a friend taught me a simple trick which I used twice to install OpenBSD at providers where neither OpenBSD nor custom ISOs were directly supported: we would build or download a statically linked build of Qemu, boot the VPS into its rescue image, and start Qemu with the VPS’s actual hard disk as its disk and an ISO to boot from. That’s not too hard, and it works for pretty much anything where you’ve got a rescue system with internet access. I guess it should work for NixOS too, and maybe nix could even be used for the qemu build ;)
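A sketch of that trick, assuming the rescue image exposes the VPS disk as /dev/vda and you’ve already fetched a static qemu binary and an installer ISO into the rescue environment (file names illustrative):

```shell
# Boot the installer ISO inside qemu with the VPS's real disk attached.
# -nographic keeps the BIOS, boot loader, and installer on this
# terminal, so it works fine over an SSH session to the rescue system.
./qemu-system-x86_64 \
  -m 1024 -nographic \
  -drive file=/dev/vda,format=raw,if=virtio \
  -cdrom install.iso \
  -boot d
# When the install finishes, shut down the guest and reboot the VPS
# from its own disk.
```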

My pager went off this morning telling me it had one, then twelve, then forty alerts. That many problems all at once is a sure sign of a network, power, or monitoring-system failure. On investigating we discovered one of our switches had spontaneously rebooted. By the time we’d determined the switch had rebooted it was back up, so we decided to schedule a maintenance window for the weekend and replace it Saturday.

Alas, the switch didn’t wait. It rebooted again later in the morning and then kept doing it. We keep spare equipment at the data center and brought a new switch online.

For network failures like this we have a so-called out-of-band console: we’re able to log in to prgmr.com and debug network failures remotely. From this out-of-band connection I was able to look at the logs on our serial console and determine which equipment had rebooted. Further, I was able to see that all of our equipment was still powered on, meaning a switch failure rather than a power failure.

We do have some single points of failure in our network. We’ve been eliminating them as time permits but it is a work in progress.

We peer with HE and use NTT as a backup. In the event of an outage with HE we can get to the Internet with NTT instead. The switch that failed also had a ‘backup’, but not one quite as warm as our peering connection. As a result of this failure we’ll be experimenting with Multi-Chassis Link Aggregation (MLAG).

Do you have any experience with MLAG or other switch-level redundancy?

Do you have any experience with MLAG or other switch-level redundancy?

I have a bit of experience with HPE’s IRF (on Comware), but only in a lab environment. As far as I can tell once you set it up (not terribly difficult) it works as advertised. The downstream devices obviously should be connected to at least two physical switches in an aggregated switch domain, and preferably use some link aggregation, like LACP or various Linux link bonding techniques.
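On the Linux side, the host-facing half of that setup is ordinary 802.3ad bonding. A minimal iproute2 sketch, where the interface names and address are examples and each slave is cabled to a different switch in the MLAG/IRF pair:

```shell
# Create an LACP (802.3ad) bond and enslave two NICs; requires root.
ip link add bond0 type bond mode 802.3ad miimon 100 lacp_rate fast
ip link set eth0 down && ip link set eth0 master bond0
ip link set eth1 down && ip link set eth1 master bond0
ip link set bond0 up
ip addr add 192.0.2.10/24 dev bond0   # example address
```

With both switches presenting one logical LACP partner, either switch can reboot without the host losing its link.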

I’m curious why the LE certs are only set to renew within 3 hours of them expiring?

Edit: more specifically, is it a deliberate choice (and if so, please explain why?) or is it the ‘default’ for some esoteric ACME certificate renewal client that nobody else has ever used in production (and thus this weird choice)?

Looked into this. LE was not, in fact, set to renew from any cron job or other means. @alynpost did it manually in January and the expiration happened to fall during the outage. I’ve added it to the open LE issue.

It sounds to me that certbot was trying to renew during the network outage and, upon failing the ACME challenge, got into a stuck state or gave up renewing, until nginx was kicked and a new cert was manually issued.

What toolchain do you use? We had a wrinkle along the way: certbot was reporting the cert was not yet up for renewal. @nanny guessed that nginx needed to be restarted; I did that at the same time @alynpost force-reissued the cert, and one or both of those fixed the issue for us.
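For what it’s worth, if certbot is the toolchain, the usual way to cover both failure modes is a cron entry with a deploy hook: `certbot renew` is a no-op until the cert enters its renewal window (30 days before expiry by default), and the hook reloads nginx only when a renewal actually happened, so it never serves a stale cert. A sketch:

```shell
# Crontab entry (sketch): attempt renewal twice a day; reload nginx
# only on an actual renewal so it picks up the new certificate.
17 3,15 * * * certbot renew --deploy-hook "systemctl reload nginx"
```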

If you have the spare time, I’d love a PR to our ansible repo to move to acme-client. We haven’t had the expertise to change over with confidence, and it sounds like you’d avoid at least one failure mode we hadn’t thought of. :)

Network (switch) failures are particularly frustrating to deal with because we all work over IRC and self-host both a server and our bouncer. Since we can’t log in, it’s difficult to chat with each other: I was able to connect to my IRC client over our out-of-band network, but the connection between it and our bouncer was down.

Everyone scrambled for a bit to get on Freenode without using the bouncer. Once that was done and we’d figured out everyone’s temporary nick, we set about restoring service.

Internally we refer to that as a “label.” You’re correct, it doesn’t match the limitations of DNS. That label is used not only to set a hostname in .xen.prgmr.com, but also shows up in other systems. Most of our customers ignore this label in favor of setting up their own domain. In DNS you can set any valid A record for the IPv4 address associated with your VPS. If you set an rDNS record with us you’ve got identical flexibility.

All that said, that limitation may also exist for legacy reasons that are no longer valid. Did you try to enter a hostname it would not accept or is it that you noticed it was more restrictive than it might need to be? I’m happy to look at changing the constraints here.

Great and simple postmortem. Didn’t blame the problem on some ‘root cause’.

Unfortunately, our Let’s Encrypt cert was due to renew during the networking outage.

When people tell you there is a root cause of system failure, fire them immediately. This situation is a perfect and simple example where a cascade of failures led to an outage. Root causes are for ants.

There are some more notes in the comments here: the statement you quote was inaccurate, and we’d have been going down anyway for the cert expiring. It was just bad luck on timing that it happened during the outage.

I found the comment about doing the renew manually. I think if anything, this makes your example even stronger. Meaning there is no single failure here, and it took a combination of all of them to create the long downtime you saw. In other words, there is no root cause.

I think you guys did a great job and I even learned about a new (to me) old service, prgmr!

FWIW, for Let’s Encrypt I use acme.sh in a daily cronjob that renews the cert every 30 days (i.e. whenever it’s within 60 days of expiration). Then I have alerts that fire if the time runs down. It’s kind of aggressive, but it protects against non-outage risks like subtle API changes or bugs.
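An expiry alert like that can be sketched with openssl and GNU date; the function name is made up and the 14-day threshold is arbitrary:

```shell
# Print the number of days until the certificate in $1 expires;
# exit non-zero if fewer than $2 (default 14) days remain.
check_cert_days() {
  local cert=$1 min=${2:-14}
  local end epoch_end epoch_now days
  # "notAfter=Jun  1 12:00:00 2025 GMT" -> keep only the date part
  end=$(openssl x509 -enddate -noout -in "$cert" | cut -d= -f2)
  epoch_end=$(date -d "$end" +%s)
  epoch_now=$(date +%s)
  days=$(( (epoch_end - epoch_now) / 86400 ))
  echo "$days"
  [ "$days" -ge "$min" ]
}
```

Wire the exit status into whatever fires your alerts (a cron mail, a monitoring check, etc.).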