Intel Atom C2000 AVR54 Bug Strikes STH Today

Several STH readers will have noticed that we had downtime last evening. It sucks. Almost everything is back up and performing well. While we are still finishing the process, we wanted to give a mini-postmortem to the STH community on what went wrong and also a quick update on the next-generation infrastructure we are bringing online soon.

The Bad: Power Outage in the Data Center

This happened. We are in the same data center suite (for now) as Linode, Android Police, and others. Last evening, the power to the suite was cut. We do not have official word on what happened, but it took a few hours to resolve. When this suite goes down, we are not the only ones impacted.

This time was different. There were three additional failures that we found after the power came back on.

The Worse: Failures Two through Four

After we got confirmation that power was back on, and that Linode and others (who have larger support teams and status pages) were back online, we realized something was wrong: STH was still down. As power was restored, we actually had three failures to deal with:

The primary firewall would not boot. It appears to be dead due to the AVR54 bug, the Intel Atom C2000 clock signal degradation erratum that can leave affected systems unable to boot.

A switch connected to the emergency firewall was not powered on. We did not diagnose this on-site this morning as it will be replaced with a new 10GbE/40GbE switch next week anyway. A potential cause may be a dead PDU port.

A dual Intel Xeon E5-2699 V3 server decided it did not want to finish booting due to SATA SSD link issues.
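Since AVR54 affects the whole Atom C2000 family, a quick inventory check can flag potentially at-risk machines before they fail like our firewall did. This is a minimal sketch, not an official detection method; the model-string pattern is an assumption and should be checked against Intel's errata documentation:

```shell
# Hypothetical helper: flag CPU model strings that look like Atom C2000-series
# parts (e.g. C2358, C2558, C2758), the family subject to the AVR54 erratum.
check_avr54() {
  case "$1" in
    *C2[0-9][0-9][0-9]*) echo "WARN: possible AVR54-affected part" ;;
    *)                   echo "OK" ;;
  esac
}

# On a live Linux host you might feed it the local CPU model, e.g.:
#   check_avr54 "$(grep -m1 'model name' /proc/cpuinfo)"
check_avr54 "Intel(R) Atom(TM) CPU C2758 @ 2.40GHz"
```

Run across a fleet, this kind of check is a cheap way to know in advance which boxes you should pre-stage spares for.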

It took about 30 minutes to design and build an emergency firewall appliance and get software loaded then another 2 minutes to make coffee and a 15-minute drive to the data center. We spent a few minutes diagnosing the issue (firewall) to ensure it was indeed the problem before replacing.

At STH, we have an in-line firewall that is designed to fail closed, cutting traffic when it dies. There are other options, such as the bypass designs we covered in our Supermicro A1SRM-LN7F-2758 Review. Given that we do not sell products or services on STH, a few minutes of downtime is essentially a rounding error for our application, so the safer option is to endure a few minutes of downtime during maintenance.

This time, we simply swapped the old unit out for a Xeon D-1500 based solution and got everything running. It was faster than bringing everything up in another data center, then migrating everything back to the original infrastructure.

On the Mend: Cleaning-up and the Future

The STH hosting cluster is actually running smaller than usual in anticipation of an update over the next ten days or so. We normally only visit the hosting racks once every other quarter, usually when we find things like our What is the ZFS ZIL SLOG and what makes a good one piece that get us to upgrade immediately. The dual Intel Xeon E5-2699 V3 node was offline while the site was still running, but once it came back online and load was balanced again, STH page load times went from about 2.85s to 1.42s, which is a big improvement.
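For anyone who wants to spot-check load times the same way, curl's write-out variables give a rough approximation of total transfer time; the URL below is a placeholder, and a single sample like this ignores browser rendering, so treat it as a sketch rather than a full page-load measurement:

```shell
# Hypothetical spot check: print the total transfer time for one request.
# %{time_total} is curl's built-in write-out variable; the URL is a placeholder.
curl -o /dev/null -s -w 'total: %{time_total}s\n' "https://example.com/"
```

Repeating this a few times before and after a change (and averaging) is usually enough to confirm an improvement on the order of the 2.85s-to-1.42s swing described above.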

What is next for our environment? CPUs and systems are in, and we are preparing for an EPYC upgrade next. Stay tuned. The switch that went down was going to be replaced anyway, so it is getting a significant upgrade as well.

AMD EPYC 7000 Series Retail Box And Badge

Indeed. If you buy a retail AMD EPYC 7000 series processor, there is an enormous case badge in the package that will not fit a rackmount server bezel.

There are certainly things that can be designed better in our infrastructure. We have plenty of bandwidth, colo space, and all of the hardware one could want. Part of STH's early mission was to find the balance between overbuilding (and seeing failures from complexity) and underbuilding (and seeing failures like these). That is still a journey. We should not have left the firewall with the AVR54 bug installed. That was a risk we took, and it bit us.

Patrick has been running STH since 2009 and covers a wide variety of SME, SMB, and SOHO IT topics. Patrick is a consultant in the technology industry and has worked with numerous large hardware and storage vendors in Silicon Valley. The goal of STH is simply to help users find information about server, storage, and networking building blocks. If you have any helpful information, please feel free to post on the forums.

Lots of datacenters do not offer any UPS system. The kind of power one needs in a datacenter would require far too many diesel generators or batteries. These days, it is much cheaper to use HA methods than to try to avoid downtime at any cost.

I just lost an older Atom C2758 Supermicro 1U SYS-5018A-FTN4 last week at home. I had a power failure that lasted through the UPS reserves, and when power was restored the server would not boot. Using the IPMI interface I can control the BMC, but the iKVM stays black. Three years and dead.

Thanks for posting this; it's good to have some real-world documentation of these things, since typically people just have to deal with them silently behind the scenes and wonder if anyone else is running into the same thing.
Also, a link could be added to the 2017-02-07 article pointing to this one for reference.

“It took about 30 minutes to design and build an emergency firewall appliance and get software loaded then another 2 minutes to make coffee and a 15-minute drive to the data center. We spent a few minutes diagnosing the issue (firewall) to ensure it was indeed the problem before replacing.”

“Do you know if the outage was inside the facility? A properly designed datacenter shouldn’t take that big of an outage hit from a utility line failure. Even then, is there A+B power?”

Ugh, you’re right, but this is the Fremont datacenter. I’ve also had a bad experience here with power being down where a generator completely failed to kick in when it should have. This wiped out the entire Linode deployment in May 2015 ( see: https://status.linode.com/incidents/2rm9ty3q8h3x ), and I’ll never forget that very long night of waiting for things to come back online after 4-5 hours of downtime. It was nuts. Definitely one of the worst outages I remember.