Author: Daniel Havlik

In the coming week we will deploy an extensive OS update to our production environment, which currently consists of 41 physical hosts running 195 virtual machines.

Updates like this are prepared very carefully in many small steps, using development and staging setups that mirror our production systems in the data center exactly.

Nevertheless, we learned to expect the unexpected when deploying to our production environment. This is why we established the one/few/many paradigm for large updates. The remainder of this post talks about our scheduling mechanism to determine which machines are updated at what point in time.

Automated maintenance scheduling

The Flying Circus configuration management database (CMDB) keeps track of the times that are acceptable to each customer for scheduled maintenance. When a machine determines that a particular automated activity will be disruptive (e.g. because it makes the system temporarily unstable or requires a reboot), it requests a maintenance period from the CMDB based on the customer’s preferences and the estimated duration of the downtime. Customers are then automatically notified about what will happen at what time.
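As an illustration, the window-selection step could look roughly like the sketch below. This is not our actual CMDB code: the data model (windows as start/end pairs) and the function name are assumptions made up for this example.

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the CMDB's window selection; the real data
# model and API are not shown in this post.

def pick_window(preferred_windows, estimated_duration):
    """Return the first customer-approved window long enough for the activity."""
    for start, end in preferred_windows:
        if end - start >= estimated_duration:
            return start, end
    return None  # no acceptable slot; escalate to manual scheduling

# A customer who accepts maintenance in the early morning hours (UTC):
windows = [
    (datetime(2013, 1, 8, 2, 0), datetime(2013, 1, 8, 3, 0)),  # 1 h: too short
    (datetime(2013, 1, 9, 1, 0), datetime(2013, 1, 9, 4, 0)),  # 3 h: fits
]
slot = pick_window(windows, timedelta(hours=2))  # picks the second window
```

The notification step would then simply render the chosen slot into the e-mail that customers receive.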

This alone is not enough to manage a large update that affects all machines (and thus all customers), but it is the mechanical foundation for the next step.

Maintenance weeks

When we roll out a large update we prefer to add extra padding for errors, so we invented the “maintenance week”. For this, we can ask the CMDB to proactively schedule relatively large maintenance windows for all machines in a given pattern.

Here’s a short version of how this schedule is built when an administrator pushes the “Schedule maintenance week” button in our CMDB (all times in UTC):

Monday 17:00 – internal test machines (our litmus machines) and a small but representative set of customer machines that are marked as test environments get updated

Tuesday 17:00 – the remainder of customer test machines, up to 5% of untested production VMs, and 20% of the storage servers are updated

Wednesday 17:00 – 30% of the production VMs get updated and 30% of the storage servers are updated

Thursday 17:00 – the remaining production VMs and storage servers get updated

Saturday 09:00 – KVM hosts are updated and rebooted

Saturday 13:00 – the second router is updated

Once the schedule has been established, customers are informed by email about their assigned slots. An internal cross-check ensures that every machine in the affected location has a window assigned for this week.
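The staged ramp-up above can be sketched roughly as follows. The day-by-day percentages come from the schedule; everything else (the machine representation, how the representative test set is chosen, and the omission of storage servers and routers) is a simplified assumption for illustration.

```python
# Sketch of the staged rollout: test machines first, then a growing
# share of production VMs. Machine representation and test-set
# selection are simplified assumptions, not our actual CMDB code.

def maintenance_week(machines):
    """Assign every machine to a weekday slot, ramping up from test to production."""
    test = [m for m in machines if m["env"] == "test"]
    prod = [m for m in machines if m["env"] == "prod"]
    schedule = {
        "Mon 17:00": test[: max(1, len(test) // 2)],  # representative test set
        "Tue 17:00": test[max(1, len(test) // 2):] + prod[: len(prod) * 5 // 100],
        "Wed 17:00": prod[len(prod) * 5 // 100 : len(prod) * 35 // 100],  # next 30%
        "Thu 17:00": prod[len(prod) * 35 // 100 :],   # everything that is left
    }
    # Cross-check: every machine must have exactly one window this week.
    assert sum(len(slot) for slot in schedule.values()) == len(machines)
    return schedule
```

The final `assert` mirrors the internal cross-check mentioned above: scheduling fails loudly rather than silently leaving a machine without a window.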

This procedure causes the number of machines updated per day to rise from Monday (22 machines) to Thursday (about 100 machines). Any problem we find on Monday can be fixed on a small number of machines, and a bugfix can be rolled out so the issue is avoided entirely on later days.

However, if you read the list carefully you are probably asking yourself: why are customer VMs without tests updated early? Doesn’t this force customers without tests to experience outages more severely?

Yes. And in our opinion this is a good thing. First, in the earlier phases we have fewer machines to deal with: any breakage that occurs on Monday or Tuesday can be handled in a more timely fashion than unexpected breakage on Wednesday or Thursday, when many machines are updated at once. Second, if your service is critical, then you should feel the pain of not having tests (similar to the pain you experience if you don’t write unit tests and then try to refactor). We believe that “herd immunity” gives you a false sense of security; we would rather have unexpected errors occur early and visibly, so they can be met with a proper fix, instead of hiding them as long as possible.

We’re looking forward to our updates next week. Obviously we’re preparing for unexpected surprises, but what will they have in store for us this time?

We also appreciate feedback: How do you prepare updates for many machines? Is there anything we’re missing? Anything that sounds like a good idea to you? Let us know – leave a comment!

Alex and I are using this time of the year when most of our colleagues are still on holidays to perform maintenance on our office infrastructure.

To prepare for all the goodness we have planned for the Flying Circus in 2013, we decided to upgrade our internet connectivity (switching from painful consumer-grade DSL/SDSL connections to fibre, yeah!) and also to clean up our act in our small private server room. For that we decided to buy a better UPS and PDUs, add a new rack to get some space between the servers, and clean up the wiring.

Yesterday we prepared the parts we can handle ourselves, in preparation for the electricians coming in on Friday to start installing that nice Eaton 9355 8kVA UPS.

So, while the office was almost empty, the two of us managed to use our experience with data center setups to turn a single rack (pictures of which we’re too ashamed to post) into this:

Although the office was almost abandoned, those servers do serve a real purpose and we had to be careful to avoid major interruptions, as they handle:

our phone system and office door bell

secondary DNS for our infrastructure and customer domains

chat and support systems

monitoring with business-critical alerting

Here’s how we did it:

Power down all the components that are not production-related and move them from the existing rack (the right one in the front picture) to the new one. This was easier because we had already split the rack logically between “infrastructure development” and “office business” machines.

Move the development components (1 switch, 7 servers, 1 UPS) to the new rack. Wire everything up again (nicely!) and power it on. Use the power-up cycle to verify that IPMI remote control works. Also note which machines don’t boot cleanly (which we only found on machines whose kernels are under development anyway, yay).

Notice that the old UPS can’t actually carry all those servers’ load, and keep one server turned off until the new UPS is installed.

With space freed up in the existing rack, redistribute the servers there as well to make the arrangement more logical (routers, switches, and other telco-related equipment at the top). Turn off servers one by one and keep everyone in the office informed about the short outages.

Install new PDUs in the space we got after removing superfluous cables. Get lots of scratches while taking stuff out and in.

Update our inventory databases, take pictures, write blog post. 🙂

As the existing setup was quite old and had grown over time, we were pretty happy to be able to apply the lessons we learned in the intervening years and get everything cleaned up in less than 10 hours. We noticed the following things that we did differently this time (and have been doing in the data center for a while already):

Create bundles of network cables for each server (we use 4 per server), run them into the switch in a systematic pattern, and label each bundle with the server name at both ends. Colors indicate the VLAN.

Use real PDUs both for IEC and Schuko equipment. Avoid consumer-grade power-distribution.

Leave a rack unit between each component to allow operating without hurting yourself, the flexibility to pass wires (KVM) to the front, and to avoid temperature peaks within the rack.

Having over-capacity makes it easier to keep things clean, which in turn makes you more flexible and frees your mind to focus on the important stuff.

As the pictures indicate we aren’t completely done installing all the shiny new things, so here’s what’s left for the next days and weeks:

Wait for the electricians and Eaton to install and activate our new UPS.

Wire up the new PDUs with the new UPS and clean up the power wiring for each server.

Wait for the telco to finish digging and putting fibre into the street and get their equipment installed so we can enjoy a “real” internet connection.

All in all, we had a very productive and happy first working day in 2013. If this pace keeps up then we should encounter the singularity sometime in April.

We have been busy in the last months to improve the presentation of our hosting and operations services a lot – and if you attended the Plone Conference in Arnhem, you may have noticed some bits and pieces already: T-Shirts, nice graphics, a new logo, etc.

When pondering how to name our product, we quickly decided that just using the old “gocept.net” domain wasn’t good enough. As we are also ambivalent about the whole “cloud hype”, we were looking for something else: something specific, something with technology, something where people who know their trade do awesome stuff, something not for the fearful but for people with vision and grand ideas.

What we found was this:

We call it the “Flying Circus” – for fearless people doing exactly what is needed to boost the performance, security, and reliability of your web application!

All this is just getting started and we will show a lot more at PyConDE next week. Or, if you cannot make it there, register for more information at flyingcircus.io!

Background: we have had terrible support experiences with DELL over the last 4-5 years or so, and today I had a single really good one. We have started slowly moving to a different vendor and won’t change that decision because of this one experience.

Our situation: we are currently fighting a subtle issue in our data center: spontaneous reboots of physical servers. It happens only rarely, but it is a bit of an annoyance. We have now seen 10 cases over the last year and have started to investigate. The problem is that almost all machines rebooted only once, so we can never find an actual cause.

While getting an overview of all restarts (machine, time, hardware model, role, bios version) we had to contact DELL ProSupport to figure out a contradictory statement on new BIOS versions.

First, I got through directly to the technician, and he actually (for once) had our machine’s service tag on his desk. I explained that I needed a specific piece of information and that I was investigating a broader issue that didn’t seem to be related to a single machine. He picked up on that, passed me the information, followed along as I built and corrected our model of the fault, and gave helpful comments and additional data from their support experience with those machines.

What surprised me is that he gave me information about things I expected to be selling points of DELL’s machines: management features, and access to real support expertise instead of scripted, technologically challenged call-center zombies. Again: kudos to the supporter who helped me today.

Here are the positive surprises:

The DELL R610 and R510 iDRAC Express cards have SSH and web UIs for accessing some of the fancier features. I even finally found the power meter!

There seems to be a tool called “Repository Manager” which can create a bootable ISO that includes all firmware updates for all the machines you select. Cool! However, it seems to need Windows 2008 (WTF?). Even on Windows

Maybe (I didn’t fully understand this) the lifecycle controller can perform all required firmware/BIOS updates via FTP directly when you enter it during boot. (Unfortunately, you need to reboot just to find out whether you need updates.)

Recapitulating this phone call and the information I got, I reached some conclusions:

Big, big personal thanks to the DELL supporter, you made my day! (And you know who you are!)

Why do I get huge amounts of useless manuals that I just throw away, but no readable, accessible information telling me that the iDRAC Express has HTTP and SSH support?

Why are all Linux updates wrapped, for no reason, into binaries that require Red Hat stuff? All the tools are there on other distributions. Can you please release things so that grown-ups can use them?

Can we please have an accessible, platform-independent way to find out whether firmware updates are pending? And whether any update in the chain is urgent?

I see myself confirmed in the view that hardware vendors are just terrible at software. Even your supporter is trained by now to think that having to hit a button twice isn’t a bug but a feature. Come on!

We knew that the Express cards do not support VGA redirection (we generally use IPMI SOL), but as far as I can tell that leaves only “mount a remote disk” and “redirect VGA” as features of the bigger iDRAC option. And that option costs around 300 EUR more.

Given how hard it is to update firmware on a truly free platform, I wonder why those features cost extra. It seems DELL supports Microsoft’s and Red Hat’s business models by forcing customers into those options.

Lastly, it’s nice to have had an actually good experience with DELL support for once. But given our overall experience, we’re more than happy to be migrating to Thomas Krenn now.