This is our regular monthly post to keep you up to date with the latest releases to our server monitoring product, Server Density.

Latest value widget

A new widget is available on the dashboard which will show you the latest, current value for any metric. It will also display the average value over the time period the dashboard is configured for e.g. the 24 hour average or 1 hour average, with a sparkline graph in the background.

New official plugins for entropy, inodes, ProFTP, Zombies and Zookeeper

We’re in the process of retiring our old plugin directory and rewriting many old community plugins into officially supported and updated versions. These are available on Github and we’re accepting pull requests for improvements and changes, as well as brand new plugins.

The goal is to make it easier to install by just dropping the file into your agent plugin directory and ensure these plugins are kept up to date and fully supported by us.

New API documentation + Dashboard API

We’ve updated and expanded our API documentation with a new template and example calls for Python, Ruby and Curl.

What’s coming next?

Over the last few months we have been working on moving our alerts processing backend from Celery + MongoDB to Storm + Kafka, which sets the foundations for a range of new alerting functionality we’ll be releasing from March. Tagging, which was released in December, is a key part of this functionality. Before March, we’ll be releasing more plugins and full process lists within the UI.

The last few months of development at Server Density have been focused on a large number of small improvements, particularly targeted at fixing known issues with the dashboard and ensuring we tackle lots of minor bugs and complaints. This means you’ll find the dashboard is more solid, performance across the whole app has improved and we have resolved bugs with our cross browser support on Firefox, Chrome and Safari.

We also added some new functionality which sets the foundations ready for some major releases in 2015:

Permissions & tagging

You may want to allow different teams or customers access to your account but restrict them to only be able to view and manage specific servers or web checks. Using tags and permissions you can now do this. Tag a device or web check and then tag an associated user, and that user will then only be able to access those specific devices or web checks. There’s a guide here.

Tagging is currently only used for permissions but is the foundation for more tag based functionality such as alerting on tags which will be released at the start of 2015.

New official plugins for MongoDB, Docker, Nginx, Nagios and Temperature

We’re in the process of retiring our old plugin directory and rewriting many old community plugins into officially supported and updated versions. These are available on Github and we’re accepting pull requests for improvements and changes, as well as brand new plugins.

The goal is to make it easier to install by just dropping the file into your agent plugin directory and ensure these plugins are kept up to date and fully supported by us.

Bandwidth aggregation calculations

We’ve created a new tool – sd-bw – which uses our API to aggregate bandwidth statistics for your servers (either individually or for every server in a group) to give you a “total amount transferred” or “total bandwidth used” figure for a specific timerange.

Vertically resizable dashboard graphs

You can now resize dashboard graph widgets both horizontally and vertically. Just hover over the bottom or right side of a widget, then click and drag to resize.

What’s coming next?

The beginning of 2015 will have a series of feature releases including enhancements to alerting, process list monitoring and better integration into cloud provider APIs e.g. CloudWatch metrics. If you have any ideas for improvements, let us know!

A common misconception in the industry is the notion that open source monitoring software is free. This is true if you’re looking at licensing alone, but there are far more factors to take into account than that. As we position Server Density as a Nagios alternative, we decided to work out exactly how expensive Nagios is in comparison to our own server monitoring.

Server Density’s job as a competitor is to highlight some of the problems and difficulties of using Nagios, without damning the open source community and misleading anyone. Anticipating some critique of our calculations we’ve decided to write this article on our ‘workings’. It also gives you the ability to engage with us about the calculations – please comment below if you think we’ve gone wrong. If you can convince us, we’ll happily amend our math. For now though, here’s how we’ve worked it out.

Before you start

You’ll notice a common theme across many of these headings: the time Nagios takes to set up and use. In a world where time and money are completely unrelated, this is how the relationship between Server Density and Nagios looks:

Nagios saves you money.

Server Density saves you time.

Or

Nagios costs you time.

Server Density costs you money.

But of course this isn’t true. The age-old idiom “time is money” couldn’t be more applicable to the world of fast moving tech startups, so:

Nagios costs you money.

Server Density costs you money.

Those principles form the basis of our Nagios cost calculator, in which we’ve put a monetary value on Nagios based on the time you’d expect to spend setting up and maintaining the open source tool. You can evaluate the cost of a basic monitoring setup if you’d like, but if you need to replicate our monitoring infrastructure it’s best to keep all of the options ticked.

Nagios Cost Calculator

Nagios Hardware Requirements

A Nagios server isn’t cheap to run. It requires a large amount of processing power, especially if you have a lot of servers:

Under 50 servers

To monitor anything under 50 servers we suggest something similar to the Amazon m3.medium instance type. At the time of writing (Nov, 2014) that’ll set you back $0.070 an hour, which totals to a yearly cost of $613.

Over 50 servers

Monitoring more than 50 servers will demand more from your Nagios server, so you’ll need to upgrade. For this we’d suggest an m3.xlarge instance. At the time of writing (Nov, 2014) that’ll set you back $0.280 an hour, which totals to a yearly cost of $2452.

We’ve used AWS as the cost benchmark as they’re constantly pushing costs down and are the most popular provider. We didn’t consider reserved instances because they add some complexity to calculating the cost due to the pre-purchase fees, which add to the overall setup cost of Nagios you don’t get with Server Density.

Need redundancy?

If you take monitoring seriously you’ll want to keep redundancy checked. Your monitoring needs to be more reliable than what you’re actually monitoring! Server Density is deployed with full redundancy within our data centers, combined with the geographic redundancy of deploying into multiple facilities. To replicate that you’ll need at least 2 servers in each of 2 data centers, i.e. 4 servers in total. If you’re monitoring under 50 servers that’s $613 * 4 = $2,452 a year; if you’re monitoring over 50 it’s $2452 * 4 = $9,808 a year.

This level of redundancy is necessary to ensure you can survive the failure of a node within one facility as well as the failure of the entire data center. Of course, this assumes you know how to set up Nagios in a redundant, load balanced cluster.

Once you get over 50 servers it’s totally unacceptable to be running just a single Nagios server, so our Nagios calculator always applies redundancy above 50 servers.
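The hardware figures above can be reproduced with a few lines of arithmetic. This is only a minimal sketch, not the calculator itself: the function name is hypothetical, the treatment of exactly 50 servers is an assumption (the article only says "under" and "over" 50), and yearly prices are truncated to whole dollars to match the numbers quoted above.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

# On-demand hourly prices quoted in the article (Nov 2014).
HOURLY_PRICE = {"m3.medium": 0.070, "m3.xlarge": 0.280}


def yearly_hardware_cost(servers_monitored, redundant):
    """Estimate the yearly cost in USD of the Nagios servers themselves."""
    # Assumption: exactly 50 monitored servers is treated as the larger tier.
    instance = "m3.medium" if servers_monitored < 50 else "m3.xlarge"
    # Truncate to whole dollars, as the article's figures do.
    per_server = int(HOURLY_PRICE[instance] * HOURS_PER_YEAR)
    # Redundancy means 2 servers in each of 2 data centers; it is always
    # applied once the larger tier is reached.
    nagios_servers = 4 if (redundant or servers_monitored >= 50) else 1
    return per_server * nagios_servers


print(yearly_hardware_cost(10, redundant=False))
print(yearly_hardware_cost(10, redundant=True))
print(yearly_hardware_cost(100, redundant=True))
```

Reserved instance pricing is left out here for the same reason it is left out of the article.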

How long does Nagios take to set up?

We’ve estimated the initial monitoring setup at 2 working days (16 hours), because it takes time to go through the installation process and, in particular, to get the initial config right. It can be shorter if you know what you’re doing or longer if you’ve never done it before.

How long does Nagios take to deploy across multiple servers?

Once you’ve spent the 16 hours setting your Nagios server up, you still need to consider how long it takes to install the monitoring agent(s). There’s no shortage of config files when you’re running Nagios. It’s usually the initial setup that takes the longest, with each additional server only taking a few minutes to get up and running.

Nagios alerts configuration

Monitoring alerts need to be reliable and flexible. By default, Nagios limits alert delivery to email so it takes extra time to set up SMS alerts or configure push notifications on your phone, and the services you’ll want to use are often not free. SMS gateway reliability is important, and for push notifications you need apps or some 3rd party that supports generic notifications. Again, reliability has to be monitored. With Server Density, all of this is taken care of for you at no extra cost, even down to free SMS credits.

As part of the Nagios cost calculator, we estimate that setting up an alerting system comparable to the one we offer will take 8 hours of your time. We have ignored the cost of the external services themselves, such as SMS credits.

Nagios Graphing

It will take you a further 8 hours to install a plugin like nagiosgraph or configure an entirely separate system such as Cacti or Graphite – and even then, here’s the same data presented by Nagios and Server Density:

Nagios Security

Keeping everything nice and secure is essential. It takes time to get some basic hardening on any server and we’ve budgeted a couple of hours for this. What we don’t include is ongoing security assessments and patches that we take care of for you with Server Density. This is particularly important if a piece of software is installed on every single one of your servers or is a key part of your systems…such as monitoring.

Monitoring your monitoring

Without redundancy you’re going to struggle to monitor the performance of your Nagios server, because you need at least 2 servers to monitor each other. In that case you’ll need to use a service like Server Density on our 1 server plan to make sure everything is okay with your single Nagios server.

Nagios Maintenance

By default our calculator is set to allow for 12 hours of maintenance every year. That’s one hour a month fiddling with preferences, tweaking configs, fixing problems, upgrading or even thinking about improvements to your monitoring setup.

Incident management

We assume you will spend 6 hours every year (30 minutes a month) on incidents relating to your Nagios monitoring servers. This could be a hardware failure, instance retirements, whole region/data center reboots, instance upgrades, dealing with backups or clearing out metrics data from disk space.
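To see how the time estimates so far turn into money, here is a rough sketch that adds up the hours quoted in the sections above and multiplies them by an hourly rate. The function name and the hourly rate are hypothetical, "a couple of hours" of security hardening is taken as 2, and the few minutes of agent deployment per server are left out.

```python
# Hour estimates quoted in the article.
ONE_OFF_HOURS = {"setup": 16, "alerting": 8, "graphing": 8, "security": 2}
YEARLY_HOURS = {"maintenance": 12, "incidents": 6}


def time_cost(hourly_rate, years=1):
    """Cost of the time spent on Nagios over `years`, in the rate's currency."""
    one_off = sum(ONE_OFF_HOURS.values())    # spent once, up front
    recurring = sum(YEARLY_HOURS.values())   # spent every year
    return (one_off + recurring * years) * hourly_rate


# Example with a made-up rate of 50 per hour.
print(time_cost(50))
print(time_cost(50, years=2))
```

Because the one-off hours dominate, the first year is the expensive one; later years only add the maintenance and incident hours.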

Worldwide locations for web checks

If you want availability monitoring, then your best bet is to pay for an external provider like Server Density or Pingdom. With such a provider, a ’50 checks’ account will cost ~$250/year (as of Nov 2014).

Setting up geographically dispersed monitoring locations and scheduling checks amongst them all is non-trivial, and is something you get as part of the product with Server Density.

Most of the time settings in the calculator are defaults and can be changed based on how long you think things would take you. We have tried to be fair to Nagios with our time estimations, because after all cost isn’t the only way we think we have an advantage over the open source competition. There are some cases when Nagios is cheaper (e.g. if you don’t value your time highly or with tiny numbers of servers…but then why are you setting up a complex monitoring tool like Nagios in the first place?!) but with all the functionality Server Density provides, we think we have a pretty good offer!

Thanks for taking the time to read through our justification. If you’d like to join the discussion please leave a comment below; this reddit thread is also home to some interesting comments. We love reading and responding to your thoughts. Oh, and if you’re sick of Nagios, consider us next time you’re looking for a Nagios Alternative.

Outages and downtime are inevitable. Designing your systems to handle failure is a key part of modern infrastructure architecture which makes it possible to survive most problems, however there will be incidents you didn’t think about, software bugs you didn’t catch and other events which result in downtime for your service.

Microsoft, Amazon and Google spend billions of dollars every quarter and even they still have outages. How much do you spend?

There are some companies who constantly seem to have problems and suffer from it unnecessarily. Regular outages ultimately become unacceptable but if you adopt a few key principles and design your systems properly, the few times when you do have service incidents you can be forgiven by customers.

Step 1: Planning

If critical alerts result in panic and chaos then you deserve to suffer from the incident! There are a number of things you can do in advance to ensure that when something does go wrong, everyone on your team knows what they should be doing.

Use proper config management, be it Puppet, Chef, Ansible, Salt Stack or some other system, so you can make mass changes to your infrastructure in a controlled manner. It also helps your team understand novel issues because the code that defines the setup is easily accessible.

Unexpected failures

Be aware of your whole system. Unexpected failures can come from unusual places. Are you hosted on AWS? What happens if they suffer an outage and you need to use Slack or Hipchat for internal communication? Are you hosted on Google Cloud? What happens if your GMail is unavailable during a Google Cloud outage? Are you using a data center within the city you live in? What happens if there’s a weather event and the phone service is knocked out?

Step 2: Be ready to handle the alerts

Some people hate being on call, others love it! Either way, you need a system to handle on call rotations, escalating issues to other members of the team, planning for reachability and allowing people to go off-call after incidents. We use PagerDuty on a weekly rotation through the team and consider things like who is available, internet connectivity, illness, holidays and looping in product engineering so issues waking people up can be resolved quickly.

Step 3: Deal with it, using checklists

Have a defined process in place ready to run through whenever the alerts go off. Using a checklist removes unnecessary thinking so you can focus on the real problem, and ensures key actions are taken and not forgotten. Have a channel for communication both internally and externally – there’s nothing worse than being the customer of a service that is down and having no idea if they’re working on it or not.

Step 4: Write up a detailed postmortem

This is the opportunity to win back trust. If you follow the steps above and provide accurate, useful information during the outage so people know what is going on, this is the chance to write it up, explain what happened, what went wrong and crucially, what you are going to do to prevent it from happening again. Outages highlight unknown system flaws and it’s important to tell your users that the hole no longer exists, or is in the process of being closed.

Apache is perhaps the most well known and widely deployed web server, having originally been released back in 1995 and still in use on a large number of sites (although losing ground to NGINX). As an important part of the classic LAMP stack, it is a critical component in your web serving architecture, so if you’re not currently monitoring Apache, you should be.

Enabling Apache monitoring with mod_status

Most of the tools for monitoring Apache require the use of the mod_status module. This is included by default but needs to be enabled. You will need to specify an endpoint in your Apache config:

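<Location /server-status>
    # Serve the mod_status report at this path
    SetHandler server-status
    # Only allow requests from the server itself (Apache 2.2 syntax;
    # on Apache 2.4 use "Require local" instead of the next three lines)
    Order Deny,Allow
    Deny from all
    Allow from 127.0.0.1
</Location>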

This will make the status page available at http://localhost/server-status on your server. We have a full guide to configuring this. Be sure to enable the ExtendedStatus directive to get full access to all the stats.
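If you want to read the status page from a script, mod_status also offers a machine-readable version when you append ?auto to the URL, returning one "Key: Value" pair per line. The following is a minimal sketch of parsing that output, assuming the endpoint configured above; the function names are our own for illustration and this is not how the Server Density agent is implemented.

```python
from urllib.request import urlopen


def parse_status(text):
    """Turn mod_status machine-readable output into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip any line that is not a "Key: Value" pair
            stats[key] = value.strip()
    return stats


def fetch_status(url="http://localhost/server-status?auto"):
    """Fetch and parse the status page; must run on the server itself."""
    with urlopen(url, timeout=5) as response:
        return parse_status(response.read().decode("utf-8", errors="replace"))


# Hand-written sample so the parser can be tried without a running server.
sample = "Total Accesses: 1234\nBusyWorkers: 3\nIdleWorkers: 7\n"
stats = parse_status(sample)
print(int(stats["BusyWorkers"]), int(stats["IdleWorkers"]))
```

Values are kept as strings because fields such as the scoreboard are not numeric; convert the ones you need.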

Monitoring Apache from the command line

Once you have enabled the status page and verified it is working above, you can make use of command line tools to monitor the traffic on your server in real time. This is useful for debugging issues and examining traffic as it happens.

The apache-top tool is a popular method of achieving this. It is often available as a system package e.g. apt-get install apachetop but can also be downloaded from the source, as it is only a simple Python script.

Apache monitoring and alerting – Apache stats

Using apache-top is useful for real time debugging and examining what is happening on your server right now, but it is less useful if you want to collect statistics over a period of time. This is where a monitoring product such as Server Density will come in. Our monitoring agent supports parsing the Apache server status output and can give you statistics on requests per second and idle/busy workers.

Apache has several process models but the most common is to have worker processes running idle waiting to service requests. As more requests come in then more workers will be launched to handle them, up to a configured maximum. At that point the requests will be queued and your visitors will experience delays. This means it’s important not just to monitor the raw requests per second but also how many idle workers you have.

A good way to approach configuring Apache alerts is to understand what kind of baseline traffic your application experiences and set alerts around this e.g. alert if the stats are significantly higher (indicating a sudden traffic spike) and if the values are suddenly significantly lower (indicating a problem preventing traffic somewhere). You could also benchmark your server to find out at what traffic level things start to slow down and the server becomes too overloaded – this will then act as a good upper limit which you can trigger alerts at too.
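One way to express "significantly higher or lower than the baseline" is to compare the current value against the mean and standard deviation of recent samples. This is just an illustrative sketch: the function name is hypothetical and the choice of 3 standard deviations is an assumption you would tune to your own traffic.

```python
from statistics import mean, stdev


def check_against_baseline(history, current, tolerance=3.0):
    """Return 'high', 'low' or 'ok' for `current` versus recent history.

    `history` needs at least two samples, e.g. requests per second
    collected over a normal period.
    """
    baseline = mean(history)
    spread = stdev(history)
    if current > baseline + tolerance * spread:
        return "high"  # possible traffic spike
    if current < baseline - tolerance * spread:
        return "low"   # possible problem preventing traffic
    return "ok"


requests_per_second = [100, 110, 90, 105, 95]
print(check_against_baseline(requests_per_second, 200))
print(check_against_baseline(requests_per_second, 10))
print(check_against_baseline(requests_per_second, 104))
```

A fixed upper limit found by benchmarking can be checked alongside this with a simple comparison.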

Apache monitoring and alerting – server stats

Monitoring Apache stats like requests per second and worker status is useful to keep an eye on Apache itself, but its performance will also be affected by how overloaded the server is. Ideally you will be running Apache on its own dedicated instance so you don’t need to worry about contention with other applications.

Web servers are generally limited by CPU and so your hardware spec should offer the web server as many CPUs and/or cores as possible. As you get more traffic then you will likely see the CPU usage increase, especially as Apache workers take up more CPU time and are distributed across the available CPUs and cores.

CPU % usage itself is not necessarily a useful metric to alert on because the values tend to be per CPU or per core and you may have many cores. It’s more useful to set up monitoring on average CPU utilisation across all CPUs or cores. Using a tool such as Server Density, you can visualise this and configure alerts so you can be notified when the CPU is overloaded – our guide to understanding these metrics and configuring CPU alerts will help.

On Linux, overall CPU demand is also summarised by another system metric called load average. It is a decimal number rather than a percentage and allows you to understand load from the perspective of the operating system i.e. how many processes are running or waiting for access to the CPU. The recommended threshold for load average therefore depends on how many CPUs and cores you have – our guide to load average will help you understand this further.
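Because the right threshold depends on the number of cores, it can help to divide the load average by the core count before comparing it to anything. Here is a small sketch using Python's standard library; the helper names are hypothetical and the threshold of 1.0 per core is a common rule of thumb rather than a recommendation from this article.

```python
import os


def load_per_core(load_average, cores):
    """Normalise a load average by the number of CPU cores."""
    return load_average / cores


def is_overloaded(load_average, cores, threshold=1.0):
    """True if there is more runnable work than cores to run it."""
    return load_per_core(load_average, cores) > threshold


try:
    # os.getloadavg() returns the 1, 5 and 15 minute averages on Unix.
    one_minute, five_minute, fifteen_minute = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"15 min load per core: {load_per_core(fifteen_minute, cores):.2f}")
except (AttributeError, OSError):
    print("Load average is not available on this platform")
```

Alerting on the 15 minute figure rather than the 1 minute one avoids reacting to short spikes.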

Monitoring the remote status of Apache

All of the above metrics monitor the internal status of Apache and the servers it is running on, but it is also important to monitor the experience your users are getting. This is achieved by using external status and response time tools – you want to know if your Apache instance is serving traffic to different locations around the world (wherever your customers are) and what response times those users are seeing. You will then know at what stage you need to add more capacity, either by increasing the capabilities of the Apache server or by adding more servers into a load balanced cluster.

This is easy to do with a service like Server Density because of our in-built website monitoring. You can check the status of your public URLs and other endpoints from custom locations and get alerts when performance drops or there is an outage.

We’ve been a bit quiet over the last few months but have still been working on improvements and new functionality to our server and website monitoring product, Server Density. This post summarises what we added over the summer and what’s coming up soon.

Log search beta

One of the first things you do when responding to an alert or tracking down performance problems is look at the server logs. Current log management products are expensive and complex to set up, so we’re pleased to announce the beta of our log search functionality.

Log search uses the existing Server Density agent to tail your logs and make them searchable from within your account. There’s a new dedicated search view so you can search by device, or you can view the logs from individual device views. Later, logs will automatically be displayed as part of a new, upcoming alert incident view.

Snapshots

Click on any data point on your device graphs and then click the Snapshot link, and it will take you through to a view of what was happening on that server at that exact point in time. You can also click the Snapshot tab to go to the latest snapshot and then navigate backwards and forward through each time point.

Linux agent 1.13.4

A number of fixes have been released as part of the latest Linux agent release, including better handling of plugin exceptions and more standards compliance for init scripts. MongoDB over SSL is also now supported. See the release notes.

Puppet module improvements

App performance improvements

A lot of work has been done behind the scenes to improve the performance of the product generally. This ranges from optimising requests and connections in the UI, upgrades to the hardware powering the service to moving all our assets onto a CDN. We have a few more improvements still to release but this all goes towards our goal of having response times as close to instantaneous as possible.

Onboarding and help popups

We replaced our old app tour with new in-app popup bubbles to help you learn more about functionality. Blank slates have been redesigned and we have more improvements to help show off some of the great functionality coming soon.

Redesigned multi factor authentication setup

The flow for setting up a new multi factor token has been redesigned to make it clearer how to proceed. We highly recommend enabling this for extra security – passwords are no longer enough!

Improved cloud actions menu

Actions taken within Server Density are separated from actions taken on the Cloud Provider level to ensure commands aren’t sent accidentally.

Delete confirmations

Previously it was too easy to take the delete actions which could lead to accidentally deleting a device. We’ve improved the confirmation requirements for this.

Auto refreshing graphs

All graphs, on the device overview and on the dashboard, now auto refresh so you can keep the window open and see the data show up immediately.

What’s coming next?

We’ll be returning to our monthly post schedule for “What’s new” as we start releasing some of the things we’ve been working on over the last few months. This includes permissions and a range of new alerting functionality, starting with tag based alerts and group recipients. Lots of interesting new functionality to be announced before the end of the year!

Today we’re releasing the Server Density integration into Google Compute Engine as an official Google Cloud Platform Technology Partner. Server Density works across all environments and platforms and is now fully integrated into Google’s cloud infrastructure products, including Compute Engine and Persistent Disks, to offer alerting, historical metrics and devops dashboards to Google customers.

Server Density customers can connect their Google Cloud accounts to automatically monitor and manage instances across Google data centers alongside existing environments and other cloud providers. Many customers will run systems across multiple providers in a hybrid setup, so Server Density is uniquely placed to help with that because even though we have specialist integration into Google, it works well anywhere – cloud, hybrid and on-prem.

$500 credit for Google/Server Density customers

Server Density normally starts at $10/m to monitor Linux, Windows, FreeBSD and Mac servers but Google Cloud customers can monitor up to 5 servers for free for life (worth over $500/year). Google is also offering Server Density customers $500 in credits to trial Google Cloud Platform. To find out more and sign up, head over to our website for details.

September 9, 2014, 2:16 pm

Update: We hosted a live Hangout on Air with Paul Done from MongoDB discussing how to monitor MongoDB. We’ve made the slides and video available, which can be found embedded at the bottom of this blog post.

All this means we keep a very close eye on how our MongoDB clusters are performing, with detailed monitoring of all aspects of the systems. This post will go into detail about the key metrics and how to monitor your MongoDB servers.

Key MongoDB monitoring metrics

There is a huge range of different things you should keep track of with your MongoDB clusters, but only a few that are critical. These are the monitoring metrics we have on our critical list:

Oplog replication lag

The replication built into MongoDB through replica sets has worked very well in our experience. However, by default writes only need to be accepted by the primary member and replicate down to other secondaries asynchronously i.e. MongoDB is eventually consistent by default. This means there is usually a short window where data might not be replicated should the primary fail.

This is a known property, so for critical data, you can adjust the write concern to return only when data has reached a certain number of secondaries. For other writes, you need to know when secondaries start to fall behind because this can indicate problems such as network issues or insufficient hardware capacity.

Replica secondaries can sometimes fall behind if you are moving a large number of chunks in a sharded cluster. As such, we only alert if the replicas fall behind for more than a certain period of time e.g. if they recover within 30min then we don’t alert.
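The "only alert if it stays bad for a while" rule can be sketched as a check over a series of timestamped samples. This is an illustration under assumptions rather than how Server Density implements alert delays: the function name, the sample format and the 60 second lag threshold are all hypothetical.

```python
from datetime import datetime, timedelta


def should_alert(samples, threshold, duration=timedelta(minutes=30)):
    """Decide whether a breach has lasted long enough to alert.

    `samples` is a chronological list of (timestamp, value) pairs. We
    alert only if the value has been above `threshold` continuously for
    at least `duration` up to the latest sample.
    """
    if not samples:
        return False
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # a new breach begins here
        else:
            breach_start = None  # recovered, so reset the clock
    latest_time = samples[-1][0]
    return breach_start is not None and latest_time - breach_start >= duration


start = datetime(2014, 9, 9, 12, 0)
ten = timedelta(minutes=10)
still_lagging = [(start + ten * i, 120) for i in range(4)]
recovered_once = [(start, 120), (start + ten, 120), (start + ten * 2, 5), (start + ten * 3, 120)]
print(should_alert(still_lagging, threshold=60))
print(should_alert(recovered_once, threshold=60))
```

The same check works for any metric where short breaches are expected, not just replication lag.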

Replica state

In normal operation, one member of the replica set will be primary and all the other members will be secondaries. This rarely changes and if there is a member election, we want to know why. Usually this happens within seconds and the condition resolves itself but we want to investigate the cause right away because there could have been a hardware or network failure.

Flapping between states should not be a normal working condition and should only happen deliberately e.g. for maintenance or during a valid incident e.g. hardware failure.

Lock % and disk i/o % utilization

As of MongoDB 2.6, locking is on a database level, with work ongoing for document level locking in MongoDB 2.8. Each write locks its whole database, so if this happens too often you will start seeing performance problems as other operations (including reads) get backed up in the queue.

We’ve seen high effective lock % be a symptom of other issues within the database e.g. poorly configured indexes, no indexes, disk hardware failures and bad schema design. This means it’s important to know when the value is high for a long time, because it can cause the server to slow down (and become unresponsive, triggering a replica state change) or the oplog to start to lag behind.

However, it can trigger too often, so you need to be careful. Set long delays e.g. alert only if the lock remains above 75% for more than 30 minutes. If you also have alerts on replica state and oplog lag, you can actually set this as a non-critical alert.

Related to this is how much work your disks are doing i.e. disk i/o % utilization. Approaching 100% indicates your disks are at capacity and you need to upgrade them i.e. spinning disk to SSD. If you are using SSDs already then you can provide more RAM or you need to split the data into shards.

Non-critical metrics to monitor MongoDB

There are a range of other metrics you should keep track of on a regular basis. Even though they might be non-critical, they will help avoid issues escalating to critical production problems if dealt with and investigated.

Memory usage and page faults

Memory is probably the most important resource you can give MongoDB and so you want to make sure you always have enough! The rule of thumb is to always provide sufficient RAM for all of your indexes to fit in memory, and where possible, enough memory for all your data too.

Page faults are related to memory because a page fault happens when MongoDB has to go to disk to find the data rather than memory. More page faults indicate that there is insufficient memory, so you should consider increasing the available RAM.

Connections

Every connection to MongoDB has an overhead which contributes to the required memory for the system. This is initially limited by the Unix ulimit settings but then will become limited by the server resources, particularly memory.

High numbers of connections can also indicate problems elsewhere e.g. requests backing up due to high lock % or a problem with your application code opening too many connections.

Shard chunk distribution

MongoDB will try and balance chunks equally around all your shards but this can start to lag behind if there are constraints on the system e.g. high lock % slowing down moveChunk operations. You should regularly keep an eye on how balanced the cluster is.

Tools to monitor MongoDB

Now you know the things to keep an eye on, you need to know how to actually collect those monitoring statistics!

Monitoring MongoDB in real time

MongoDB includes a number of tools out of the box. These are all run against a live MongoDB server and report stats in real time:

mongostat – this shows key metrics like opcounts, lock %, memory usage and replica set status updating every second. It is useful for real time troubleshooting because you can see what is going on right now.

mongotop – whereas mongostat shows global server metrics, mongotop looks at the metrics on a collection level, specifically in relation to reads and writes. This helps to show where the most activity is.

rs.status() – this shows the status of the replica set from the viewpoint of the member you execute the command on. It’s useful to see the state of members and their oplog lag.

sh.status() – this shows the status of your sharded cluster, in particular the number of chunks per shard so you can see if things are balanced or not.
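As an example of turning this output into a number you can alert on, the sketch below works out how far each secondary is behind the primary from a document shaped like the output of rs.status() (the replSetGetStatus command). The sample data and the function name are made up for illustration; with a driver such as PyMongo you would typically obtain the real document with something like client.admin.command("replSetGetStatus").

```python
from datetime import datetime


def replication_lag_seconds(status):
    """Map each secondary's name to its lag behind the primary, in seconds.

    Assumes exactly one member is currently PRIMARY; raises StopIteration
    if there is none (e.g. during an election).
    """
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Hand-written sample using the name, stateStr and optimeDate fields.
status = {
    "members": [
        {"name": "db1:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2014, 9, 9, 12, 0, 30)},
        {"name": "db2:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2014, 9, 9, 12, 0, 25)},
        {"name": "db3:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2014, 9, 9, 12, 0, 30)},
    ]
}
print(replication_lag_seconds(status))
```

Note that on an idle replica set the optimes only advance when there are writes, so a small apparent lag is not always a problem.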

MongoDB monitoring, graphs and alerts

Although the above tools are useful for real time monitoring, you also need to keep track of statistics over time and get notified when metrics hit certain thresholds – some critical, some non-critical. This is where a monitoring tool such as Server Density comes in. We can collect all these statistics for you, allow you to configure alerts and dashboards and graph the data over time, all with minimal effort.

We make use of Puppet to manage all of our infrastructure and this acts as much of the documentation – what is installed, configuration, management of servers, dealing with failover and deploys – but there is still a need for other written docs. The most important is the incident response guide, which is the step by step checklist all our on-call team run through when an alert gets triggered.

Why do you need an incident response guide?

As your team grows, you can’t just rely on one or two people knowing everything about how to deal with incidents in an ad-hoc manner. Systems will become more complex and you’ll want to distribute responsibilities around team members, so not everyone will have the same knowledge. During an incident, it’s important that the right things get done in the right order. There are several things to remember:

Log everything you do. This is important so that other responders can get up to speed and know what has been done, but is also important to review after the incident is resolved so you can make improvements as part of the postmortem.

Know how to communicate internally and with end-users. You want to make sure you are as efficient as possible as a team, but also keep your end-users up to date so they know what is happening.

Know how to contact other team members. If the first responder needs help, you need a quick way to raise other team members.

All this is difficult to remember during the stress of an incident so what you need is an incident response guide. This is a short document that has clear steps that are always followed when an alert is triggered.

What should you have in your incident response guide?

Our incident response guide contains 6 steps which I’ve detailed below, expanded upon to give some insight into the reasoning. In the actual document, they are very short because you don’t want to have complex instructions to follow!

Log the incident in JIRA. We use JIRA for project management and so it makes sense to log all incidents there. We open the incident ticket as soon as the responder receives the alert and it contains the basic details from the alert. All further steps taken in diagnosing and fixing the problem are logged as comments. This allows us to refer to the incident by a unique ID, it allows other team members to track what is happening and it means we can link the incident to followup bug tasks or improvements as part of the postmortem.

Acknowledge the alert in PagerDuty. We don’t acknowledge alerts until the incident is logged because we link the acknowledgment with the incident. This tells other team members that the issue is actively being investigated, rather than that someone has accidentally acknowledged the alert and forgotten about it.

Log into the Ops War Room in Hipchat. We use Hipchat for real time team communication and have a separate “war room” which is used only for discussing ongoing incidents. We use sterile cockpit rules to prevent noise and also pipe alerts into that room. This allows us to see what is happening, sorted by timestamp. Often we will switch to using a phone call (usually via Skype because Google Hangouts still uses far too much CPU!) if we need to discuss something or coordinate certain actions, because speaking is faster than typing. Even so, we will still log the details in the relevant JIRA incident ticket.

Search the incident response Google Docs folder and check known issues. We have a list of known issues, e.g. debug branches deployed or known problems awaiting fixes, which sometimes result in on-call alerts. Most of the time, though, it is something unusual, so we have documentation on all possible alert types: you can search by error string to find the right document and the steps for debugging. Where possible we avoid paging real people for problems that can be fixed by an automated script, so these steps are usually debug steps to help track down where the problem is.

If the issue is affecting end-users, post to our status site. Due to the design of our systems, we very rarely have incidents which affect the use of our product. However, where there is a problem which causes customer impact, we post to our public status page. We try to provide as much detail as possible and post updates as soon as we know more, or at the very least every 30 minutes even if there is nothing new to report. It seems counter-intuitive that publicising your problems would be a good thing, but customers generally respond well to frequent updates so they know when problems are happening. This is no excuse for problems happening too frequently but when they do happen, customers want to know.

Replying to customer emails

Another note we have is regarding support tickets that come in reporting the issue. Inevitably some customers are not aware of your public status page and they’ll report any problems directly to you. We use Zendesk to set the first ticket as a “Problem” and direct the customer to our status page. Any further tickets can be set as “Incidents” of that “Problem” so when we solve the issue, we can do a mass reply to all linked tickets. Even though they can get the same info from the status page, it’s good practice to email customers too.

What do you have in your playbook?

Every company handles incidents differently. We’ve built this process up over years of experience, learning how others do things and understanding our own feelings when services we use have outages. You can do a lot to prevent outages but you can never eliminate them, so you need to spend just as much time planning the process for handling them. What do you have in your incident response processes? Leave a comment!

Now that we’re seeing intense competition in the cloud infrastructure market, each of the vendors is looking for as many ways to differentiate itself as possible. Big wallets are required to build the infrastructure and picking the right locations to deploy that capital is becoming an important choice. Cloud vendors can be innovative on a product or technical level, but location is just as important — which geographies does your cloud vendor have data centers in and why does that matter?

Why is location important?

There are a number of reasons why a diverse range of locations is important:

Redundancy: Compared to the chances of a server failure, whole data center outages are rare — but they can happen. In the case of power outages, software bugs or extreme weather, it’s important to be able to distribute your workloads across multiple, independent facilities. This is not just to get redundancy across data centers but also across geographies so you can avoid local issues like bad weather or electrical faults. You need data centers close enough to minimize latency but far enough to be separated by geography.

Data protection: Different types of data have different locality requirements e.g. requiring personal data to remain within the EU.

User latency: response times for the end user are very important in certain applications, so having data centers close to your users is important, and the ability to send traffic to different regions helps simplify this. CDNs can be used for some content but connectivity is often required to the source too.

Deploying data centers around the world is not cheap, and this is the area where the big cloud providers have an advantage. It is not just a case of equipping and staffing data centers — much of the innovation is coming from how efficient those facilities are. Whether that means using the local geography to make data centers green, or building your own power systems, this all contributes to driving down prices, which can only truly be done at scale.

How do the top providers perform?

The different providers all have the concept of regions, or data centers within a specific geography. Usually, each region is split into multiple zones or facilities so you can get redundancy within the region, but this is not sufficient for true redundancy because the whole region could fail, or there could be a local event like a storm. Therefore, counting true geographies is important:

Azure is in the lead with 12 regions followed by Softlayer (10), Amazon (8) and Rackspace (6). Google loses out, with only 3 regions.

Where is the investment going?

It’s somewhat surprising that Amazon has gone for so long with only a single region in Europe — although this may be about to change with evidence of a new region based in Germany. If you want redundancy then you really need at least 2 data centers nearby, otherwise latency will pose a problem. For example, replicating a production database between data centers will experience higher latency if you have to send data across the ocean (from the U.S. to Ireland, say). It’s much better to replicate between Ireland and Germany!
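
To put rough numbers on that, here is a small Python sketch that estimates a lower bound on round-trip time from great-circle distance alone. The coordinates (Ashburn in Virginia, Dublin and Frankfurt) are approximate and are our own choice of example locations, and the figure of 200,000 km/s for light in fibre is a common rule of thumb. Real latency will be higher because cables do not follow great circles and routers add delay.

```python
import math

EARTH_RADIUS_KM = 6371.0
FIBRE_SPEED_KM_S = 200_000.0  # rule of thumb: about two-thirds of c

# Approximate (latitude, longitude) pairs, chosen for illustration only.
ASHBURN = (39.04, -77.49)
DUBLIN = (53.35, -6.26)
FRANKFURT = (50.11, 8.68)


def great_circle_km(a, b):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def min_rtt_ms(a, b):
    """Lower bound on round-trip time in milliseconds over fibre."""
    return 2 * great_circle_km(a, b) / FIBRE_SPEED_KM_S * 1000


if __name__ == "__main__":
    print("US to Ireland:      %.1f ms" % min_rtt_ms(ASHBURN, DUBLIN))
    print("Ireland to Germany: %.1f ms" % min_rtt_ms(DUBLIN, FRANKFURT))
```

Under these assumptions the transatlantic floor is around five times the Ireland to Germany floor, before any real-world overhead is added, and that matters for anything that waits on a replication acknowledgment.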

What about China?

I have specifically excluded China from the figures above but it is still an interesting case. The problem is that while connectivity inside China is very good (in some regions), crossing the border can add significant latency and packet loss. Microsoft and Amazon both have regions within China, but they require a separate account and you usually have to be based in China to apply. Softlayer has announced a data center in Shanghai, so it will be interesting to see whether it can connect its global private network with good throughput. As for Google, it publicly left China 4 years ago so it may never launch a region there.

It’s clear that location is going to be a competitive advantage, one where Microsoft currently holds first place but will lose it to Softlayer soon. Given the amount of money being invested, it will be interesting to see where cloud availability expands to next.
