The most important aspect of nearly every network is availability. Performance, scalability, manageability, agility, and so on all require the network to actually be online. In conversations with Gartner clients, availability comes up often, a theme echoed in surveys we’ve done:

I would argue that availability is weighted higher than 20%, but clients don’t score it that way because a) it is assumed to be foundational for all vendors and hence is not perceived as a major differentiator, and/or b) all the hype around SDN has people focused on agility and orchestration. Availability is relatively boring compared with other “cool” stuff like SDN and disaggregation, unless you’re talking about the Netflix Chaos Monkey (more on that below).

When talking to clients, availability often comes up after an outage. In some cases, network outages drive significant investment from IT, particularly in the DDI market (“..we could never cost-justify DDI until we fat-fingered our public website A-record…”). The undisputed #1 cause of network outages is human error, with estimates as high as 32% according to Dimension Data’s 2014 Network Barometer report; a study from Avaya likewise found that 82% of respondents experienced network downtime due to human error. In my 16+ years running large corporate networks, there was no worse feeling than the post-mortem meeting after a big outage. I’ll never forget one particular meeting in which my CIO said, “…well that was just plain stupid…”. Fortunately, that only happened to me a very small number of times. And we have research that can help you avoid network outages, including:

Summary: While businesses invest in their networks to gain a competitive edge, they often fail to ensure adequate steps are taken to reduce outages. Gartner’s four-step network configuration and change management (NCCM) approach enables network staff to minimize infrastructure failure.

Summary: In the developed world, the marginal cost of bandwidth is so low that rightsizing capacity has little impact on WAN cost. However, the cost of improving availability remains high and downtime is less acceptable, making rightsizing network availability the key goal for enterprise network designers.

Summary: Antifragile systems turn stress and adversity into advantage. Certain practices of Web-scale IT enterprises may be emulated by other IT organizations to enhance their antifragility, especially as part of their continual improvement, DevOps and digital business initiatives.

Regards, Andrew

PS – If you have a really good and/or funny outage story, feel free to include it in the comments; a prize will be sent to the best one…

Andrew Lerner
Research Vice President | 4 years at Gartner | 19 years IT Industry

Andrew Lerner is a Vice President in Gartner Research. He covers enterprise networking, including data center, campus and WAN, with a focus on emerging technologies (SDN, SD-WAN, and intent-based networking).

One of the IT admins was working on a batch file that, upon login, deleted temporary files. She decided to have me (the intern) test it. So I walked over to the area where the semi-trucks check out (fleet maintenance) and logged in.

Within 30 seconds all the computers were down. The outage persisted for a few hours. Lots of unhappy truckers and IT admins. Backups from tape were involved. Not fun.

Their senior engineer finally figured it out. The script had been written to delete files from the f:/temp drive where tmp files were stored. However, a) the script used a recursive directory switch, and b) I had admin rights but there was no temp drive on that particular server, so the delete ran somewhere it never should have.
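The story above describes a classic cleanup-script hazard: a delete that assumes its target directory exists on every machine. Below is a minimal, hypothetical sketch of a defensive version; the paths and function name are illustrative, not from the original incident, and the original was a Windows batch file rather than a POSIX shell script.

```shell
#!/bin/sh
# Hypothetical reconstruction of the failure mode described above.
# The incident script assumed a dedicated temp location existed on every
# machine; when it didn't, a recursive delete ran against the wrong place.

TEMP_DIR="/tmp/cleanup-demo/temp"   # illustrative stand-in for f:/temp

# Defensive version: refuse to delete unless the target is an existing
# directory, and delete its contents only -- never the parent.
cleanup() {
    target="$1"
    if [ -z "$target" ] || [ ! -d "$target" ]; then
        echo "refusing to delete: '$target' is missing" >&2
        return 1
    fi
    rm -rf -- "$target"/*
}

mkdir -p "$TEMP_DIR"
touch "$TEMP_DIR/scratch.tmp"
cleanup "$TEMP_DIR" && echo "cleaned $TEMP_DIR"
cleanup "/tmp/cleanup-demo/does-not-exist" || echo "guard stopped the delete"
```

The guard is the whole point: without the `[ ! -d "$target" ]` check, an unset variable or a missing drive turns a routine cleanup into the multi-hour, restore-from-tape outage described in the comment.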

I will say that management agility (ease of configuration), capacity, and performance all have ties to availability. The number one cause of most issues, including network issues, is probably human error. If you decrease complexity, you will have fewer human-error problems and thus higher availability. In the same way, as DDoS attacks become more common, having more performance and capacity gives you better resiliency against such attacks, again increasing availability.

I guess what I’m saying is that availability can be a concern as a part of those other areas. If you want better agility so you can reduce human error, or increased performance/capacity so you can handle larger attacks, then you’re really doing it for availability, even if that isn’t what you’re saying.

Karl, thanks. Great comment, and I agree with you on both counts. Simplifying the network reduces human error, and increasing capacity/performance (surface area) improves resiliency. As a matter of fact, I’ve written about both of these specifically…
