Human Error: The Plague of Your Network

For the network team of any large enterprise, nothing is dreaded more than the post-mortem meeting after a big data center outage. These meetings are especially loathsome when an outage is the result of human error. Here at Forward Networks, we are deeply focused on helping our customers eliminate the outage-inducing human error that plagues their networks. This is a complex and largely unsolved problem.

Despite advances in data center automation, the vast majority of networking infrastructure still relies on humans to build and maintain it. And where there is human involvement, there is invariably human error. The advisory group Uptime Institute claims that roughly 70% of data center failures are caused by some form of human error. This statistic is all the more startling when combined with a separate Infonetics Research survey finding that large companies lose as much as $100 million per year to such outages. This is a tremendous amount of potentially avoidable cost.

Even the most skilled network engineer can mistype a filter list, fat-finger an IP address, or misconfigure a trunk. These errors can have a cascading effect on the behavior of hundreds or thousands of network devices (switches, routers, load balancers, and firewalls). Sometimes, a far more complex error can lie dormant for months until some triggering event exposes it and causes the network to fail. When an outage does occur, network teams must conduct a box-by-box search to find the root cause and remediate the problem as quickly as possible.

From the simplest to the most complex of cases, human error in a data center’s network config can have absolutely disastrous business consequences. So how can organizations take steps to eliminate it? Allow me to explain:

In the world of software, developers constantly unit test each new line of code or module they write to ensure it behaves correctly and meets its intended design. They also routinely run integration tests to determine whether any new code module will break the logic of the rest of the codebase.

In the field of networking, however, this type of testing simply doesn’t exist. The rules that determine how packets are processed and forwarded in a network are distributed across hardware from different vendors. Each device is configured through its own archaic command line interface (CLI) format and carries its own forwarding tables, filter rules, and other policies. When a network engineer or operator needs to verify a config or debug an issue, they are forced to use ancient tools like ping, traceroute, and NetFlow, which are poorly suited to diagnosing large and complex networks.

In most cases, when a network team needs to make a configuration change, they must first test that change in a separate test environment designed to replicate the production equipment and traffic behavior. This is resource intensive, time consuming, and expensive. Moreover, there is no assurance that a test network faithfully represents the production environment’s behavior and traffic flow.

When a change is required immediately, it is typically rolled into the production network without prior testing. The network team has zero guarantee that this unverified change won’t trigger device failures and some catastrophic outage. Furthermore, they must answer to upper management for the consequences of any untested change. While an untested change would be anathema to a software developer, it is a common reality for network engineers and operators.

Organizations may also choose to use a network emulator or an open-source simulation tool in an attempt to replicate the network and validate changes. But these tools lack sufficient scalability and contextual understanding of a large network’s forwarding behavior and traffic flow. There simply isn’t an easy way to verify network correctness at scale across all vendor hardware. Equally, there isn’t a quick, efficient, and reliable way to test a network configuration change before it is rolled into production.

For two and a half years, the team here has been building a breakthrough platform designed to make enterprise networks as testable as software code. This may sound simple in concept, but it has been monumentally difficult to make a reality. Our team is staffed with 7 PhDs and some of the brightest networking talent from Google, Facebook, Microsoft, Apple, and Cisco. The product they have built is a remarkable feat of engineering. Our platform gives customers a complete and behaviorally accurate copy of their production network in software.

With the Forward Networks platform, organizations can move from a reactive and largely ad hoc approach to network management to one that is focused on proactive error detection, correction, and prevention. Whether it is a simple MTU misconfiguration or a more complex routing loop, we aim to eliminate the human error and configuration issues that plague networks and lead to costly data center outages.
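To make the software analogy concrete, here is a minimal sketch of what "unit testing" a network could look like on a toy model. It is purely illustrative and is not how the Forward Networks platform works: the device names, the `trace` and `mtu_mismatches` helpers, and the dictionary-based forwarding tables are all assumptions invented for this example. It checks for the two errors mentioned above, a routing loop and an MTU mismatch.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy model: device -> {destination prefix -> next-hop device}.
# A next hop of None means the device delivers the packet locally.
ForwardingTables = Dict[str, Dict[str, Optional[str]]]
# Hypothetical link record: (device_a, mtu_a, device_b, mtu_b).
Link = Tuple[str, int, str, int]


def trace(tables: ForwardingTables, start: str, dest: str) -> Tuple[List[str], str]:
    """Follow next hops from `start` toward `dest`.

    Returns the path taken and one of "delivered", "dropped", or "loop".
    """
    path: List[str] = []
    current = start
    while True:
        if current in path:
            # We have come back to a device already visited: routing loop.
            return path + [current], "loop"
        path.append(current)
        routes = tables.get(current, {})
        if dest not in routes:
            # No matching route on this device: the packet is dropped.
            return path, "dropped"
        next_hop = routes[dest]
        if next_hop is None:
            return path, "delivered"
        current = next_hop


def mtu_mismatches(links: List[Link]) -> List[Link]:
    """Return every link whose two ends are configured with different MTUs."""
    return [link for link in links if link[1] != link[3]]


DEST = "10.0.2.0/24"

healthy: ForwardingTables = {
    "leaf1": {DEST: "spine1"},
    "spine1": {DEST: "leaf2"},
    "leaf2": {DEST: None},
}

# Same network after a fat-fingered next hop on spine1.
broken: ForwardingTables = {
    "leaf1": {DEST: "spine1"},
    "spine1": {DEST: "leaf1"},
    "leaf2": {DEST: None},
}

links: List[Link] = [
    ("leaf1", 9000, "spine1", 9000),
    ("spine1", 9000, "leaf2", 1500),  # one side left at the default MTU
]

# The "unit tests": assertions a proposed change would have to pass.
assert trace(healthy, "leaf1", DEST) == (["leaf1", "spine1", "leaf2"], "delivered")
assert trace(broken, "leaf1", DEST) == (["leaf1", "spine1", "leaf1"], "loop")
assert mtu_mismatches(links) == [("spine1", 9000, "leaf2", 1500)]
```

A real network is far harder to model than this, since its behavior is spread across vendor-specific configs and state rather than a tidy dictionary. The point is only that once a network's forwarding behavior is represented in software, checks like these can run before a change reaches production.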

I can’t wait to show you the product in action when we officially launch later this year. Until then, feel free to sign up and become a beta customer of the Forward Networks platform here.