Perspectives on IT Outages: What Do They Mean For Your Business?

Reliable uptime is key for ensuring smooth business processing and quality customer relationships. Without uptime, we can’t meet our customers’ changing needs and can’t keep track of what’s happening by the minute across our network. We lose sight—or even control—of the safety of our resources and assets. In short: downtime is scary.

But what are the more specific organizational and technical ramifications of extreme IT outages? Let’s hear from industry experts from different backgrounds as they explain why IT outages are so detrimental to a company’s holistic and continued success in the marketplace—and what you can do to prevent an outage at your company.

Robin Tatam, CISM, Director of Security Technologies, PowerTech

Any major IT outage, regardless of whether it comes as a result of a malicious cyber-attack, inadequate planning, or unfortunate oversight, can cost an organization tens of millions of dollars, which only adds insult to injury. Financial fallout may involve intangible losses, such as missed business opportunities and damaged marketplace reputation, in addition to a more tangible impact, like a massive, public corporate fine. A dramatic outage within governments, financial institutions, or infrastructure providers might even undermine economic stability or result in widespread panic.

From a security standpoint, many people mistakenly believe that compliance standards are only concerned with electronic data protection. Most regulatory mandates as we know them were born out of dramatic, real-world failures. The goal of these mandates is to prevent recurrence: to force us to mitigate unacceptable risk by validating the deployment of effective safeguards that ensure organizations remain securely “open for business.” Servers must be available to serve data; data must be accurate and visible to the desired audience. Business resiliency comes from the development of and adherence to disaster recovery and high availability plans.

The last few years, which have been full of data breaches, should have taught us that we must now add security considerations to detect, prevent, and respond to illicit access from others. And thousands of organizations have also discovered, often painfully, that effective resiliency planning incorporates what to do when all else fails. As commonplace as it seems to be, an emergency response plan should never be written when you’re already neck-deep in a major disaster.

Sadly, many enterprises wait until disaster strikes before allocating the necessary funds to secure their systems. Resiliency expenditure should be considered a business cost and budgeted for generously. Minimizing risk to IT infrastructure requires a proactive approach to hardware, software, and skilled staff resources. Of course, we are not able to anticipate every event, but most examples of failure are not the first of their kind. Learn from the mistakes and oversights of others rather than passively hoping that it doesn’t happen to you next. Who will eventually pay the price for these disasters? It’s possible that a portion of the costs from a major data breach or outage may affect executive remuneration, but the brunt of increased expenditure and reduced profits is far more likely to be passed along as a broad increase in the cost of living for all of us.

Tom Huntington, Vice President of Technical Services, HelpSystems

Many large corporations “choke” the costs out of running older technology on which they rely for everyday business tasks. Eventually, this old technology is on an unsupported OS level or hardware that is so out of date that the business can no longer find reasonable support for it. The project of removing this older technology and upgrading to a newer system is monumental. When they approach it, organizations must account for the fact that critical business is still running on these legacy servers. Proper backups, security, and staying current on OS levels are all critical pieces of upgrading the systems.

When an industry that, for better or worse, still relies on older technology tries to upgrade its systems, the failure to stay current on technology and the lack of proper DR plans often lead to unexpected or even disastrous consequences.

Richard Schoen, Director of Document Management Technologies, RJS

Outages often happen when mission-critical applications are upgraded. Before you update your applications, it is crucial to test the update in a test environment. Users of any business platform can prepare for system failure by maintaining redundant disaster recovery and test environments, along with a plan that is tested annually or semi-annually, to make sure their systems will survive a large-scale outage.

In some cases, a smaller business may not survive a large system failure if it has not done at least some level of disaster preparation. In the HelpSystems world of systems management, security, business intelligence, and document management, this means making sure our customers are educated on the need for proper backup systems because of the 24x7 nature of many of their business systems. It also means they need to be thinking regularly about the repercussions if their core systems get compromised or go offline. Major outages that have been in the news recently underscore why all businesses need to regularly look at their IT infrastructure and consider what would happen if a key business system went offline.

Kevin Jackson, Technical Solutions Consultant, InterMapper

Many outages are caused by rolling out an upgrade without a rollback option. It is common knowledge that you should run software in a parallel development environment where you can test and QA a product before applying it to production. There must also be a fallback plan in place in case an outage happens after upgrading, so that you can roll the software back to a “last known good configuration.”
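As a rough illustration of the “last known good configuration” idea, here is a minimal Python sketch. The `Deployment` class, the `upgrade` method, and the `health_check` callback are hypothetical names invented for this example, not part of any product mentioned here, and a real rollback would also have to cover data, schemas, and dependent services.

```python
import copy
from typing import Callable, Dict

Config = Dict[str, str]


class Deployment:
    """Hypothetical helper that remembers the last configuration known to be healthy."""

    def __init__(self, initial: Config) -> None:
        # Assumption: the initial configuration is already known to work.
        self.active: Config = copy.deepcopy(initial)
        self.last_known_good: Config = copy.deepcopy(initial)

    def upgrade(self, new_config: Config,
                health_check: Callable[[Config], bool]) -> bool:
        """Apply new_config; restore the last known good config if the check fails."""
        self.active = copy.deepcopy(new_config)
        try:
            healthy = health_check(self.active)
        except Exception:
            # A health check that crashes is treated the same as a failed one.
            healthy = False

        if healthy:
            # Only a verified configuration becomes the new fallback point.
            self.last_known_good = copy.deepcopy(self.active)
            return True

        # Roll back: the failed configuration never replaces the fallback.
        self.active = copy.deepcopy(self.last_known_good)
        return False
```

The design choice worth noting is that the fallback point only moves forward after a successful check, so a bad upgrade can never overwrite the configuration you would roll back to.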

The financial industry, in particular, tends to run many important business applications on antiquated infrastructure. This doesn’t make much sense because financial institutions tend to have substantial IT budgets, which should be used for more regular infrastructure updates.

Here are some things that businesses that have faced major IT outages can do to prevent them from happening again:

Disaster Recovery plans

Point-in-time system backups

Risk management plans

QA in a development environment for testing

Network security and network management systems

Pat Cameron, Director of Automation Technologies, Automate Schedule

Downtime can be a painful experience for you and for your customers. When designing a production environment, especially one that is customer-facing, you should include a high availability option in your configuration. Admittedly, it will make your original installation more expensive, but it will quickly pay for itself if you experience problems on your production servers.

Enterprise job scheduling sounds like a pretty mundane area of IT operations these days but, if your job scheduler stops working, none of those end-of-day processes will complete. None of your ETL processes will complete. None of your file transfers will occur on time. And so on. Most business processes these days are driven by events that need to occur in a particular order and, if they don’t, small problems very quickly turn into critical problems and ultimately lead to downtime for your critical applications.
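To make the ordering point concrete, the following Python sketch runs jobs in dependency order and shows how one failure blocks everything downstream. It is an illustration only, not how any particular scheduler works: the `run_jobs` function and the job names used with it are hypothetical, and it relies on the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set


def run_jobs(dependencies: Dict[str, Set[str]],
             actions: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run each job after its prerequisites; skip jobs whose prerequisites did not succeed.

    dependencies maps a job name to the set of jobs that must finish first.
    actions maps a job name to a callable that returns True on success.
    """
    status: Dict[str, str] = {}
    # static_order() yields every job after all of its prerequisites.
    for job in TopologicalSorter(dependencies).static_order():
        prereqs = dependencies.get(job, set())
        if any(status.get(p) != "ok" for p in prereqs):
            # An upstream failure (or skip) propagates to every dependent job.
            status[job] = "skipped"
            continue
        try:
            # A missing action raises KeyError and is treated as a failure.
            succeeded = actions[job]()
        except Exception:
            succeeded = False
        status[job] = "ok" if succeeded else "failed"
    return status
```

For example, with a hypothetical chain in which `etl` depends on `end_of_day` and `file_transfer` depends on `etl`, a failed `end_of_day` job leaves both downstream jobs marked `skipped`, which is how a small problem becomes a critical one.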

Learn from others’ mistakes: design your enterprise scheduler with redundancy from the start. Don’t wait until disaster occurs to add that HA server to your plan. Test, test, and test some more before installing updated software on your production systems. Use your job scheduler in your test environment to run those scripts until you are absolutely sure that the update will succeed. Use your job scheduler to back up that production system before any changes occur, as well. The more that you automate, the less chance there is for errors to occur.

Mike Stegeman, Senior Data Access Consultant, SEQUEL

Many outages occur because organizations wait too long between updates of their applications and hardware. After major public outages, it sometimes comes out in the news that an organization’s infrastructure was not adequate by today’s standards. This raises the question: was the infrastructure itself old? Or was the way in which it was being utilized out of date? Perhaps both?

It is also possible that IT departments charged with upgrading software infrastructure are not properly trained or staffed. There are solutions available to automate the upgrade process. For example, the new boxes for IBM i have the ability to roll over seamlessly without the users noticing. An HA-type system that rolls over to the upgrade avoids the save/restore situation that can result in major outages. In any case, testing of upgrades and HA solutions is important, especially for public companies that affect and rely on public trust.

The Round-Up

Our experts have unique perspectives on IT outages based on their areas of expertise. However, many of the same lessons shine through in their responses. Here are some key things to remember as you plan for—and prevent—IT outages:

Add a high availability server to your environment as part of a risk management plan