Two of Your Nines Don’t Need Five Nines

There’s a pretty good chance that you have a number of environments tagged with a five nines availability requirement that don’t actually need it. We often conflate criticality with availability and lose sight of the true cost of keeping application environments up at the ever elusive 99.999% uptime.

Breaking Down the Nines

It is eye opening to break down the actual downtime allowed, in minutes per year, for each level of nines as listed on Wikipedia (https://en.wikipedia.org/wiki/High_availability):

99% (“two nines”) allows roughly 3.65 days of downtime per year

99.9% (“three nines”) allows roughly 8.76 hours per year

99.99% (“four nines”) allows roughly 52.6 minutes per year

99.999% (“five nines”) allows roughly 5.26 minutes per year

So, when we think about the cost of maintaining an uptime level for an application, it becomes important to see the real numbers behind each availability figure. The cost of achieving 99.999% versus 99% is significant.
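To make those figures concrete, here is a quick sketch (purely illustrative) that derives the allowed annual downtime for each level of nines from the availability percentage:

```python
# Allowed downtime per year for common availability levels.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for nines in ["99", "99.9", "99.99", "99.999"]:
    availability = float(nines) / 100
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{nines}%: {downtime_min:,.1f} minutes/year (~{downtime_min / 60:,.1f} hours)")
```

Running this reproduces the familiar table: two nines allows over 5,000 minutes a year, while five nines allows barely 5.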

Five Nines, Nine to Five

This is the real situation that many of us need to deal with. While we talk about the criticality of an application environment, it’s often critical only when it’s in active use. Most folks would balk at a straight 99 percent uptime, with its 87.6 hours of annual downtime, as a suggested availability. Here’s the catch, though: what we are really looking for is five nines availability, but only during access hours. Many, if not most, of our internal business applications are only accessed during the day, inside office hours.

Even if we span time zones, the reality is that the applications sit idle for a decent amount of the day. Assuming your application needs to cover time zones that span a continent, you are probably covering a 10 hour business day with a maximum 5 hour variance, leaving about 9 hours a day when it is not needed for primary use. That means you can effectively sustain roughly 3,285 hours…yes, hours…of downtime per year. That’s over 197,000 minutes.
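The arithmetic behind that off-hours budget is simple enough to sketch, under the stated assumptions of a 10 hour business day plus a 5 hour time-zone spread:

```python
# Off-hours downtime budget for an app needed only during business hours.
# Assumptions: 10-hour business day plus a 5-hour time-zone variance,
# so the app must be up 15 hours/day and is idle the other 9.
HOURS_NEEDED_PER_DAY = 10 + 5
idle_hours_per_day = 24 - HOURS_NEEDED_PER_DAY   # 9 hours/day
idle_hours_per_year = idle_hours_per_day * 365   # 3,285 hours/year

print(idle_hours_per_year)        # 3285 hours
print(idle_hours_per_year * 60)   # 197100 minutes
```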

Does this mean we can shut them off? Well, not quite. Let’s talk about why.

The Highly Used Unused Application

Applications are considered active during a certain window, which I refer to as primary use. There is also a set of non-primary use processes that happen outside that window.

Backing up the environment is a good example of this. Backups tend to run off hours so as not to collide with primary use inside business hours. Security scans and other safety practices also take place outside of core hours to help with keeping application performance more stable during primary use hours.

Because of this, we still have a requirement to keep the application platforms available to run these other operational processes.

Scale-back Versus Shut Down

As your application environments are being architected or refactored, it is worth thinking about how a microservices approach can help with this issue. I know there are assumptions baked into choosing when system availability occurs, but the important part of this discussion is that you may be paying for a surprising amount of warranty on systems that don’t need it.

We’ve seen that we can’t really just power off servers a lot of the time because of the backups, security scans, and other non-primary use access. What we can do is use a scale-out and scale-back approach.

Web applications may need the back end continuously available, but at different levels of usage. During off hours, why not run fewer front-end servers? Data layers can stay up, but can also be scaled down.
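A minimal sketch of the idea, with hypothetical replica counts and business-hours window (the real numbers would come from your own usage patterns and scaling tooling):

```python
# Sketch: scale front-end replicas down during off hours while keeping
# the data layer up. Replica counts and hours are illustrative only.
from datetime import time

PEAK_REPLICAS = 8       # assumed front-end count during business hours
OFF_HOURS_REPLICAS = 2  # enough for backups, scans, and stragglers
BUSINESS_START, BUSINESS_END = time(7, 0), time(22, 0)  # 15-hour window

def desired_frontend_replicas(now: time) -> int:
    """Return the front-end replica count for the current time of day."""
    if BUSINESS_START <= now < BUSINESS_END:
        return PEAK_REPLICAS
    return OFF_HOURS_REPLICAS

print(desired_frontend_replicas(time(10, 0)))  # 8 during business hours
print(desired_frontend_replicas(time(2, 0)))   # 2 overnight
```

The same decision function could drive whatever orchestration layer you use; the point is that the environment never fully disappears, it just shrinks.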

Some applications, like file servers and other less variable workloads, will not do well in scale-up/scale-down scenarios. That’s OK. We have to accept that hybrid approaches are needed across all areas of IT.

Why is This Important?

Think ahead. When the architecture is being evaluated for production and disaster recovery, we should be thinking about primary use and availability as well as the non-primary use functions like data protection.

All of a sudden, those buzzwords like microservices and containers with infrastructure as code seem to make some sense. Should you be racing to refactor all of your apps? No. Should you be continuously evaluating the environment? Yes.

Most importantly, be aware of the true cost of the five nines and whether you really need it for all of your applications.

Interesting results from a Vision Solutions survey

I’m always watching for stats and numbers in the industry. We are continuously presented with the “fastest growing” and “x% better than the competitor” based on a number of sometimes skewed statistics. While I love what information I can gather from statistics for hardware and software, it is almost always based on sales.

When I was recently given some statistics by my friends at Vision Solutions, I really dug in, because these numbers presented some interesting views on what is happening in the technology landscape. Of course, even these numbers may be open to a certain amount of interpretation, but hopefully you can read some of the same information from them that I have.

Consistency across the results

This survey was done using 985 respondents who ranged in company size. Here is the description of the survey participants:

The interesting thing is that we see information which is contrary to much of what is trending in the world of marketing IT solutions. It isn’t that all-flash, or public cloud, or any of the specific big trends aren’t real. What it does show is that there is a big representation of the “middle class” of technology consumers.

51% indicated that storage growth was between 10-30% per year

Replication features were nearly even, at 39% for hardware and 35% for software

Tape isn’t dead – 81% of respondents use tape as part of their data protection solution

There are more details throughout, but these jumped out at me in particular.

Test the plan, don’t plan the test

A frightening number from the survey: 83% of respondents had no plan, or were less than 100% confident that their current plan was complete, tested, and ready to execute. One-word reaction to this: Wow!

As Mike Tyson said: “Everyone has a plan until they get punched in the mouth”. Many people I have spoken to have a long lead-up to their BCP/DR test, and a significant portion of the planning is one-time activities to ensure the test goes well. But that only satisfies the test; it does not build the self-healing infrastructure we should be working towards.

This is a clear sign in my mind that people are continually looking for resources and tools to build, or re-evaluate their BCP plan. Since I’m a Double-Take user, the combination of all this hits pretty close to home. I’ve been using products with online testing capability for a number of years which helps to increase the confidence for me and my team that we are protected in the event of a significant business disruption at the primary data center.

Enterprises love their pets

With years of work in Financial Services and enterprise business environments, I get to see the other side of the “pets versus cattle” debate which is the abundance of pets in a corporate data center. Sometimes I even think the cattle have names too.

Legacy application environments are a reality. Not every existing application has been or will be developed as a fully distributed n-tier application. A significant number of current and future applications are still deployed in the traditional model, with co-located servers, single instances, and other architectural constraints that don’t allow for the “cloudy” style of designing for failure.

There is nothing wrong with this environment for the organizations mapped to this model today. Most organizations are actively redesigning applications and rethinking their development practices, but the existence of legacy products and servers is a reality for some time to come.

Evolving application and data protection

I’m a fan of Double-Take, so I’m a little biased when I see great content from Vision Solutions 🙂 What I take away from this is that there are a lot of us who may not have the ideal plan in place, or may not have an effective plan in place at all for a BCP situation. Seeing the state of people’s preparation is only half of the story, though.

Having a plan is one thing, but seeing the results of real data loss, and the reasons behind it, is particularly important. Relying on manual processes is definitely a fast track to issues.

Beyond orchestration, the next step I recommend is using CDP (Continuous Data Protection) where possible. My protected content (servers, volumes and folders) is asynchronously replicated, plus I take daily snapshots of full servers and semi-hourly snapshots of file data. This ensures that multiple RPOs (Recovery Point Objectives) are available.

In the event of data corruption, the corruption would be immediately replicated…by design of the protection tool. Recovering from a previous snapshot, taken automatically before the corruption, removes the risk of a total data loss. Phew!
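The recovery-point selection logic is worth sketching, since it explains why replication alone isn’t enough. The snapshot catalog and timestamps below are hypothetical:

```python
# Sketch: pick the newest snapshot taken before corruption was detected.
# Replication copies the corruption immediately, so recovery has to come
# from a point-in-time snapshot rather than the live replica.
from datetime import datetime

snapshots = [  # hypothetical snapshot catalog, oldest to newest
    datetime(2015, 6, 1, 0, 0),
    datetime(2015, 6, 1, 6, 0),
    datetime(2015, 6, 1, 12, 0),
]
corruption_detected = datetime(2015, 6, 1, 13, 30)

# Newest recovery point that predates the corruption.
recovery_point = max(s for s in snapshots if s < corruption_detected)
print(recovery_point)  # 2015-06-01 12:00:00
```

The more frequent the snapshots, the smaller the window of lost data when you have to roll back.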

Ultimately, the onus is on us to enhance the plan, build the process, and evaluate the tools. If you want to find out more on how I’ve done data and server protection, please feel free to reach out to me (eric at discoposse dot com). If you want to find out more about the Vision Solutions portfolio and Double-Take, you can go right to the source at http://www.VisionSolutions.com, where there are some great whitepapers and resources to help out.

The mighty chain of IT: Where five 9 uptime falls apart

It used to be said that if you want something done right, you have to do it yourself. True words, but unfortunately that only works if you are ready to maintain the entire scope of build, deploy, monitor and support all by yourself.

Earlier this week, GoDaddy.com suffered from an outage which highlighted some significant worries for many people. Whether you were one of the millions of sites hosted by GoDaddy, or one of the millions of customers who use GoDaddy DNS services, you were the unintended victim of a brutal situation.

Regardless of the fact that it was an unexpected attack by a member of the famed Anonymous hacker group, the end result was the same for all of those customers (me included); we realized that the five 9s uptime promise is ultimately on a best effort basis.

Today, prominent programming author and blogger @JeffHicks was recovering from a hacked Delicio.us account that resulted in a Twitter blast of spam posts under his profile. While this doesn’t affect the uptime of any of Jeff’s services and sites, it speaks to the importance of the chain of IT.

Weakest Link

“A chain is only as strong as its weakest link”

We’ve all heard the phrase, quoted the phrase and seen the true result of it as well. After years of BCP/DR design and implementation I’ve had more than enough exposure to the SPOF (Single Point of Failure) concept, and the idea that interdependent systems reveal vulnerabilities that have to be understood.

If you were running your application infrastructure and counted on the GoDaddy 99.999% uptime “guarantee”, you have now become the SPOF to your customer. It wasn’t your fault, nor could you really have known you needed to plan around it. How much more than a five 9s uptime guarantee could you ask for?

LCD – Lowest Common Denominator

I wrote a series about BCP/DR geared towards the “101” crowd who may not have had exposure to a fully featured BCP program. In one of those posts I talked about how the Lowest Common Denominator is what you use to define the recoverability and reliability of your service.

As we evaluate our business and application infrastructure we have to understand every component that is involved to fully realize where we have exposure to failure or vulnerability.
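One way to see why the weakest link and the lowest common denominator matter: for serially dependent components, end-to-end availability is the product of the individual availabilities. The component list below is illustrative:

```python
# Sketch: end-to-end availability of serially dependent components is the
# product of their individual availabilities. A chain of good components
# is still weaker than its weakest link. Values are illustrative.
components = {
    "dns": 0.99999,
    "load_balancer": 0.9999,
    "app_server": 0.999,
    "database": 0.9999,
}

end_to_end = 1.0
for availability in components.values():
    end_to_end *= availability

print(f"{end_to_end:.5%}")  # ~99.879%, worse than any single component
```

Four respectable components, and the chain as a whole no longer reaches even three nines.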

Known knowns

Donald Rumsfeld had a great statement about what we know. This is what he said:

“There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say, we know there are some things we do not know. But there are also unknown unknowns, the ones we don’t know we don’t know.”

It’s a powerful statement and I’ve used it many times in presentations. I’ve been asked by management teams over and over again (usually immediately after a system failure): “How do we plan for unplanned outages?”.

It is ironic that we are trying to plan for something unplanned. The simplicity of the statement almost gives it an innocence. But there is truth to what it asks.

Test Driven Infrastructure

In a previous post about Test Driven Infrastructure, I promoted the use of TDD (Test Driven Development) methodologies for building infrastructure and application systems. It’s an important part of how we get to a four 9s or five 9s design. We cannot just throw down a “guarantee” or a “promise” of uptime if we do not fully understand what it means.

The ideal case for any system is that we design, build, and test for failure, and then, and only then, do we really see the potential uptime. If you’ve been involved in BCP, you also understand that there are levels of failure that we plan for. Some things are beyond our ability to plan for, or are so cost prohibitive that we can’t implement the “perfect design”.

So what do I tell my customer?

We can only speak to historical uptime. Have you heard the statement “past performance is no guarantee of future returns”? We also generally don’t expose our entire end-to-end system design to every customer, because it would be challenging, and nearly impossible to keep current as systems change over time.

As a provider of services (whatever those may be) you will be committed to some SLA (Service Level Agreement) and as a part of that agreement you will have metrics defined to say where we pass or fail. Another key part of that will be a definition of what we do when we miss an SLA. Do we define our SLA over a week, month, year? It’s a great and important question.
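The measurement window question has real teeth: the same percentage allows very different continuous outages depending on whether it is measured per week, month, or year. A quick illustrative calculation:

```python
# Allowed downtime for the same SLA measured over different windows --
# why "do we define our SLA over a week, month, year?" matters.
sla = 0.99999  # five nines

for label, hours in [("week", 7 * 24), ("month", 30 * 24), ("year", 365 * 24)]:
    allowed_seconds = hours * 3600 * (1 - sla)
    print(f"per {label}: {allowed_seconds:.1f} seconds of downtime")
```

A weekly window permits only about six seconds at a stretch, while a yearly window permits a single five-minute outage; the contract language changes what the provider can actually survive.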

What now?

I don’t want to sound like a negative Nelly, but I do want to raise awareness among designers, programmers, admins, architects and management everywhere that we need to do our best to be aware of vulnerabilities and exposures. This extends into privacy and security, which are ultimately part of the overall picture.

Not too long ago, Dropbox suffered an exposure because of the simplest possible thing: an employee password hack. Regardless of their globally distributed, highly available systems, a single password opened the door to a potentially fatal breach.

So, to refer back to Donald Rumsfeld, we have “unknown unknowns” that cannot be accounted for, but we also have many “known unknowns” that we can get closer to understanding and preparing for.

Break out the Visio diagrams and take a deeper look into where you may have some exposure. And as you do that, you may realize that it is just part of the design and is unavoidable, but it is better to know than to find out the hard way.

BCP/DR Primer – Part 5 – Test the plan, don’t plan the test

In the final post in the BCP/DR Primer series we are wrapping up with the final task in the BCP program: testing the BCP plan. There is some debate over the semantics of the word “test”; many BCP programs refer to these as “exercises” rather than “tests”, but regardless of the label we apply, this is the ultimate result that we must be able to get to.

Test the plan, do not plan the test

It’s a simple statement, but it could not be more important. The concept of the DR test is to apply the plan you have in place to test the recovery of your systems and prove the effectiveness of your plan. Much like Test Driven Development, you may find that you do not hit the mark in the first pass. This is important in illustrating that the BCP plan is an organic document that must continue to grow and develop over time.

The real heart of the phrase “test the plan, don’t plan the test” is that you should not be designing a test to be successful. What I mean by this is that you should be performing the recovery in as natural an order as possible. Many organizations even have some teams involved where they simulate a true recovery scenario by involving vendors and internal support teams with limited notice. What this does is add more realism to the test to ensure that following the plan will produce the result you desire.

Failure is not always failure

Failure may be a strong word, but when you are performing a recovery test and one of the components you have planned for fails in its recovery, you have not necessarily failed, but you have learned that the plan requires additional data. One of my colleagues likes to refer to these as “challenges” and not failures. We use the word “learnings” as well (which isn’t really a word, but we use it anyways).

The long and the short of it is that you must take the issue that caused either a delay or failure of a piece of your recovery and then adjust the documentation and the plan accordingly to work around it.

The Deadly Embrace – a recovery nightmare realized

One thing that will stop any plan in its tracks, regardless of its effectiveness, is the deadly embrace. This is a database term for two or more processes accessing a single item where neither will relent, and neither can continue until the other releases the item. In other words, a stalemate, or a deadlock. The end result: nobody wins.

If a deadly embrace arises during your BCP recovery, in a test or in a real event, it will stop your plan in its tracks and require you to effectively restart the entire process. During a recovery test, you most likely will not have enough time to re-run the entire test within your test window, which means you will have to close out the exercise and plan a second attempt when resources are available.

The Post Mortem

It isn’t just for Quincy M.D. anymore. The Post Mortem meeting is a necessity to evaluate the recovery test and take our learnings from the process so that we can apply them to the documents and the plan to alleviate those issues for future recovery tests, or more importantly a real recovery scenario.

Be prepared to have detailed discussions in this meeting, and be prepared to apply as much time as necessary to the process. If we do not open our minds and our plan up to what really took place during the recovery, we cannot be as effective as we need to be in maintaining the best possible recovery plan for your organization.

And you must remember that there are people processes involved here, not just IT processes. While our focus in the IT organization is the technical documentation and the bits-and-bytes portion of the recovery, many processes require warm bodies to execute them. Regardless of the most perfect technical recovery plan, without a person to implement it, the plan may as well be written in Sanskrit.

You’re never done

The interesting challenge with BCP/DR programs is that they are never completed. Because your IT and business are dynamic, your BCP/DR program must be too. Once you have your environment documented and tested, you can consider it complete as of a point in time. The next step is to keep the business engaged in updating, managing and revisiting your BCP plans with regularity, to keep the recovery plan as current as your production environment.

The last task in your Post Mortem meeting is setting the next planning meeting. No, seriously, you must maintain the momentum and focus to continue to keep the program active. Once you have your baseline work done it will be simpler.

Change Management and BCP

The best way to keep focus on your BCP program is to fully integrate it with your Change Management process. For each application and infrastructure change there should be a checkbox in the process which asks “does this change affect the BCP recovery plan?”. If you involve your business sponsors and application owners with BCP as part of their day-to-day processes for design and implementation then it will ease the pain and raise the awareness of the importance of the recovery planning and BCP program all around.

Want to talk?

I’ve been working with BCP programs for years, and the one thing that I have learned is that an outside opinion can be very helpful. Feel free to drop a comment to me or Tweet me and I would be happy to offer anything that I can to help you along the way.

Truthfully this could be a never ending set of posts, but my goal was to try to help those who either have little or no BCP experience to get to the first steps, or to formalize their process. The more we do, the better it is for all of us in our respective organizations.

BCP/DR Primer – Part 4 – Application Recovery Document and “The Plan”

So let’s review how we have come to this point in the process. In Part 1 we discussed the general strategy of the BCP program and the IT participation in the process.

In Part 2 we moved into the definition of the BCP Tiers and the factors involved to begin mapping our systems and requirements.

Then in Part 3 we built the BCP Recoverability Matrix. In Part 4 we need to delve into the applications and detailed documents associated to them.

The Application Recovery Document

It cannot be stated enough that the key feature of an effective BCP program is effective documentation. The Application Recovery Document (shortened to ARD for ease of writing) is going to be the most important document in the overall document store.

What I mean by saying the ARD is the most important is that it is designed to be a self-contained recovery instruction set for each individual system. For this reason, the ARD is also the most challenging to put together.

In an article I wrote about Writing for your Audience, I brought up that we often refer to having to “write it so a monkey could do it based on your instructions”. This is one instance where the level of detail is critical. We have to assume that in a disaster event we may not have our key resource, the Subject Matter Expert (SME) for each system, available to perform the recovery.

To get you started, I’ve uploaded an ARD template here which should be a good baseline to work from. I encourage you to read it through thoroughly and add any customizations that will make it as readable and usable as possible for your needs. Ultimately, this is the crust to your finest pie. Foundation is everything.

Inch by Inch, Row by Row

Much like the nursery school rhyme about a garden, you will build the ARD for each system/application listed in your recovery with detail and order. Each section of the document clearly defines its point in the recovery process, and the specific steps will be listed throughout.

Most of this portion of the BCP program must be flavoured to taste based on your specific business requirements, technical requirements and ability to capture and convey clear instructions and process flow.

You will include every relevant piece of information including batch processes, backup processes, operational details, data management requirements, dependencies (this is very important!) and line by line detailed steps to complete a full system recovery.

Dependencies are particularly important because as you build the detailed recovery timeline, you will have full details about recovery order based on dependencies and requirements of your systems.
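Once dependencies are captured in each ARD, a recovery order falls out of them mechanically. A sketch of the idea using Python’s standard-library topological sorter; the application names and dependency map are hypothetical:

```python
# Sketch: derive a recovery order from the dependency details captured
# in each ARD. Names and the dependency map are hypothetical examples.
from graphlib import TopologicalSorter  # Python 3.9+

# app -> set of systems it depends on (must be recovered first)
dependencies = {
    "web_app": {"app_db", "auth_service"},
    "auth_service": {"directory"},
    "app_db": {"storage"},
    "directory": set(),
    "storage": set(),
}

recovery_order = list(TopologicalSorter(dependencies).static_order())
print(recovery_order)  # dependencies always appear before dependents
```

A circular dependency in the map raises an error here, which in BCP terms is exactly the deadly embrace you want to find on paper rather than during a recovery.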

Coming Home

As you can see, there is a section called Return to Normal/Home Phase, which is also an absolutely important detail in the recovery process. Remember that BCP recovery isn’t just a one way trip. At some point you will be recovering your environment again in the primary data center.

During BCP tests you will also follow the recovery-to-home process. In fact, you may have specific instructions for test recovery which differ from normal production recovery: data used for a test fail-over is not necessarily designed to be brought back to production, whereas during a true fail-over you will want to preserve changed data and bring it back to your primary site.

Storing the documents

Another important part of your BCP program is the storage of your recovery plans, BIA, contact lists and fundamental information required to complete the secondary site recovery. There are numerous systems which are available to store, replicate and manage your documents to have them available during test and real recovery scenarios.

You are encouraged to look at many alternatives, and if you have a vendor of choice for your BCP site recovery, you should work with them or an appropriately skilled team to manage this part of the overall program.

Please also be aware of any legal requirements and limitations for information storage, and specifically on contact information. Your legal department and Human Resources will inevitably be involved in the BCP program, and you should ensure that they are fully aware of the details being held in your document repository. There may be regulations based on your geography or industry that will differ from others.

I love it when a plan comes together

John “Hannibal” Smith couldn’t have said it better. While the ARD and the infrastructure components are important, they are of little value unless you can piece together the recovery sequence with a plan. Your BCP coordinator will either be, or work closely with your Project Management Office (PMO) for designing and maintaining the recovery test plans and the full fledged recovery plan.

We speak of the “test plan” versus the real plan, but in fact they should be one and the same. The only difference with the test recovery is some of the data recovery processes, which will be clearly documented in the ARD for each system.

Please excuse the vagueness of this as the template for a BCP recovery plan can differ greatly from organization to organization. If you would like some samples I would be happy to provide some if you Tweet me.

What’s Next?

This portion of the post series was lighter on detail because at this point it will be very organization-specific with the high level details such as vendor choices, recovery site layout and other personal choices in the BCP program. It is also heavily affected by budget constraint. Now that we have built the fundamentals of what goes into the BCP design, you are able to apply your own organizational detail based on your requirements, ability, vendor choice and vendor involvement level.

The final post in the series will talk about the BCP recovery test process and bringing all of the steps together. We’ve built a great foundation of detail, documentation, knowledge capture and increased familiarity of our systems. Now we just have to put the plan to the test to validate what we’ve done.

BCP/DR Primer – Part 3 – The BCP Recoverability Matrix

In our last post we laid out the RPO and RTO of a couple of systems to illustrate how we can map our systems into BCP tiers. Now that we have defined our current state of the core IT applications and services, we will examine the requirements and limitations and define how we can maintain, or increase, the recoverability and redundancy of those systems.

Defining your IT BCP capabilities: The BCP Recoverability Matrix

The focus of this series is on the IT systems and how you can define and increase the availability and recoverability to meet your business requirements for alternate site recovery. I’ve created a template that I’ve used for many years which provides a simplified view of your environment. The spreadsheet is called the BCP Recoverability Matrix. It is a Microsoft Excel 2010 XLSX file so you may find that it appears as a ZIP file. Just rename the extension to .XLSX after download. The matrix can be found HERE.

The matrix spreadsheet has 3 tabs: Current State, Desired State and Application List. The Application List will hold the information about each application from your BIA. This is the snapshot of information you need to place each application into the matrix and view its overall recoverability level.

Here is what the matrix looks like. Again, remember that we use the lowest common denominator of RPO and RTO to show the actual place on the matrix, so these factors will define the BCP Tier (color coded) as the application appears in the matrix.

Oxygen Services

The first thing we have to do is define all of the core IT services, or as we call them, Oxygen Services, which are required to begin recovery of the business application systems. Without the Oxygen Services we would have no method by which to begin recovery. Many of these services are already Tier 1 and fully automated, but it is still of absolute importance that we document each system in the matrix.

Business Applications and Dependency Applications

Once we have our core infrastructure and recovery infrastructure mapped out, we are tasked with putting the business applications onto the matrix. The key part of understanding a business application is understanding the underlying dependencies required to bring it, or keep it, online.

Take a web application. While the business sponsor may understand it to be simply an application server, there may be multiple dependency applications, database connections, third-party connections, firewall and VLAN considerations, and much more. The goal of the BIA is to have the business define their application needs; from that, we fill in the blanks to build the dependency diagrams and document any downstream services.

This series is about the process of defining BCP, so you may already have your own methods, or you may want to search out software solutions to fully and effectively document the application environment. There are lots of great solutions available to assist with the process, and you may find that you wish to build your own depending on how “original” your configuration is.

The absolutely most important part of any BCP program is clear and effective documentation of your systems. And you also have to be able to access these documents and information in your recovery sites, so consider that requirement when you are looking at document management and information storage for your BCP program.

Once you are able to map out the dependencies you will meet with your business representatives to confirm that the resulting recovery time-lines meet their expectations and requirements.

Understanding People and Prioritization

There is one thing that you will learn very quickly during this process: people do not want to pay for, wait for, or invest time in recovery. That’s a bold statement, but you have to understand that a business sponsor has one responsibility: conduct business. Their focus is on the business and people process, and they look to us as IT SMEs (Subject Matter Experts) to provide technical solutions for business problems.

Along with providing daily operational support for their environment, there is a need, and sometimes an assumption, that BCP is “built-in” to their application environment. During the BIA you’ve discovered their business needs, and now that we have laid out the technical dependencies, we may be presenting them with unfortunate news about how quickly their particular system can actually be recovered.

Also, when you tell someone that it will take between 4 and 24 hours to recover a system, guess when they will expect it to be available? Your phone will begin ringing at 4 hours and 1 minute asking “is it online yet?”. A key phrase you will learn from this, if you don’t already know it, is “management of expectations”. This is where you, as the IT organization, must raise the business’s awareness and understanding of what will take place during a recovery.

So where the BIA collected information, and you have documented the current state of recoverability of each system, you will now meet with the business to evaluate their comfort with what can be done in the event of initiating the BCP plan. This may be a rude awakening for some on where there are cracks in the armor of a particular system, and it will potentially introduce additional cost to the business.

One more thing you may learn about many business representatives is that they do not wish to participate in this process. It’s not that they don’t want to be able to recover their systems; as mentioned earlier, there may be an assumption that this should just be part of normal operations and thus shouldn’t require more interaction from the business to make it happen. Remember that while we are here to enable business through technology, they have one single goal: to run the business.

Up and Over

When we look at the BCP Recoverability Matrix, we see that the categories are set up so that RPO descends downwards and RTO extends to the right. What we would like to do with systems on this matrix is move them higher in both RPO and RTO to increase recoverability and reduce the manual interaction required to make recovery happen.

The goal of the up and over process is to reduce the effort required by IT to deliver the recoverability the business requires, and to do so with as little expense to the business organization as possible. Ultimately, we want to reduce the work required by our resources.

Under-Promise, Over-Deliver

Have you ever heard that phrase? BCP isn’t the only place to apply that tactic, but it is certainly one of the most important. If the business requires only 24-hour-old data (aka “tape”), we can still replicate the data asynchronously. We will meet and exceed the requirements, but it may not be necessary to set the business expectation at near-zero data loss.

Why would we not attach the new capability to the plan? Great question. As we discussed when mapping the recovery Tiers, you do not want to suddenly make the requirement a sub-4-hour RPO, which could place undue stress on your team and your infrastructure. If you can avoid committing to that when the business doesn’t absolutely need it, you have just saved yourself, and your business sponsor, some grief in a recovery situation.

What’s Next?

We have a pretty good view of the overall requirements at this point. The next post will introduce the Application Recovery Document and defining the recovery schedule for our overall matrix.