Tag: RTO

In my previous post, I described how to migrate workloads from VMware to Azure. In this tutorial, I explain how to migrate Amazon Web Services (AWS) EC2 virtual machines (VMs) to Azure VMs by using Azure Site Recovery.

Supported workloads that can be migrated:

Windows Server 2016 or later version

Red Hat Enterprise Linux 6.7

Prerequisites

The Mobility service must be installed on each VM that you want to replicate. Site Recovery installs this service automatically when you enable replication for the VM.

For non-domain-joined Windows VMs, disable Remote User Access control on the local machine. In the registry, under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System, add the DWORD entry LocalAccountTokenFilterPolicy and set its value to 1.
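The registry change above can also be scripted. Here is a minimal sketch that builds the arguments for the built-in Windows reg add command. The key path and value name come from the prerequisite above; the helper name build_reg_add_command is my own, and the script only touches the registry when it runs on Windows from an elevated prompt.

```python
import subprocess
import sys

# Key path and value name are taken from the prerequisite above.
KEY_PATH = r"HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System"
VALUE_NAME = "LocalAccountTokenFilterPolicy"


def build_reg_add_command(value: int = 1) -> list[str]:
    """Return the argument list for the Windows `reg add` command (hypothetical helper)."""
    return [
        "reg", "add", KEY_PATH,
        "/v", VALUE_NAME,    # name of the registry value
        "/t", "REG_DWORD",   # value type
        "/d", str(value),    # value data
        "/f",                # overwrite an existing entry without prompting
    ]


if __name__ == "__main__":
    command = build_reg_add_command()
    print(" ".join(command))
    # Only modify the registry on Windows; run from an elevated prompt.
    if sys.platform == "win32":
        subprocess.run(command, check=True)
```

On any other platform the script just prints the command, so you can review it before running it on the source VM.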

A separate EC2 instance in your AWS account to use as the Site Recovery configuration server. This instance must be running Windows Server 2012 R2.

Credential Requirements

A root account on the source Linux server.

Domain admin credentials for domain-joined Windows VMs.

A local admin account for non-domain-joined VMs.

Prepare Azure resources (Target)

Step1: Create a Storage Account

In the Azure portal, in the left menu, select Create a resource > Storage > Storage account.

Create a Storage Account in your region.

Step2: Create a Recovery Vault

In the Azure portal, select All services. Search for and then select Recovery Services vaults.

In RPO threshold, specify the recovery point objective (RPO) limit. This value specifies how often data recovery points are created. An alert is generated if continuous replication exceeds this limit.

In Recovery point retention, specify how long (in hours) the retention window is for each recovery point. Replicated VMs can be recovered to any point in a window. Up to 24 hours retention is supported for machines replicated to premium storage, and 72 hours for standard storage.

In App-consistent snapshot frequency, specify how often (in minutes) recovery points containing application-consistent snapshots will be created. Click OK to create the policy.
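To make the three policy settings concrete, here is a small sketch that checks a set of values before you type them into the portal. The retention limits (24 hours for premium storage, 72 hours for standard) are the ones quoted above. The class name, field names and the other checks are my own assumptions, not part of any Azure SDK.

```python
from dataclasses import dataclass

# Retention limits quoted in the text above, in hours.
MAX_RETENTION_HOURS = {"premium": 24, "standard": 72}


@dataclass
class ReplicationPolicy:
    """Hypothetical container for the values entered in the policy blade."""
    rpo_threshold_minutes: int
    retention_hours: int
    app_consistent_frequency_minutes: int
    storage_tier: str  # "premium" or "standard"

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the values look sane."""
        problems = []
        limit = MAX_RETENTION_HOURS.get(self.storage_tier)
        if limit is None:
            problems.append(f"unknown storage tier: {self.storage_tier}")
        elif not 0 <= self.retention_hours <= limit:
            problems.append(
                f"retention must be 0-{limit} hours for {self.storage_tier} storage"
            )
        if self.rpo_threshold_minutes <= 0:
            problems.append("RPO threshold must be positive")
        if self.app_consistent_frequency_minutes < 0:
            problems.append("app-consistent frequency cannot be negative")
        return problems

    def rpo_alert(self, replication_lag_minutes: int) -> bool:
        """An alert is raised when replication lag exceeds the RPO threshold."""
        return replication_lag_minutes > self.rpo_threshold_minutes
```

For example, ReplicationPolicy(15, 48, 60, "premium").validate() reports one problem, because 48 hours is beyond the 24-hour limit for premium storage, while the same values on standard storage pass.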

Prepare Source Environment (AWS)

Step6: Prepare Source ASR Configuration Server

Log on to the EC2 instance where you want to install the configuration server.

Configure the proxy on the EC2 instance you’re using as the configuration server so that it can access the service URLs.

In Summary, select Install. Installation Progress shows you information about the installation process. When it’s finished, select Finish. A window displays a message about a reboot. Select OK. Next, a window displays a message about the configuration server connection passphrase. Copy the passphrase to your clipboard and save it somewhere safe.

On the VM, run cspsconfigtool.exe to create one or more management accounts on the configuration server. Make sure that the management accounts have administrator permissions on the EC2 instances that you want to migrate.

Step7: Enable Replication for an AWS EC2 VM

Click Replicate application > Source.

In Source, select the configuration server.

In Machine type, select Physical machines.

Select the process server (the configuration server). Then click OK.

In Target, select the subscription and the resource group in which you want to create the Azure VMs after failover. Choose the deployment model that you want to use in Azure (classic or Resource Manager).

Select the Azure storage account you want to use for replicating data.

Select the Azure network and subnet to which Azure VMs will connect, when they’re created after failover.

Select Configure now for selected machines, to apply the network setting to all machines you select for protection. Select Configure later to select the Azure network per machine.

In Physical Machines, click +Physical machine. Specify the name and IP address. Select the operating system of the machine you want to replicate. It takes a few minutes for the servers to be discovered and listed.

In Properties > Configure properties, select the account that will be used by the process server to automatically install the Mobility service on the machine.

Click Enable Replication. You can track progress of the Enable Protection job in Settings > Jobs > Site Recovery Jobs. After the Finalize Protection job runs, the machine is ready for failover.

Test failover at Azure Portal

Step8: Test a Failover

On the page for your vault, go to Protected items > Replicated Items. Select the VM, and then select Test Failover.

Select a recovery point to use for the failover:

Latest processed: Fails over the VM to the latest recovery point that was processed by Site Recovery. The time stamp is shown. With this option, no time is spent processing data, so it provides a low recovery time objective (RTO).

Latest app-consistent: This option fails over all VMs to the latest app-consistent recovery point. The time stamp is shown.

Custom: Select any recovery point.
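The three options above boil down to a filter over the available recovery points. Here is a minimal sketch of that selection logic. The RecoveryPoint class, its fields and the option strings are assumptions made up for illustration; they are not the Site Recovery API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class RecoveryPoint:
    """Hypothetical model of one recovery point."""
    created: datetime
    processed: bool        # already processed by Site Recovery
    app_consistent: bool   # contains an application-consistent snapshot


def choose_recovery_point(
    points: list[RecoveryPoint],
    option: str,
    custom_time: Optional[datetime] = None,
) -> Optional[RecoveryPoint]:
    """Pick a recovery point the way the three portal options describe."""
    if option == "latest_processed":
        candidates = [p for p in points if p.processed]
    elif option == "latest_app_consistent":
        candidates = [p for p in points if p.app_consistent]
    elif option == "custom":
        candidates = [p for p in points if p.created == custom_time]
    else:
        raise ValueError(f"unknown option: {option}")
    # The newest matching point wins; None when nothing matches.
    return max(candidates, key=lambda p: p.created, default=None)
```

Latest processed gives the lowest RTO because the chosen point needs no further processing, whereas the latest app-consistent point may be older but is safer for applications such as databases.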

In Test Failover, select the target Azure network to which Azure VMs will be connected after failover occurs. This should be the network you created in Prepare Azure resources.

Select OK to begin the failover. To track progress, select the VM to view its properties. Or you can select the Test Failover job on the page for your vault. To do this, select Monitoring and reports > Jobs > Site Recovery jobs.

When the failover finishes, the replica Azure VM appears in the Azure portal. To view the VM, select Virtual Machines. Ensure that the VM is the appropriate size, that it’s connected to the right network, and that it’s running.

You should now be able to connect to the replicated VM in Azure.

To delete Azure VMs that were created during the test failover, select Cleanup test failover in the recovery plan. In Notes, record and save any observations associated with the test failover.

In Failover, select a recovery point to fail over to. Select the latest recovery point.

Select Shut down machine before beginning failover if you want Site Recovery to attempt to shut down the source virtual machines before triggering the failover. Failover continues even if shutdown fails. You can follow the failover progress on the Jobs page.

Ensure that the VM appears in Replicated items.

Right-click each VM, and then select Complete Migration. This finishes the migration process, stops replication for the AWS VM, and stops Site Recovery billing for the VM.

Taking VMware snapshots and Hyper-V checkpoints can put a serious load on VM performance, and it can take considerable effort from sysadmins to overcome this technical challenge and meet the required service level agreement. Most Veeam users run their backup and replication jobs after hours to limit the impact on the production environment, but this can’t be your only backup solution. What if the storage itself goes down or gets corrupted? Even with storage-based replication, you need to take your data out of the single fault domain. This is why many customers prefer to additionally make true backups stored on different storage. Never store production and backup data on the same storage.

Source: Veeam

Now you can take advantage of storage snapshots. Veeam decided to work with storage vendors such as EMC and NetApp to integrate with production storage, leveraging storage snapshot functionality to reduce the impact on the environment from snapshot/checkpoint removal during backup and replication.

Supported Storage

EMC VNX/VNXe

NetApp FAS

NetApp FlexArray (V-Series)

NetApp Data ONTAP Edge VSA

HP 3PAR StoreServ

HP StoreVirtual

HP StoreVirtual VSA

IBM N series

Unsupported Storage

Dell Compellent

NOTE: My own experience with HP StoreVirtual and HP 3PAR has been awful. I had to remove HP StoreVirtual from the production storage and introduce other Fibre Channel storage to cope with the workload. Even though Veeam tested the snapshot mechanism with HP, I would recommend avoiding HP StoreVirtual if you have a high IO workload.

Benefits

Veeam suggests that you can get lower RPOs and lower RTOs with Backup from Storage Snapshots and Veeam Explorer for Storage Snapshots.

Veeam and EMC together allow you to:

Minimize impact on production VMs

Rapidly create backups from EMC VNX or VNXe storage snapshots up to 20 times faster than the competition

Easily recover individual items in two minutes or less, without staging or intermediate steps

As a result of integrating Veeam with EMC, Veeam claims you can back up as much as 20 times faster and restore faster using Veeam Explorer. Hence users can achieve much lower RPOs (recovery point objectives) and lower RTOs (recovery time objectives) with minimal impact on production VMs.

How it works

Veeam Backup & Replication works with EMC and NetApp storage, along with VMware, to create backups and replicas from storage snapshots in the following way.

Source: Veeam

The backup and replication job:

Analyzes which VMs in the job have disks on supported storage.

Triggers a vSphere snapshot for all VMs located on the same storage volume. (As a part of a vSphere snapshot, Veeam’s application-aware processing of each VM is performed normally.)

Triggers a snapshot of said storage volume once all VM snapshots have been created.

Retrieves the CBT information for the VM snapshots created in step 2.

Immediately triggers the removal of the vSphere snapshots on the production VMs.

Mounts the storage snapshot to one of the backup proxies connected to the storage fabric.

Reads new and changed virtual disk data blocks directly from the storage snapshot and transports them to the backup repository or replica VM.

Triggers the removal of the storage snapshot once all VMs have been backed up.
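The ordering of the steps above is what matters, so here is a small sketch that only records the sequence of operations for one storage volume. It does not call any Veeam or vSphere API. The function name and the log strings are hypothetical, and it assumes the VM list has already been filtered to VMs on supported storage (step 1).

```python
def backup_from_storage_snapshot(vms_on_volume: list[str], volume: str) -> list[str]:
    """Return the ordered list of operations for one storage volume (illustration only)."""
    log = []
    # Step 2: vSphere snapshot for every VM on the same volume.
    for vm in vms_on_volume:
        log.append(f"create vm snapshot: {vm}")
    # Step 3: one storage snapshot once all VM snapshots exist.
    log.append(f"create storage snapshot: {volume}")
    # Step 4: collect changed block tracking (CBT) information.
    for vm in vms_on_volume:
        log.append(f"read CBT: {vm}")
    # Step 5: remove the VM snapshots straight away, before any data is copied.
    for vm in vms_on_volume:
        log.append(f"remove vm snapshot: {vm}")
    # Step 6: mount the storage snapshot on a backup proxy.
    log.append(f"mount storage snapshot on proxy: {volume}")
    # Step 7: copy new and changed blocks from the storage snapshot.
    for vm in vms_on_volume:
        log.append(f"copy changed blocks: {vm}")
    # Step 8: remove the storage snapshot when every VM is done.
    log.append(f"remove storage snapshot: {volume}")
    return log


if __name__ == "__main__":
    for line in backup_from_storage_snapshot(["vm1", "vm2"], "vol1"):
        print(line)
```

Notice that every "remove vm snapshot" entry appears before any "copy changed blocks" entry. That is the whole point of the technique: the slow data copy happens against the storage snapshot, not against a VM that is still running off a snapshot.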

VMs run off snapshots for the shortest possible time (subject to the storage array; EMC works better), while jobs obtain data from VM snapshot files preserved in the storage snapshot. As a result, VM snapshots do not get a chance to grow large and can be committed very quickly without overloading production storage with an extended merge procedure, as is the case with classic techniques for backing up from VM snapshots.

Integration with EMC storage will bring great benefit to customers who want to take advantage of their storage array. Veeam Availability Suite v9 gives you the chance to reduce IO on your storage array and bring your SLA under control.

A business continuity plan in information technology is a documented plan indicating how a business will continue to operate if IT operations are affected by adverse conditions, such as a storm, fire, interruptions or malicious damage. Such a plan typically explains how the business would operate during a disaster and recover from it.

In December 2006, the British Standards Institution (BSI) released an independent standard for BCP — BS 25999-1. Prior to the introduction of BS 25999, BCP professionals relied on information security standard BS 7799, which only peripherally addressed BCP to improve an organization’s information security procedures. BS 25999’s applicability extends to all organizations. In 2007, the BSI published BS 25999-2 “Specification for Business Continuity Management”, which specifies requirements for implementing, operating and improving a documented business continuity management system (BCMS).

Which One Do You Need? Business Continuity or Disaster Recovery?

If you ask me, I would prefer a business continuity plan that includes disaster recovery with smooth failover and failback options, plus service continuity procedures that keep things running as if the disaster never happened.

Most organizations presume that because they have a backup product such as Symantec, CommVault or Veeam protecting them from disaster, they have a disaster recovery plan. This is not the case. Having a “Disaster Recovery Plan” or a “Business Continuity Plan” means more than owning a backup product and presuming you have it all.

Note! Disaster recovery is just one part of business continuity. My previous post on disaster recovery plans differentiates between disaster recovery and business continuity.

Objectives:

To ensure maximum possible service levels are maintained

To ensure a smooth recovery from interruptions as quickly as possible

To minimize the likelihood and impact (risk) of interruptions

To minimize IT/IS service desk intervention with end users in the event of a disaster

Identifying Risk (Example):

Create a spreadsheet or a database of business applications, systems, networks and other assets that are likely to be impacted by an event. Here is a sample spreadsheet.

Disaster recovery (DR) involves a set of policies and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. Disaster recovery focuses on the IT or technology systems supporting critical business functions, as opposed to business continuity, which involves keeping all essential aspects of a business functioning despite significant disruptive events. Disaster recovery is therefore a subset of business continuity. Wiki reference

A disaster recovery plan (DRP) is a documented process or set of procedures to recover and protect a business’s IT infrastructure in the event of a disaster. Such a plan, ordinarily documented in written form, specifies procedures an organization is to follow in the event of a disaster.

Given organizations’ increasing dependency on information technology to run their operations, a disaster recovery plan, sometimes erroneously called a continuity of operations plan (COOP), is increasingly associated with the recovery of information technology data, assets, and facilities.

Disaster Recovery vs. Business Continuity – Are You Mixing Them Up?

Business continuity is different from disaster recovery, but the two are linked. Business continuity is a practice enterprises adopt to protect the business from complete failure while it waits for recovery. By adopting business continuity, you can be assured that your business will continue to run in the event of a disaster until all systems are recovered.

There are many, many businesses that fail after a disaster without a BC plan. DR will get your hardware, software and apps back up and running, but without a business continuity plan to keep your company going during the recovery process, you might not have a reason to recover those items. BC involves your finances, your personnel, your emergency plans and everything else that is a necessity to keep going and serving.

An example of business continuity: all your corporate inbound and outbound email flows through a third-party cloud-based smart host that stores your email for 15 to 30 days while still delivering it to your organization straight away. In the event of a disaster, you can still send and receive email from any device that has internet connectivity. Once systems are restored, the cloud-based smart host syncs with the on-premises Exchange Server.

Disaster Recovery Terminology in Alphabetical Order

Alert – Notification that a potential disruption is imminent or has occurred; usually includes a directive to act or standby.

Application Recovery – The component of Disaster Recovery that deals specifically with the restoration of business system software and data after the processing platform has been restored or replaced.

DR Site – A site held in readiness for use during/following an invocation of business or disaster recovery plans to continue urgent and important activities of an organization.

Backlog – The amount of work that accumulates when a system or process is unavailable for a long period of time. This work needs to be processed once the system or process is available and may take a considerable amount of time to process.
A situation whereby a backlog of work requires more time to action than is available through normal working patterns. In extreme circumstances, the backlog may become so marked that the backlog cannot be cleared.

Backup – A process by which data (electronic or paper-based) and programs are copied in some form so as to be available and used if the original data from which it originated is lost, destroyed or corrupted.

Business Continuity – The strategic and tactical capability of the organization to plan for and respond to incidents and business disruptions in order to continue business operations at an acceptable predefined level.

Checklist – A tool to remind and/or validate that tasks have been completed and resources are available, and to report on the status of recovery. A list of items (names, tasks, etc.) to be checked or consulted.

Contingency Plan – An event-specific preparation that is executed to protect an organization from certain and specific identified risks and/or threats.

Continuous Availability – A system or application that supports operations which continue with little to no noticeable impact to the user. For instance, with continuous availability, the user will not have to log in again or resubmit a partial or whole transaction.

Data Backups – The copying of production files to media that can be stored onsite and/or offsite and can be used to restore corrupted or lost data or to recover entire systems and databases in the event of a disaster.

Data Center Recovery – The component of Disaster Recovery which deals with the restoration of data center services and computer processing capabilities at an alternate location and the migration back to the production site.

Data Recovery – The restoration of computer files from backup media to restore programs and production data to the state that existed at the time of the last safe backup.

Database Replication – The partial or full duplication of data from a source database to one or more destination databases.

Disaster – Situation where widespread human, material, economic or environmental losses have occurred which exceeded the ability of the affected organization, community or society to respond and recover using its own resources. Source: ISO 2.1.11

Disaster Recovery – The process, policies and procedures related to preparing for recovery or continuation of technology infrastructure, systems and applications which are vital to an organization after a disaster or outage.

Hot Site – An alternate facility that already has in place the computer, telecommunications, and environmental infrastructure required to recover critical business functions or information systems.

Impact – The effect, acceptable or unacceptable, of an event on an organization. The types of business impact are usually described as financial and non-financial and are further divided into specific types of impact.

Incident – An event which is not part of standard business operations which may impact or interrupt services and, in some cases, may lead to disaster.

Off-Site Storage – Any place physically located a significant distance away from the primary site, where duplicated and vital records (hard copy or electronic and/or equipment) may be stored for use during recovery.

Outage – The interruption of automated processing systems, infrastructure, support services, or essential business operations, which may result in the organization’s inability to provide services for some period of time.

Recovery – Implementing the prioritized actions required to return the processes and support functions to operational stability following an interruption or disaster.

Replication – Copying point-in-time structured or unstructured data between sites.

Risk – Potential for exposure to loss which can be determined by using either qualitative or quantitative measures.

Recovery Point Objective – A recovery point objective, or “RPO”, is defined by business continuity planning. It is the maximum tolerable period in which data might be lost from an IT service due to a major incident. The RPO gives systems designers a limit to work to. Wiki Reference

Recovery Time Objective – The recovery time objective (RTO) is the targeted duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity. Wiki Reference
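RPO and RTO are easier to reason about with actual timestamps. Here is a minimal sketch, assuming you know when the last usable recovery point was taken, when the incident happened and when service was restored. The function names are my own.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point: datetime, incident: datetime) -> timedelta:
    """Data written after the last recovery point is lost; this is measured against the RPO."""
    return incident - last_recovery_point


def downtime(incident: datetime, restored: datetime) -> timedelta:
    """Time the service was unavailable; this is measured against the RTO."""
    return restored - incident


def meets_objectives(
    last_recovery_point: datetime,
    incident: datetime,
    restored: datetime,
    rpo: timedelta,
    rto: timedelta,
) -> dict[str, bool]:
    """Compare one incident against the agreed RPO and RTO."""
    return {
        "rpo_met": data_loss_window(last_recovery_point, incident) <= rpo,
        "rto_met": downtime(incident, restored) <= rto,
    }
```

As an example, with a recovery point at 09:45, an incident at 10:00 and service restored at 14:30, a 30-minute RPO is met (15 minutes of data lost) but a 4-hour RTO is missed (4.5 hours of downtime).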

Service Level Agreement (SLA) – A formal agreement between a service provider (whether internal or external) and their client (whether internal or external), which covers the nature, quality, availability, scope and response of the service provider. The SLA should cover day-to-day situations and disaster situations, as the need for the service may vary in a disaster.

System Recovery – The procedures for rebuilding a computer system and network to the condition where it is ready to accept data and applications, and facilitate network communications.

Validation Script – A set of procedures within the Business Continuity Plan to validate the proper function of a system or process before returning it to production operation.

Workaround Procedures – Alternative procedures that may be used by a functional unit(s) to enable it to continue to perform its critical functions during temporary unavailability of specific application systems, electronic or hard copy data, voice or data communication systems, specialized equipment, office facilities, personnel, or external services.

Developing a DR Strategy

Regarding disaster recovery strategies, ISO/IEC 27031, the global standard for IT disaster recovery, states, “Strategies should define the approaches to implement the required resilience so that the principles of incident prevention, detection, response, recovery and restoration are put in place.” Strategies define what you plan to do when responding to an incident, while plans describe how you will do it.

DR Objectives

Reduce Overall Risk

Maintain and Test Your Disaster Recovery Plan

Alleviate Owner/Investor Concerns

Restore Day-To-Day Operations

Comply With Regulations

Rapid Response

Priority Matrix

Priority      Severity       Impact
Priority 1    Highest        High
Priority 2    Medium High    Medium
Priority 3    Medium         Medium
Priority 4    Medium Low     Low
Priority 5    Low            Low
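If you keep your asset list in a script or a database rather than a spreadsheet, the priority matrix above can be expressed as a simple lookup. This is only a sketch: the severity and impact labels come from the matrix, while the asset names in the usage note and the triage function are hypothetical.

```python
# Priority number -> (severity, impact), as listed in the matrix above.
PRIORITY_MATRIX = {
    1: ("Highest", "High"),
    2: ("Medium High", "Medium"),
    3: ("Medium", "Medium"),
    4: ("Medium Low", "Low"),
    5: ("Low", "Low"),
}


def triage(assets: dict[str, int]) -> list[str]:
    """Order asset names by priority (1 first); ties are sorted by name."""
    for name, priority in assets.items():
        if priority not in PRIORITY_MATRIX:
            raise ValueError(f"{name}: unknown priority {priority}")
    return sorted(assets, key=lambda name: (assets[name], name))
```

Calling triage({"file-server": 3, "email": 1, "intranet": 5}) returns the recovery order email, file-server, intranet.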

Who and What Are Involved in Disaster Recovery

People

Physical facilities

Technology

Data (Structured & Unstructured)

Third Party Vendor or Suppliers

IT Governance (Policies & Procedures)

Producing a DR Document

A DR document consists of the following sections:

Title

Sub-Title

Corporate Logo

Document History

Corporate Copyright Info

Table of Contents

Executive Summary

Introduction

Terminology

Roles and Responsibilities

Third Party

Technologies

Site Diagram

Incident Response

Plan Activation

Procedures

Appendixes

In conclusion, once your DR plans have been completed, they are ready to be implemented. This process will determine whether the business can recover and restore IT assets as planned. Remember, this is not about the IT department. It is about a business that wants to comply and understands the importance of disaster recovery. You will only succeed if your business is willing to participate and invest CAPEX and OPEX in disaster recovery.