Scenarios for Restoring a Windows HPC Server 2008 R2 Cluster

Updated: July 29, 2016

Applies To: Microsoft HPC Pack 2008 R2

This section provides an overview of the recommended steps to restore a Windows® HPC Server 2008 R2 cluster. Because of the variety of cluster deployment options, including options to configure the head node for high availability in a failover cluster and to use remote servers running Microsoft SQL Server to store the HPC databases, you should use the restoration steps that are appropriate for your cluster configuration. The restoration steps include links to separate topics in this guide that contain detailed recovery procedures.

Some recovery scenarios allow you to restore only the cluster configuration data, not the data that is stored in the HPC databases. Recovered configuration data can often allow critical cluster operations to resume, but in these scenarios the backed-up HPC database data cannot be used in the recovered cluster.

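This section describes high-level recovery scenarios for two cluster configurations: a cluster that contains a single head node, and a cluster in which the head node is configured for high availability in a failover cluster.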

The following are general recovery steps for several failure scenarios in a Windows HPC Server 2008 R2 cluster that contains a single head node.

Recover the entire Windows HPC Server cluster

Follow these general steps to recover an entire Windows HPC cluster, in the case where an entire site becomes unavailable.

In another site, on a computer that meets the system requirements for the head node of the cluster, perform a clean installation of HPC Pack 2008 R2. For more information, see Deploy the Head Node in the Design and Deployment Guide for Windows HPC Server 2008 R2.

Recover the cluster configuration settings. The exact settings depend on the settings that were previously backed up and the backup method, but they can include cluster data that is stored in shared folders, node templates, and job templates, in addition to custom application programs and service DLLs. For more information, see Recover the Windows HPC Cluster Configuration Settings in this guide.

Redeploy the compute nodes and broker nodes in your cluster by using an appropriate deployment method.

Recover a failed head node computer

Follow these general steps in the case of an unrecoverable hardware failure on the head node computer. The existing HPC databases can be on the head node or on a remote server running SQL Server.

On a new computer that meets the system requirements for the head node of the cluster, perform a clean installation of HPC Pack 2008 R2. For more information, see Deploy the Head Node in the Design and Deployment Guide for Windows HPC Server 2008 R2.

Recover the cluster configuration settings. The exact settings depend on the settings that were previously backed up and the backup method, but they can include cluster data that is stored in shared folders, node templates, and job templates, in addition to custom application programs and service DLLs. For more information, see Recover the Windows HPC Cluster Configuration Settings in this guide.

Perform a full-system restore of the head node

If Windows HPC Server 2008 R2 files or SQL Server databases are corrupt on the head node, use Windows Server Backup to perform a full-system restore.

Important

You can perform a full-system restore only from backups that you created by using Windows Server Backup on the same computer. For more information, see Windows Server Backup.

Recover a failed remote database server

Follow these general steps in the case of a hardware failure on the remote server running SQL Server on which the HPC databases are installed. In this scenario, the head node of the cluster is assumed to be functioning properly.

On a computer that meets the system requirements for SQL Server, install SQL Server 2008 SP1 or later. For more information, consult the documentation for your version of SQL Server.

Restore the HPC databases in SQL Server. The exact steps for restoring the databases depend on the backup method that you used, and the location where you saved the backups. For more information, consult the documentation for the backup solution that you used.

For example, if you used SQL Server Management Studio to create a backup, you can right-click each database in SQL Server Management Studio, then click Restore to start the database restore process.

Stop the following services on the head node of the Windows HPC Server 2008 R2 cluster: hpcscheduler, hpcmanagement, hpcreporting, hpcsdm, hpcdiagnostics, and hpcdsc.

Note

The hpcdsc service is installed only in Windows HPC Server 2008 R2 or later.

If you previously deployed Windows Azure nodes and the state of the nodes has changed since you backed up the HPC databases, use the Windows Azure Management Portal to stop the deployment in the Windows Azure hosted service.

Restart or, if necessary, redeploy the compute nodes and broker nodes in your cluster by using an appropriate deployment method.

If any Windows Azure nodes are online and have a health state of Error, manually stop those nodes in HPC Cluster Manager.

Warning

Ensure that you have already stopped the deployment in the Windows Azure hosted service, as outlined in a previous step. If you do not stop the deployment first in the Windows Azure Management Portal, you will be unable to stop the Windows Azure nodes by using HPC Cluster Manager.

Note

If the nodes are deployed by using a node template that includes a policy to start and stop the nodes automatically, you should first edit the node template to configure a policy to start and stop the Windows Azure nodes manually. Then stop the Windows Azure nodes.

After the Windows HPC cluster reaches a stable state, you can restart the Windows Azure nodes.
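The step earlier in this procedure that stops the HPC services on the head node can also be scripted. The following is a minimal sketch, not an official tool. It assumes that you run it in an elevated session on the head node and that the service names match the ones listed in that step. It builds one sc.exe stop command per service and, by default, only prints the commands (the dry_run parameter and the function names are illustrative).

```python
import subprocess

# Service names as listed in the stop-services step of this procedure.
HPC_SERVICES = [
    "hpcscheduler", "hpcmanagement", "hpcreporting",
    "hpcsdm", "hpcdiagnostics", "hpcdsc",
]


def stop_commands(services):
    """Build one 'sc.exe stop' command (as an argument list) per service."""
    return [["sc.exe", "stop", name] for name in services]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=False: sc.exe returns a nonzero code if the service
            # is already stopped, which should not abort the loop.
            subprocess.run(cmd, check=False)


if __name__ == "__main__":
    run_all(stop_commands(HPC_SERVICES))
```

Review the printed commands first, and call run_all with dry_run=False only when you are ready to stop the services.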

Restore SQL Server databases on a remote server

Follow these general steps in the case where an HPC database fails or becomes corrupt on a remote server running SQL Server.

Restore the HPC databases in SQL Server. The exact steps for restoring the databases depend on the backup method that you used, and the location where you saved the backups. For more information, consult the documentation for the backup solution that you used.

For example, if you used SQL Server Management Studio to create a backup, you can right-click each database in SQL Server Management Studio, then click Restore to start the database restore process.
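If you prefer to script the restore instead of using the SQL Server Management Studio user interface, the following sketch generates one Transact-SQL RESTORE DATABASE statement per HPC database. The database names and the backup folder are assumptions for illustration only. Confirm the actual database names on your server and the location of your backup files before you run any generated statement.

```python
from pathlib import PureWindowsPath

# Assumed database names and backup folder; adjust to match your deployment.
HPC_DATABASES = ["HPCManagement", "HPCScheduler", "HPCReporting", "HPCDiagnostics"]
BACKUP_DIR = PureWindowsPath(r"D:\HpcBackups")  # hypothetical location


def restore_statement(database, backup_dir=BACKUP_DIR):
    """Return a T-SQL statement that restores one database from a full backup file."""
    backup_file = backup_dir / f"{database}.bak"
    # WITH REPLACE overwrites an existing database that has the same name.
    return (
        f"RESTORE DATABASE [{database}] "
        f"FROM DISK = N'{backup_file}' WITH REPLACE;"
    )


if __name__ == "__main__":
    for name in HPC_DATABASES:
        print(restore_statement(name))
```

The script only prints the statements. You can review them and then run them in a query window or with a command-line SQL client of your choice.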

The following are general recovery steps for several failure scenarios in a Windows HPC Server 2008 R2 cluster that contains a head node configured for high availability in the context of a failover cluster.

Recover the entire Windows HPC Server cluster

Follow these general steps to recover an entire Windows HPC cluster that is configured for high availability of the head node in a failover cluster, in the case where an entire site becomes unavailable. You should also follow these steps if you need to recover the resource groups for the failover cluster.

On computers that meet the system requirements for a high availability configuration of the head node, perform a clean installation of Windows HPC Server 2008 R2 where the head node is configured in a failover cluster. Depending on your requirements, you can choose to install SQL Server for the HPC cluster on the same servers as the head node or on one or more remote servers that are running SQL Server. For more information, see Configuring Windows HPC Server 2008 R2 for High Availability of the Head Node.

Recover the cluster configuration settings on the active head node. The exact settings depend on the settings that were previously backed up, but they can include cluster data that is stored in shared folders, in addition to custom application programs and service DLLs. For more information, see Recover the Windows HPC Cluster Configuration Settings in this guide.

Stop and disable the hpcmanagement and hpcreporting services on both head nodes, and take offline the four HPC services in the resource group for the failover cluster.

Important

If you do not stop the HPC services on both head nodes before restoring the databases, database inconsistencies will be reported during the restore operation. You will then need to begin the restoration steps again.

If you previously deployed Windows Azure nodes and the state of the nodes has changed since you backed up the HPC databases, use the Windows Azure Management Portal to stop the deployment in the Windows Azure hosted service.

Restore the HPC databases.

On the first head node on which you will enable and start the HPC services, configure the HPC Job Scheduler service for restore mode.

Enable and start the HPC services on the head node.

Restart or, if necessary, redeploy the compute nodes and broker nodes in your cluster by using an appropriate deployment method.

If any Windows Azure nodes are online and have a health state of Error, manually stop those nodes in HPC Cluster Manager.

Warning

Ensure that you have already stopped the deployment in the Windows Azure hosted service, as outlined in a previous step. If you do not stop the deployment first in the Windows Azure Management Portal, you will be unable to stop the Windows Azure nodes by using HPC Cluster Manager.

Note

If the nodes are deployed by using a node template that includes a policy to start and stop the nodes automatically, you should first edit the node template to configure a policy to start and stop the Windows Azure nodes manually. Then stop the Windows Azure nodes.

After the Windows HPC cluster reaches a stable state, you can restart the Windows Azure nodes.
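The step earlier in this procedure that stops and disables the hpcmanagement and hpcreporting services on both head nodes can be sketched as follows. The head node names are hypothetical placeholders, and the sketch only builds the sc.exe command lines. It does not run them, and it does not take the clustered HPC services in the resource group offline, which you still need to do by using your failover cluster management tools.

```python
# Hypothetical head node names; replace with the names of your two head nodes.
HEAD_NODES = ["HEADNODE1", "HEADNODE2"]
SERVICES = ["hpcmanagement", "hpcreporting"]


def stop_and_disable_commands(nodes, services):
    """Return sc.exe argument lists that stop, then disable, each service on each node."""
    commands = []
    for node in nodes:
        # sc.exe accepts a \\ServerName prefix to target a remote computer.
        target = rf"\\{node}"
        for service in services:
            commands.append(["sc.exe", target, "stop", service])
            # For 'sc.exe config', "start=" and its value are separate arguments.
            commands.append(
                ["sc.exe", target, "config", service, "start=", "disabled"]
            )
    return commands


if __name__ == "__main__":
    for cmd in stop_and_disable_commands(HEAD_NODES, SERVICES):
        print(" ".join(cmd))
```

To re-enable the services later, the same pattern applies with a start type such as auto in place of disabled, followed by an sc.exe start command.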

Replace a single failed head node in the failover cluster

If a single head node that is configured in a failover cluster no longer functions properly because of a hardware or software failure, the cluster still functions properly by using the remaining head node. However, the failed server needs to be replaced to restore the high availability configuration of the head node. For procedures to evict the failed head node server from the failover cluster, prepare and add the new server to the failover cluster, and install HPC Pack 2008 R2 on the new server, see Replacing a Head Node Configured in a Failover Cluster in Windows HPC Server 2008 R2.

Recover a failed remote database server that is part of a SQL Server failover cluster

If a hardware or database failure occurs on a remote server running SQL Server that is configured as a failover cluster, you can recover the failed server. In this scenario, the high availability head nodes of the cluster are otherwise assumed to be functioning properly. To recover a server that is running SQL Server, consult the documentation for your edition of SQL Server.
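As a quick reference, the scenarios in this section can be summarized as a lookup from cluster configuration and failure type to the scenario to follow. The following sketch is only an illustration. The scenario titles come from this section, but the key names (such as "single", "ha", and "remote_db_server") and the function name are invented for the example.

```python
# Scenario titles are taken from this section; the lookup keys are illustrative.
SCENARIOS = {
    ("single", "site"): "Recover the entire Windows HPC Server cluster",
    ("single", "head_node_hardware"): "Recover a failed head node computer",
    ("single", "head_node_corruption"): "Perform a full-system restore of the head node",
    ("single", "remote_db_server"): "Recover a failed remote database server",
    ("single", "remote_db_corruption"): "Restore SQL Server databases on a remote server",
    ("ha", "site"): "Recover the entire Windows HPC Server cluster",
    ("ha", "head_node_failure"): "Replace a single failed head node in the failover cluster",
    ("ha", "remote_db_server"): (
        "Recover a failed remote database server that is part of "
        "a SQL Server failover cluster"
    ),
}


def pick_scenario(head_node_config, failure):
    """Return the title of the scenario that matches the configuration and failure."""
    try:
        return SCENARIOS[(head_node_config, failure)]
    except KeyError:
        raise ValueError(
            f"No scenario listed for {head_node_config!r} / {failure!r}"
        ) from None


if __name__ == "__main__":
    print(pick_scenario("single", "remote_db_server"))
```

Note that the two "site" entries share a title but refer to different procedures: one for a single head node and one for a head node in a failover cluster.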