Designing Robust Systems

This document describes best practices for designing robust systems
on Compute Engine. It provides general advice and covers some features
in Compute Engine that can help mitigate instance downtime and prepare
for times when your virtual machine (VM) instances fail unexpectedly.

A robust system is a system that can withstand a certain number of failures
or disruptions without interrupting your service or affecting your users'
experience using your service. While Compute Engine makes every
effort to prevent such disruptions, certain events are unpredictable and it is
best to be prepared for these events.

Understanding types of failures

At some point, one or more of your VM instances might be lost due to system or
hardware failures. Some of the failures include but are not limited to:

Unexpected single instance failure

Unexpected single instance failures can be due to hardware or system failure.
To mitigate these events, use
persistent disks and
startup scripts to save your data and re-enable software after
you restart the instance.

Unexpected single instance reboot

At some point, you will experience an unexpected single instance
failure and reboot. Unlike the failures described above, in this case the
Compute Engine service automatically reboots your instance after it fails. To
help mitigate these events, back up your data, use
persistent disks, and use
startup scripts to quickly re-configure software.

Zone or region failures

Zone and region failures are rare failures that can cause all of your
instances in a given zone or region to be inaccessible or fail.

Tips for designing robust systems

To help mitigate instance failures, you should design your application on the
Google Compute Engine service to be robust against failures, network
interruptions, and unexpected disasters. A robust system should be able to
gracefully handle failures, including redirecting traffic from a downed
instance to a live instance or automating tasks on reboot.

Here are some general tips to help you design a system that is robust against failures.

Use live migration

Google periodically performs maintenance on its infrastructure by
patching systems with the latest software, performing routine tests and
preventative maintenance, and generally ensuring that its infrastructure is as
secure, fast, and efficient as possible. Compute Engine employs
live migration to ensure that this infrastructure maintenance is transparent
by default to your virtual machine instances.

Live migration is a technology that Google has built to move your running
instances away from systems that are about to undergo maintenance work.
Compute Engine does this automatically.

During live migration, your instance might experience a decrease in
performance for a short period of time. You also have the option to configure
your virtual machine instances to terminate and reboot away from the maintenance
event. This option is suitable for instances that demand constant,
maximum performance, and when your overall application is built to handle
instance failures or reboots.
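
The terminate-and-reboot option described above is part of an instance's
scheduling settings. As a minimal sketch, assuming an instance named
example-instance in zone us-central1-a (both placeholders), the following
script switches the maintenance behavior to terminate and keeps automatic
restart enabled. The run helper is a hypothetical convenience and not part of
gcloud; it prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/bash
# Hypothetical helper: print the command by default; run it only if DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Placeholder names; replace them with your own instance and zone.
INSTANCE="example-instance"
ZONE="us-central1-a"

# Terminate the instance on host maintenance (instead of live migrating it)
# and let Compute Engine restart it automatically afterwards.
set_terminate_on_maintenance() {
  run gcloud compute instances set-scheduling "$1" \
    --zone="$2" \
    --maintenance-policy=TERMINATE \
    --restart-on-failure
}

set_terminate_on_maintenance "$INSTANCE" "$ZONE"
```

Passing --maintenance-policy=MIGRATE instead would return the instance to the
default live migration behavior.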

For more information about live migration, see the
Live Migration documentation.

Distribute your instances

Create instances across more than one region and zone so that you have
alternative virtual machine instances to point to if a zone or region containing
one of your instances is disrupted. If you host all your instances in the same
zone or region, you will not be able to access any of these instances if that
zone or region becomes unreachable.
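
As an illustration of spreading instances out, the following sketch creates
one instance per zone across two regions. The zone list, the web name prefix,
and the run helper are assumptions made for this example; by default the
script only prints the commands it would run.

```shell
#!/bin/bash
# Hypothetical helper: print the command by default; run it only if DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Example zones in two different regions; adjust to where your users are.
ZONES=(us-central1-a us-central1-b europe-west1-b)

# Create one instance per zone, named <prefix>-<zone>.
create_distributed_instances() {
  local prefix="$1"
  shift
  local zone
  for zone in "$@"; do
    run gcloud compute instances create "${prefix}-${zone}" --zone="$zone"
  done
}

create_distributed_instances web "${ZONES[@]}"
```

With this layout, losing us-central1-a or even the whole us-central1 region
still leaves at least one instance reachable.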

Create groups of instances

Use managed instance groups
to create homogeneous groups of instances so that load balancers can direct
traffic to more than one VM instance in case a single VM becomes unhealthy.

Managed instance groups also offer features like autoscaling
and
autohealing.
Autoscaling lets you deal with spikes in traffic by scaling the number of VM
instances up or down based on specific signals, while autohealing performs
health checking and, if necessary, automatically recreates unhealthy instances.
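
The following sketch shows one way to combine these features: it creates a
regional managed instance group with autohealing and then enables CPU-based
autoscaling. It assumes that an instance template (example-template) and a
health check (example-health-check) already exist; those names, the sizes,
and the run helper are placeholders for illustration. By default the script
only prints the commands.

```shell
#!/bin/bash
# Hypothetical helper: print the command by default; run it only if DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Arguments: group name, region, instance template, health check.
create_autoscaled_group() {
  local group="$1" region="$2" template="$3" health_check="$4"

  # Autohealing: recreate instances that fail the health check, but wait
  # 300 seconds after boot before acting on health check results.
  run gcloud compute instance-groups managed create "$group" \
    --region="$region" \
    --template="$template" \
    --size=3 \
    --health-check="$health_check" \
    --initial-delay=300

  # Autoscaling: keep between 3 and 10 instances, targeting 60% CPU use.
  run gcloud compute instance-groups managed set-autoscaling "$group" \
    --region="$region" \
    --min-num-replicas=3 \
    --max-num-replicas=10 \
    --target-cpu-utilization=0.6
}

create_autoscaled_group example-group us-central1 example-template example-health-check
```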

Use load balancing

Google Compute Engine offers a load balancing service that helps you support
periods of heavy traffic so that you don't overload your instances. With the
load balancing service, you can:

Deploy your application on instances within multiple zones using
regional managed instance groups.
Then, you can configure a
forwarding rule that
can spread traffic across all virtual machine instances in all zones within the
region. Each forwarding rule can define one entry point to your application
using an external IP address.

Deploy instances across multiple regions.
Cross-regional load balancing
provides redundancy so that if a region is unreachable, traffic will
automatically be diverted to another region so that your service remains
reachable using the same external IP address.

The load balancing service also offers VM health checking, which helps
detect and handle instance failures.

Use startup and shutdown scripts

Compute Engine offers startup and shutdown scripts that run when
an instance boots up or shuts down, respectively. These scripts can automate
tasks like installing software, running updates, making backups, logging
data, and so on, when your instance first starts up or when your instance
is shut down, either intentionally or not.

Both startup and shutdown scripts are an efficient and invaluable way to
bootstrap or cleanly shut down your instances. Instead of configuring your
instances using custom OS images, it can be beneficial to configure instances
using startup scripts. Startup scripts run whenever the instance is rebooted or
restarted due to failures, and can be used to install software and updates, and
to ensure that services are running within the VM. Coding the changes to
configure an instance in a startup script is easier than trying to figure out
what files or bytes have changed on a custom image.

Shutdown scripts can perform last-minute tasks like backing up data, saving
logs, and gracefully terminating connections before you stop an instance.
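
Because a startup script runs on every boot, it helps to write it so that
repeated runs are harmless. The following is a minimal sketch of that idea,
not a prescribed layout: one-time setup is guarded by a marker file, so a
reboot after a failure skips it. The state directory /var/lib/example-app and
the function name provision_once are hypothetical.

```shell
#!/bin/bash
# Sketch of an idempotent startup script. STATE_DIR is a hypothetical
# location for the marker file; override it to test the script elsewhere.
STATE_DIR="${STATE_DIR:-/var/lib/example-app}"

# Run one-time setup only when the marker file is missing.
provision_once() {
  local marker="$STATE_DIR/provisioned"
  if [ -f "$marker" ]; then
    echo "already provisioned"
    return 0
  fi
  mkdir -p "$STATE_DIR" || return 1
  # One-time setup goes here, for example installing packages.
  date -u +%Y-%m-%dT%H:%M:%SZ > "$marker" || return 1
  echo "provisioned"
}

provision_once || echo "provisioning failed" >&2

# Tasks that must happen on every boot, such as making sure services are
# running, would go below this line.
```

A script like this can be attached to an existing instance with
gcloud compute instances add-metadata and the
--metadata-from-file=startup-script=FILE flag.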

Back up your data

Back up your data regularly and in multiple locations. You can back up your
files to Google Cloud Storage,
create persistent disk snapshots, or
replicate your data to a persistent disk in another region or zone.
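
As an example of the snapshot option, the following sketch creates a dated
snapshot of a persistent disk. The disk name example-disk, the zone, the
naming scheme, and the run helper are assumptions for illustration; by
default the script only prints the command.

```shell
#!/bin/bash
# Hypothetical helper: print the command by default; run it only if DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Arguments: disk name, zone, and a suffix that makes the snapshot name unique.
snapshot_disk() {
  local disk="$1" zone="$2" stamp="$3"
  run gcloud compute disks snapshot "$disk" \
    --zone="$zone" \
    --snapshot-names="${disk}-${stamp}"
}

# Use the current UTC date as the suffix, for example example-disk-20240101.
snapshot_disk example-disk us-central1-a "$(date -u +%Y%m%d)"
```

Running a script like this on a schedule gives you regular, dated restore
points for the disk.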

To copy files from an instance to Google Cloud Storage:

Log in to your instance:

gcloud compute ssh example-instance

If you have never used the gsutil tool on this instance, set up your
credentials.

gcloud init

Alternatively, if you have set up your instance to use a
service account with a Google Cloud
Storage scope, you can skip this and the next step.

Follow the instructions to authenticate to Google Cloud Storage.

Copy your data to Google Cloud Storage by using the following command: