Time To First Byte (TTFB): A good way to measure the time to first HTTP
response is to issue a curl command to the web server repeatedly and record
how long each response takes.
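
For example, a minimal sketch of such a measurement loop, using curl's
built-in timing variables (the URL is a placeholder; substitute your own
server):

    # Issue 500 requests and print the time to first byte for each.
    # Each iteration opens a fresh TCP connection, so every request
    # pays the full handshake cost.
    for i in $(seq 1 500); do
      curl -s -o /dev/null -w '%{time_starttransfer}\n' "http://example.com/"
    done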

When comparing results, be aware that latency on fiber links is
constrained mainly by the distance and the speed of light in fiber,
which is roughly 200,000 km/s (or 124,274 miles/s).

The distance between Frankfurt, Germany, and Council Bluffs, Iowa, which is the
location of the us-central1 region, is roughly 7,500 km. With perfectly
straight fiber between the locations, round-trip latency would be:

7,500 km * 2 / 200,000 km/s * 1000 ms/s = 75 milliseconds (ms)
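
You can reproduce this calculation for any distance; a quick sketch using
awk, with the Frankfurt-Iowa distance from above:

    # Ideal round-trip time in fiber: distance (km) * 2 / 200,000 km/s,
    # converted to milliseconds. Prints "75 ms" for d=7500.
    awk -v d=7500 'BEGIN { printf "%.0f ms\n", 2 * d / 200000 * 1000 }'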

In reality, fiber optic cable doesn't follow an ideal path between the user and
the data center, and light on the fiber cable passes through active and passive
equipment along its path. An observed latency of approximately 1.5 times the
ideal, or 112.5 ms, would indicate a near-ideal configuration.

Comparing latency

This section compares load balancing in the following configurations:

No load balancing

Network Load Balancing

HTTP(S) Load Balancing or TCP/SSL Proxy

In this scenario, the application consists of a regional managed instance group
of HTTP web servers. Because the application relies on low-latency calls to a
central database, the web servers must be hosted in one location. The
application is deployed in the us-central1 region, and users are distributed
across the globe. The latency that the user in Germany observes in this scenario
illustrates what users worldwide might experience.

Latency scenario diagram

No load balancing

When a user makes an HTTP request and no load balancing is configured, the
traffic flows directly from the user's network to the virtual machine (VM)
hosted on Compute Engine. For Premium Tier, traffic enters Google's
network at an edge point of presence (POP) close to the user's location.
For Standard Tier, the user traffic enters Google's network at a POP close
to the destination region. For more information, see the Network Service
Tiers documentation.

Architecture with no load balancing

The following table shows the results when the user in Germany tested latency
of a system with no load balancing:

The TTFB latency is very stable, as shown in the following graph of the first
500 requests:

Latency to VM in ms graph

When pinging the VM IP address, the response comes directly from the web
server. The response time from the web server is minimal compared to the
network latency (TTFB). This difference arises because a new TCP connection
is opened for every HTTP request, and an initial three-way handshake is
needed before the HTTP response is sent, as shown in the following diagram.
Therefore, the latency observed by the user in Germany is roughly double
the ping latency.

Client-server HTTP request diagram
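
You can observe the handshake cost directly with curl's timing variables;
a sketch, with VM_IP as a placeholder for the server address:

    # time_connect captures the TCP handshake (one round trip);
    # time_starttransfer adds the HTTP request/response round trip.
    # With a fast web server, TTFB is roughly twice the connect time.
    curl -s -o /dev/null \
      -w 'tcp connect: %{time_connect}s  ttfb: %{time_starttransfer}s\n' \
      "http://VM_IP/"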

Network Load Balancing

With a network load balancer, user requests still enter the Google network
at the closest edge POP (in Premium Tier). In the region where the project's
VMs are located, traffic flows first through a Maglev load balancer, which
forwards it unchanged to the target backend VM. The Maglev load balancer
distributes traffic with a stable hashing algorithm over the source and
destination IP addresses and ports and the protocol. The VMs listen on the
load balancer IP address and accept the traffic unaltered.

Architecture with Network Load Balancing
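
To illustrate the idea of stable hashing, here is a simplified sketch only,
not Google's actual Maglev algorithm: a hash of the connection 5-tuple
deterministically picks a backend, so packets of one flow always reach the
same VM:

    # Illustrative only: a stable hash of the 5-tuple maps a flow to one
    # of three hypothetical backends; the same tuple always yields the
    # same backend.
    tuple="203.0.113.7:51234 198.51.100.10:80 tcp"  # src, dst, protocol
    hash=$(printf '%s' "$tuple" | cksum | cut -d' ' -f1)
    echo "backend-$(( hash % 3 ))"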

The following table shows the results when the user in Germany tested latency
for the network-load-balancing option:

Because load balancing takes place in-region and traffic is merely forwarded,
there is no significant latency impact compared with the no-load-balancer option.

HTTP(S)/TCP/SSL Proxy Load Balancing

With HTTP Load Balancing, traffic is proxied through Google Front Ends
(GFEs), which are typically located at the edge of Google's global network.
The GFE terminates the TCP session and connects to a backend in the closest
region that has capacity to serve the traffic.

HTTP Load Balancing scenario diagram

The following table shows the results when the user in Germany tested latency
for the HTTP-load-balancing option:

The results for HTTP Load Balancing are significantly different. When pinging
the HTTP load balancer, the round-trip latency is just over 1 ms. However,
this result represents latency to the closest GFE, which in this case is
located in the same city as the user. It says nothing about the actual
latency the user experiences when accessing the application hosted in the
us-central1 region. This shows that experiments using a protocol (ICMP) that
differs from your application's communication protocol (HTTP) can be
misleading.
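
A sketch of this comparison, with LB_IP as a placeholder for the load
balancer's address; the ping measures only the nearest GFE, while the TTFB
measures the full path to the backend:

    # ICMP round trip to the nearest GFE (misleadingly small).
    ping -c 5 LB_IP
    # HTTP time to first byte, which traverses the GFE to the backend.
    curl -s -o /dev/null -w 'ttfb: %{time_starttransfer}s\n' "http://LB_IP/"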

When measuring TTFB, the initial requests show roughly the same response
latency as before. Over the course of the run, subsequent requests achieve
a lower minimum latency of 123 ms, as shown in the following graph:

Latency to HTTP load balancer in ms graph

However, two round trips between the client and VM would take more than 123 ms
even with perfectly straight fiber. The reason for the lower latency
is that traffic is proxied through GFEs, which keep persistent connections open
to the backend VMs. Therefore, only the first request from a specific GFE
to a specific backend needs a three-way handshake.

Initial HTTP request via GFE diagram

There are multiple GFEs in each location. In the latency graph, you can see
multiple fluctuating spikes early on as traffic reaches each GFE-backend
pair for the first time, because differing request hashes route requests to
different GFEs. After all GFEs have been reached, subsequent requests show
the lower latency.

Subsequent HTTP request via GFE diagram

These scenarios demonstrate the reduced latency that users can experience in a
production environment. The following table summarizes the results:

Option                   Ping                                            TTFB
No load balancing        110 ms to the web server                        230 ms
Network Load Balancing   110 ms to the in-region network load balancer   230 ms
HTTP Load Balancing      1 ms to the closest GFE                         123 ms

When a healthy application is serving users in a specific region regularly,
all GFEs in that region generally have a persistent connection open to all
serving backends. Because of this, users in that region who are far from
the application backend notice significantly reduced latency on their first
HTTP request. If users are near the application backend, the improvement is
negligible, because the handshake round trip to a nearby backend is already
short.

For subsequent requests, such as when the user clicks a page link, no
latency improvement is observed, because modern browsers, unlike a curl
command issued from the command line, already keep a persistent connection
to the service that they can reuse.
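
You can mimic that browser behavior with curl, which reuses a connection
across multiple URLs passed in a single invocation; a sketch with a
placeholder URL:

    # The second transfer reuses the TCP connection from the first, so
    # its time to first byte omits the handshake round trip.
    curl -s -o /dev/null -o /dev/null \
      -w 'ttfb: %{time_starttransfer}s\n' \
      "http://example.com/" "http://example.com/"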

Additional latency effects of HTTP(S) Load Balancing

There are some additional observable effects with HTTP(S) Load Balancing
that depend on traffic patterns.

HTTP(S) Load Balancing has lower latency for complex assets than Network
Load Balancing because fewer round trips are needed before a response
completes. For example, when the user in Germany measured latency over the
same connection by repeatedly downloading a 10 MB file, the average latency
for Network Load Balancing was 1911 ms, compared to 1341 ms with HTTP Load
Balancing, a saving of approximately 5 round trips per request. This
reduction occurs because persistent connections between GFEs and serving
backends reduce the effects of TCP slow start.
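
A sketch of such a measurement, downloading the same object three times
over one connection (FILE_URL is a placeholder):

    # Repeated transfers on one connection let the congestion window stay
    # warm, so later downloads avoid a fresh TCP slow start.
    curl -s -o /dev/null -o /dev/null -o /dev/null \
      -w 'total: %{time_total}s\n' \
      "http://FILE_URL" "http://FILE_URL" "http://FILE_URL"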

HTTP(S) Load Balancing also significantly reduces the additional latency of
a TLS handshake (typically 1-2 extra round trips). This reduction is
because HTTP(S) Load Balancing uses SSL offloading, so only the latency to
the edge POP is relevant. For the user in Germany, the minimum observed
latency is 201 ms using HTTP(S) Load Balancing versus 525 ms using HTTP(S)
through the network load balancer.
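
You can isolate the TLS handshake cost with curl's timing variables; a
sketch with a placeholder URL:

    # time_connect covers the TCP handshake; time_appconnect additionally
    # covers the TLS handshake. Their difference approximates the TLS
    # handshake cost on this path.
    curl -s -o /dev/null \
      -w 'tcp: %{time_connect}s  tls: %{time_appconnect}s\n' \
      "https://example.com/"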

The HTTP(S) load balancer also allows an automatic upgrade of the
user-facing session to HTTP/2, which can reduce the number of packets
needed by using a binary protocol, header compression, and connection
multiplexing. These improvements can reduce observed latency even further
than switching to HTTP Load Balancing alone. HTTP/2 is used only with
current browsers over SSL/TLS. For the user in Germany, minimum latency
decreased further from 201 ms to 145 ms when using HTTP/2 instead of plain
HTTPS.
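
To verify that a session is upgraded, you can ask curl to report the
negotiated protocol version; a sketch with a placeholder URL:

    # Request HTTP/2 and print the protocol version the server negotiated
    # (prints "2" when the upgrade succeeds).
    curl -s --http2 -o /dev/null \
      -w 'negotiated HTTP version: %{http_version}\n' \
      "https://example.com/"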

Optimizing HTTP(S) Load Balancing

You can optimize latency for your application by using the HTTP(S) load balancer
as follows:

You can use any CDN partner with GCP. By using one of Google's CDN
interconnect partners, you benefit from
discounted egress costs.

If content is static, you can reduce the load on the web servers by serving
content directly from Google Cloud Storage through the HTTP(S) load
balancer (see the sketch after this list). This option combines seamlessly
with the CDN options mentioned previously.

To reduce latency inside your applications, examine any remote procedure calls
(RPCs) that communicate between VMs. This latency typically occurs when
applications communicate between tiers or services. Tools such as Stackdriver
Trace can help you minimize latency caused by application-serving
requests.
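
A minimal sketch of the static-content option, assuming a bucket named
my-static-assets already exists (the names are placeholders); the backend
bucket still needs to be referenced from the load balancer's URL map:

    # Create a backend bucket so the HTTP(S) load balancer can serve
    # objects in the bucket directly, bypassing the web servers.
    gcloud compute backend-buckets create static-assets-backend \
      --gcs-bucket-name=my-static-assets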

Because TCP Proxy and SSL Proxy Load Balancing are also based on GFEs, the
effect on latency is the same as observed with HTTP Load Balancing. Because
HTTP(S) Load Balancing offers more features than TCP/SSL proxy, we
recommend always using HTTP(S) Load Balancing for HTTP(S) traffic.

Next steps

We recommend that you deploy your application close to the majority of your
users and choose the best configuration for your use case. For more
information about the different load balancing options on GCP, see the
following documents: