Monitoring overview

Viewing metrics

Go to Stackdriver in the
Google Cloud Platform Console to view Stackdriver monitoring dashboards or to define
Stackdriver alerts. You can also use the Stackdriver Monitoring API
to query and view metrics for subscriptions and topics.
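
For example, here is a minimal sketch of such a query, assuming the Python client library for the Monitoring API (google-cloud-monitoring); the project ID is a placeholder. It reads the most recent undelivered-message count for every subscription in a project:

```python
import time

from google.cloud import monitoring_v3

project_id = "my-project"  # placeholder project ID
client = monitoring_v3.MetricServiceClient()

# Look at the last 10 minutes of data.
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 600}, "end_time": {"seconds": now}}
)

results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": 'metric.type = "pubsub.googleapis.com/subscription/num_undelivered_messages"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    subscription = series.resource.labels["subscription_id"]
    latest = series.points[0].value.int64_value  # points are newest-first
    print(f"{subscription}: {latest} undelivered messages")
```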

Metrics and resource types

To see the usage metrics that Cloud Pub/Sub reports to
Stackdriver, view the Metrics List
in the Stackdriver documentation.

Note that these metrics are in bytes, whereas quota is measured in kilobytes.

Keeping subscribers healthy

Monitoring the backlog

To ensure that your subscribers are keeping up with the flow of messages, create
a dashboard that shows the following metrics, aggregated by resource, for all
your subscriptions:

subscription/num_undelivered_messages

subscription/oldest_unacked_message_age

Create alerts that will fire when these values are unusually large
in the context of your system. For instance, the absolute number of undelivered
messages is not necessarily meaningful. A backlog of a million messages might be
acceptable for a million message-per-second subscription, but unacceptable for a
one message-per-second subscription.
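
As an illustration, the following sketch creates such an alert programmatically, assuming the Python client library for the Monitoring API (google-cloud-monitoring); the project ID, threshold, and duration are placeholder values you would tune for your own system:

```python
from google.cloud import monitoring_v3

project_id = "my-project"  # placeholder project ID
client = monitoring_v3.AlertPolicyServiceClient()

# Fire when the oldest unacked message in any subscription is older than
# 10 minutes (600 s) for at least 5 minutes. Tune these values for your system.
policy = monitoring_v3.AlertPolicy(
    display_name="Pub/Sub backlog age",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        {
            "display_name": "oldest_unacked_message_age too high",
            "condition_threshold": {
                "filter": (
                    'metric.type = '
                    '"pubsub.googleapis.com/subscription/oldest_unacked_message_age" '
                    'AND resource.type = "pubsub_subscription"'
                ),
                "comparison": monitoring_v3.ComparisonType.COMPARISON_GT,
                "threshold_value": 600,
                "duration": {"seconds": 300},
                "aggregations": [
                    {
                        "alignment_period": {"seconds": 60},
                        "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_MAX,
                    }
                ],
            },
        }
    ],
)

created = client.create_alert_policy(
    name=f"projects/{project_id}", alert_policy=policy
)
print(f"Created alert policy: {created.name}")
```

You can define the same alert in the Stackdriver console if you prefer not to manage policies in code.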

Symptoms: Both the oldest_unacked_message_age and
num_undelivered_messages are growing in tandem.

Problem: Subscribers are not keeping up with the message volume.

Solutions:

Add more subscriber threads or processes.

Add more subscriber machines or containers.

Look for signs of bugs in your code that prevent it from successfully
acknowledging messages or processing them in a timely fashion
(see Monitoring ack deadline expiration).

Symptoms: A steady, small backlog size combined with a steadily growing
oldest_unacked_message_age.

Problem: Stuck messages. There may be a small number of messages that
cannot be processed.

Solutions:

Examine your application logs to understand whether some messages are
causing your code to crash. It's unlikely, but possible, that
the offending messages are stuck on Cloud Pub/Sub rather than in
your client. Raise a support case once
you are confident your code successfully processes each message.

Set up an alert that fires well in advance of the subscription's message
retention duration lapsing.

Monitoring ack deadline expiration

In order to reduce end-to-end latency of message delivery,
Cloud Pub/Sub allows subscriber clients a limited amount of time to
acknowledge a given message (known as the "ack deadline") before re-delivering
the message. If your subscribers take too long to acknowledge messages, the
messages will be re-delivered, resulting in the subscribers seeing duplicate
messages. This can happen for a number of reasons:

Your subscribers are under-provisioned (you need more threads or machines).

Each message takes longer to process than the message acknowledgement
deadline. Google Cloud Platform Client Libraries generally extend the
deadline for individual messages up to a configurable maximum. However, the
libraries also enforce an overall maximum extension period (a configuration
sketch follows this list).

Some messages consistently crash the client.
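
For example, here is a minimal configuration sketch, assuming the Python client library (google-cloud-pubsub); the project and subscription IDs, the worker count, and the lease duration are illustrative values, and process() is a trivial stand-in for your processing logic:

```python
from concurrent.futures import ThreadPoolExecutor

from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber.message import Message
from google.cloud.pubsub_v1.subscriber.scheduler import ThreadScheduler

project_id = "my-project"    # placeholder project ID
subscription_id = "my-sub"   # placeholder subscription ID

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def process(data: bytes) -> None:
    # Placeholder for your real processing logic.
    print(f"Processing {len(data)} bytes")

def callback(message: Message) -> None:
    process(message.data)
    message.ack()

# More worker threads let the client process more messages in parallel;
# max_lease_duration (seconds) bounds how long the client keeps extending the
# ack deadline of a single message before it is allowed to expire.
flow_control = pubsub_v1.types.FlowControl(max_messages=100, max_lease_duration=600)
scheduler = ThreadScheduler(ThreadPoolExecutor(max_workers=16))

streaming_pull_future = subscriber.subscribe(
    subscription_path,
    callback=callback,
    flow_control=flow_control,
    scheduler=scheduler,
)
streaming_pull_future.result()  # block and surface any streaming pull errors
```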

It can be useful to measure the rate at which subscribers miss the ack deadline.
The specific metric to use depends on the subscription type; see the Metrics List
for the metrics that apply to yours.

Excessive ack deadline expiration rates can result in costly inefficiencies in
your system. You pay for every redelivery and for attempting to process each
message repeatedly. Conversely, a small expiration rate (for example, 0.1-1%)
might be healthy.

Monitoring push subscriptions

For push subscriptions, you should also monitor these metrics:

subscription/push_request_count

Group this metric by response_code and subscription_id.
Since Cloud Pub/Sub push subscriptions use
response codes as implicit message acknowledgements, it is important to
monitor push request response codes. Because push subscriptions back off
exponentially when they encounter timeouts or errors, your backlog can grow
quickly depending on how your endpoint responds.

Consider setting an alert for high error rates (create a metric
filtered by response class), since those rates lead to slower delivery and a
growing backlog. However, push request counts are likely to be more useful as
a tool for investigating growing backlog size and age.
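
For example, the following sketch breaks the push request rate down by response code, assuming the Python client library for the Monitoring API (google-cloud-monitoring); the project ID and time window are placeholders:

```python
import time

from google.cloud import monitoring_v3

project_id = "my-project"  # placeholder project ID
client = monitoring_v3.MetricServiceClient()

# Look at the last hour of data.
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

# Align the raw counts into a per-second rate over one-minute windows.
aggregation = monitoring_v3.Aggregation(
    {
        "alignment_period": {"seconds": 60},
        "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": 'metric.type = "pubsub.googleapis.com/subscription/push_request_count"',
        "interval": interval,
        "aggregation": aggregation,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    subscription = series.resource.labels["subscription_id"]
    response_code = series.metric.labels["response_code"]
    rate = series.points[0].value.double_value  # points are newest-first
    print(f"{subscription} [{response_code}]: {rate:.2f} push requests/second")
```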

subscription/num_outstanding_messages

Cloud Pub/Sub generally limits the number of outstanding messages; aim for
fewer than 1000 outstanding messages in most situations. As a rule, once a
subscription's throughput reaches a rate on the order of ten thousand messages
per second, the service adjusts the limit in increments of 1000 based on the
subscription's overall throughput. No specific guarantees are made beyond the
maximum value, so 1000 is a good guide.

subscription/push_request_latencies

This metric helps you understand the distribution of your push endpoint's
response latencies. Because of the limit on the number of outstanding messages,
endpoint latency effectively caps subscription throughput: with roughly 1000
outstanding messages allowed and an endpoint that takes 100 seconds to process
each message, throughput is limited to about 1000 / 100 = 10 messages per second.

Keeping publishers healthy

The primary goal of a publisher is to persist message data quickly. Monitor this
performance using topic/send_request_count, grouped by
response_code. This metric gives you an indication of whether
Cloud Pub/Sub is healthy and accepting requests.

A background rate of retryable errors (well below 1%) should
not be a cause for concern, since most GCP Client Libraries
retry failed publish requests. Investigate error rates that are greater
than 1%. Because non-retryable codes are handled by your application (rather
than the client library), you should examine response codes. If your publisher
application does not have a good way of signaling an unhealthy state, consider
setting an alert on the send_request_count metric.

It is equally important to track failed publish requests in your publish client.
While client libraries generally retry failed requests, they do not guarantee
publication. Refer to Publishing messages for
ways to detect permanent publish failures when using GCP Client
Libraries. At a minimum, your publisher should log permanent publish errors. If
you log those errors to Stackdriver Logging, you can set up a
logs-based metric with an alert.
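
For example, here is a minimal sketch of one way to detect and log permanent publish failures, assuming the Python client library (google-cloud-pubsub); the project and topic IDs are placeholders, and logging uses the standard library:

```python
import logging

from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.publisher.futures import Future

project_id = "my-project"  # placeholder project ID
topic_id = "my-topic"      # placeholder topic ID

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)

def on_publish_done(future: Future) -> None:
    try:
        # result() re-raises any error the client library gave up retrying.
        message_id = future.result()
        logging.debug("Published message %s", message_id)
    except Exception:
        # Permanent publish failure: log it. If these logs are exported to
        # Stackdriver Logging, you can build a logs-based metric and alert on it.
        logging.exception("Permanent publish failure for %s", topic_path)

future = publisher.publish(topic_path, b"payload")
future.add_done_callback(on_publish_done)
future.result()  # block until this publish settles (for illustration only)
```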