This article describes an architecture for optimizing large-scale analytics
ingestion on Google Cloud Platform (GCP). For the purposes of this article, 'large-scale'
means more than 100,000 events per second, or a total aggregate event payload
size of more than 100 MB per second.

You can use GCP's elastic and scalable managed services to
collect vast amounts of incoming log and analytics events, and then process them
for entry into a data warehouse, such as
BigQuery.

Any architecture for ingestion of significant quantities of analytics data
should take into account which data you need to access in near real-time and
which you can handle after a short delay, and split them appropriately. A
segmented approach has these benefits:

Log integrity. You can see complete logs. No logs are lost due to
streaming quota limits or sampling.

Cost reduction. Streaming inserts of events and logs are billed at a
higher rate than those inserted from Cloud Storage using batch jobs.

Reserved query resources. Moving lower-priority logs to batch loading
keeps them from having an impact on reserved query resources.

The following architecture diagram shows such a system, and introduces the
concepts of hot paths and cold paths for ingestion:

[Diagram: architectural overview of hot-path and cold-path ingestion]

After ingestion from either source, data is put into either the hot path or the
cold path, based on the latency requirements of the message. The hot path uses
streaming input, which can handle a continuous dataflow, while the cold path is
a batch process that loads the data on a schedule you determine.

Logging events

You can use
Stackdriver Logging
to ingest logging events generated by standard operating system logging
facilities. Stackdriver Logging is available by default in a number of
Compute Engine environments, including the standard images, and can also be
installed on many operating systems by using the
Stackdriver Logging Agent.
Stackdriver Logging is also the default logging sink
for App Engine and Kubernetes Engine.
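If your services write their own application events, they can also send
structured entries straight to Stackdriver Logging with the Cloud client
library. The following minimal Python sketch assumes a hypothetical logger name
and payload; structured fields make it easy to select entries with a sink
filter later on.

```python
# A minimal sketch: write a structured entry directly to Stackdriver Logging.
# The logger name and payload fields are hypothetical.
from google.cloud import logging

client = logging.Client()
logger = client.logger("checkout_service")  # hypothetical logger name

# Structured payloads let sink filters select entries by field or severity.
logger.log_struct(
    {"event": "payment_failed", "order_id": "A-1001", "retries": 3},
    severity="ERROR",
)
```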

Hot path

In the hot path, critical logs required for monitoring and analysis of your
services are selected by specifying a filter in the
Stackdriver Logging sink
and then streamed to
multiple BigQuery tables.
Use separate tables for ERROR and WARN logging levels, and then split further
by service if high volumes are expected. This best practice keeps the number of
rows inserted per second per table under the 100,000 limit and keeps queries
against this data performing well.
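As an illustration, the hot-path sink can be created programmatically. The
following Python sketch uses the Stackdriver Logging client library to route
ERROR-level entries to a BigQuery dataset; the project, sink, and dataset names
are placeholders, and you still need to grant the sink's writer identity access
to the destination dataset.

```python
# A sketch of creating a hot-path sink that streams ERROR entries to BigQuery.
# Project, sink, and dataset names are placeholders.
from google.cloud import logging

client = logging.Client(project="my-project")

sink = client.sink(
    "hot-path-errors",            # hypothetical sink name
    filter_="severity=ERROR",     # select only the critical log entries
    destination="bigquery.googleapis.com/projects/my-project/datasets/hot_logs",
)
sink.create()
```

A second sink with a severity filter for warnings, pointed at its own table,
follows the same pattern.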

Cold path

For the cold path, logs that don't require near real-time analysis are selected
using a Stackdriver Logging sink pointed at a Cloud Storage bucket.
Logs are batched and written to log files in Cloud Storage every hour.
These logs can then be batch loaded into
BigQuery using the standard Cloud Storage file import process, which can be
initiated using the Google Cloud Platform Console, the command-line interface (CLI), or
even a simple script. Batch loading affects neither the hot path's streaming
ingestion nor query performance. In most cases, it's best to merge cold-path
logs into the same tables that the hot-path logs use, which simplifies
troubleshooting and report generation.
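The "simple script" can be as small as the following Python sketch, which
starts a batch load job from the hourly export objects into a BigQuery table.
The bucket, dataset, and table names are placeholders, and the exported logs
are assumed to be newline-delimited JSON.

```python
# A sketch of a batch load job from Cloud Storage into BigQuery.
# Bucket, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # merge with hot-path rows
    autodetect=True,
)

load_job = client.load_table_from_uri(
    "gs://my-log-bucket/cold-path/2018/06/01/*.json",  # hourly export objects
    "my-project.logs.hot_and_cold_events",
    job_config=job_config,
)
load_job.result()  # blocks until the batch load completes
```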

Analytics events

Analytics events can be generated by your app's services in GCP
or sent from remote clients. Ingesting these analytics events through
Cloud Pub/Sub and then processing them in Cloud Dataflow provides a high-throughput
system with low latency. Although it is possible to send the hot and cold
analytics events to two separate Cloud Pub/Sub topics, you should send all events
to one topic and process them using separate hot- and cold-path Cloud Dataflow jobs.
That way, you can change the path an analytics event follows by updating the
Cloud Dataflow jobs, which is easier than deploying a new app or client version.
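For example, a client or service might publish events with the Cloud Pub/Sub
client library as follows. The topic name and the priority attribute are
hypothetical; an attribute like this gives the Cloud Dataflow jobs something to
route on without changing the publisher.

```python
# A sketch of publishing analytics events to a single Cloud Pub/Sub topic.
# The topic name and the "priority" attribute are hypothetical.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "analytics-events")

event = {"type": "client_error", "user_id": "u-42", "code": 503}
future = publisher.publish(
    topic_path,
    data=json.dumps(event).encode("utf-8"),
    priority="hot",  # message attribute used by the Dataflow jobs for routing
)
future.result()  # wait for the publish to be acknowledged
```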

Hot path

Some events need immediate analysis. For example, an event might indicate
undesired client behavior or bad actors. You should cherry-pick such events from
Cloud Pub/Sub by using an autoscaling Cloud Dataflow
job and then
stream them directly to BigQuery.
The Cloud Dataflow job can partition this data across tables to ensure that the
per-table limit of 100,000 rows per second is not exceeded, which also keeps
queries against this data performing well.
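A hot-path job might look like the following Apache Beam (Python) sketch, which
reads from a subscription on the shared topic, keeps only the events it
considers hot, and streams them into BigQuery. The subscription, table, schema,
and filtering rule are all assumptions.

```python
# A sketch of a hot-path Cloud Dataflow job: Pub/Sub -> filter -> BigQuery.
# Subscription, table, schema, and the notion of a "hot" event are assumptions.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/hot-path-sub")
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "KeepHotEvents" >> beam.Filter(lambda e: e.get("type") == "client_error")
        | "StreamToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.hot_events",
            schema="type:STRING,user_id:STRING,code:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```

When run on the Cloud Dataflow service, autoscaling lets the job follow the
load on the Cloud Pub/Sub topic.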

Cold path

Events that need to be tracked and analyzed on an hourly or daily basis, but
never immediately, can be written by Cloud Dataflow to objects in Cloud Storage.
Loads from Cloud Storage into BigQuery can be initiated by using the Google
Cloud Platform Console, the CLI, or even a simple script, and you can merge the
events into the same tables as the hot-path events. Like the logging cold path,
batch-loaded analytics events do not have an impact on reserved query resources,
and they keep the load on the streaming ingestion path reasonable.
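The corresponding cold-path job can window events into hourly batches and write
them to Cloud Storage, as in the following Apache Beam (Python) sketch; the
subscription, bucket path, and window size are assumptions. The resulting
objects can then be loaded with the same kind of batch load script shown for
the logging cold path.

```python
# A sketch of a cold-path Cloud Dataflow job: Pub/Sub -> hourly windows -> GCS.
# Subscription, bucket path, and window size are assumptions.
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/cold-path-sub")
        | "ToJsonLines" >> beam.Map(lambda msg: msg.decode("utf-8"))
        | "HourlyWindows" >> beam.WindowInto(window.FixedWindows(60 * 60))
        | "WriteToGCS" >> fileio.WriteToFiles(
            path="gs://my-analytics-bucket/cold-path/",
            sink=lambda dest: fileio.TextSink(),  # one JSON line per event
        )
    )
```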