Processing Logs at Scale Using Cloud Dataflow

Google Cloud Platform provides the scalable infrastructure you need to handle
large and diverse log-analysis operations. In this solution, you will learn how
to use Cloud Platform to build analytical pipelines that process log entries
from multiple sources. You will combine the log data in ways that help you
extract meaningful information and persist the insights derived from the data,
which you can then use for analysis, review, and reporting.

Overview

As your application grows more complex, gleaning insights from the data captured
in your logs becomes more challenging. Logs come from an increasing number of
sources, so they can be hard to collate and query for useful information.
Building, operating, and maintaining your own infrastructure to analyze log data
at scale can require extensive expertise in running distributed systems and
storage. This kind of dedicated infrastructure often represents a one-time
capital expense, resulting in fixed capacity, which makes it hard to scale
beyond the initial investment. These limitations can impact your business
because they lead to slowdowns in generating meaningful, actionable insights
from your data.

This solution shows you how to move past these limitations by using Cloud
Platform. In this solution, a set of sample microservices run on
Google Kubernetes Engine
to implement a website.
Stackdriver Logging
collects logs from these services and then saves them to
Google Cloud Storage
buckets.
Google Cloud Dataflow
then processes the logs by extracting metadata and computing basic aggregations.
The Cloud Dataflow pipeline is designed to process the log elements daily to
generate aggregate metrics for server response times, based on the logs for each
day. Finally, the output from Cloud Dataflow is loaded into
Google BigQuery
tables, where it can be analyzed to provide business intelligence. This solution
also explains how you can change the pipeline to run in streaming mode, for
low-latency, asynchronous log processing.

The included tutorial provides a sample Cloud
Dataflow pipeline, a sample web application, configuration information, and
steps to run the sample.

About the app

The sample deployment models a shopping app. In this sample, users can visit the
homepage of a retail site, browse for individual products, and then try to
locate the products in nearby brick-and-mortar stores. The app consists of three
microservices: HomeService, BrowseService, and LocateService. Each service
is available from an API endpoint in a shared namespace. Users access the
services by appending /home, /browse, and /locate to the base URL.

The application is configured to log incoming HTTP requests to stdout.

Using Kubernetes Engine with Stackdriver Logging

In this example, the microservices run in a Kubernetes Engine cluster, which is a
group of
Google Compute Engine
instances, or nodes, that run
Kubernetes.
By default, Kubernetes Engine configures each node to provide a number of
services, including monitoring, health checking, and centralized logging. This
solution uses this built-in support for Stackdriver Logging to send logs from each
microservice to Cloud Storage. For applications that log information to files
rather than to stdout, an alternative approach, not covered by this solution, is
to configure cluster-level logging with Kubernetes.

Each microservice runs on an individual pod in
the cluster. Each pod runs on a node and is exposed as a single HTTP endpoint by
using Kubernetes Engine
services.

Each node in the cluster runs a
Stackdriver Logging agent
that captures the log messages. After the logs become available in
Stackdriver Logging, a script automatically
exports the logs
to a Cloud Storage bucket by using the
Stackdriver Logging
support available from the
Cloud SDK.
The commands that create these exports are in the sample's logging.sh script.

Note that you can also configure logs to be exported to Cloud Storage by using
the
Logs Viewer.
This solution uses the SDK because it's required when exporting multiple logs.

When using Cloud Storage as a logs export destination, log entries of type
LogEntry
are saved in hourly batches in individual JSON files. These structured
Stackdriver Logging entries include additional metadata that specifies when each
log message was created, which resource or instance generated it, what its
severity level is, and so on. In each Stackdriver Logging entry, the
structPayload.log element contains the original log message that the
microservice generated.

Creating the Cloud Dataflow pipeline

Cloud Dataflow is a simple, yet powerful system that can be used for many kinds
of data processing tasks. The
Dataflow SDK
provides a unified data model that can represent a data set of any size,
including an unbounded or infinite data set from a continuously updating data
source; it's well-suited for working with the log data in this solution. The
Dataflow managed service
can run both batch and streaming jobs. This means that you can use a single
codebase for asynchronous or synchronous, real-time, event-driven data
processing.

The Dataflow SDK provides simple data representations through a specialized
collection class named
PCollection.
The SDK provides built-in and custom
data transforms
through the PTransform class. In Cloud Dataflow, transforms represent the
processing logic of a pipeline. Transforms can be used for a variety of
processing operations such as joining data, computing values mathematically,
filtering data output, or converting data from one format to another. For more
information about pipelines, PCollections, transforms, and I/O sources and
sinks, see Dataflow Programming Model.
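
To make these concepts concrete, the following minimal sketch, written in the
style of the Dataflow SDK for Java, applies a ParDo transform to an assumed
PCollection of raw log lines; the logLines variable and the filter condition are
illustrative and not part of the sample:

    // Illustrative only: given an existing PCollection of raw log lines,
    // apply a PTransform (here, a ParDo) that keeps entries containing "ERROR".
    PCollection<String> errorLines = logLines.apply(
        ParDo.of(new DoFn<String, String>() {
          @Override
          public void processElement(ProcessContext c) {
            if (c.element().contains("ERROR")) {
              c.output(c.element());
            }
          }
        }));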

The following diagram shows the pipeline operations for the log data stored in
Cloud Storage:

Though the diagram might look complex, Cloud Dataflow makes it easy to build and
use the pipeline. The following sections describe the specific operations at
each stage of the pipeline.

Receiving the data

The pipeline starts by consuming input from the Cloud Storage buckets that
contain the logs from the three microservices. Each collection of logs becomes a
PCollection of String elements, where each element corresponds to a single
LogEntry object. In the sample code, homeLogs, browseLogs, and locateLogs are
of type PCollection<String>.
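
A sketch of this read step, assuming Dataflow SDK for Java idioms, an existing
Pipeline object named p, and placeholder bucket paths (the sample supplies its
actual paths through pipeline options):

    // Read each microservice's exported logs from Cloud Storage into a
    // PCollection of raw JSON strings. The gs:// paths are placeholders.
    PCollection<String> homeLogs =
        p.apply(TextIO.Read.from("gs://BUCKET/home_logs/*.json"));
    PCollection<String> browseLogs =
        p.apply(TextIO.Read.from("gs://BUCKET/browse_logs/*.json"));
    PCollection<String> locateLogs =
        p.apply(TextIO.Read.from("gs://BUCKET/locate_logs/*.json"));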

To deal with the challenges of a continuously updating data set, the Dataflow
SDK uses a technique called
windowing.
Windowing works by logically subdividing the data in a PCollection according to
the timestamps of its individual elements. Because the source type is
TextIO
in this case, all the objects are initially read into a
single global window,
which is the default behavior.

Collecting the data into objects

The next step combines the individual microservice PCollections into a
single PCollection by using the Flatten operation.
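
In Dataflow SDK for Java terms, this step can be sketched as follows (the
allLogs name is illustrative):

    // Merge the three log PCollections into one so that downstream
    // transforms can treat every log entry uniformly.
    PCollection<String> allLogs = PCollectionList
        .of(homeLogs).and(browseLogs).and(locateLogs)
        .apply(Flatten.<String>pCollections());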

This operation is useful because each source PCollection contains the same data
type and uses the same global windowing strategy. Although the sources and
structure of each log are the same in this solution, you could extend this
approach to handle logs whose sources and structures differ.

With a single PCollection created, you can now process the individual String
elements by using a custom transform that performs several steps on each log
entry. Here are the steps, followed by a condensed sketch:

1. Extract the following individual fields from the log message by using
   regular expressions: timestamp, responseTime, httpStatusCode, httpMethod,
   source IP address, and destination endpoint.
2. Use these fields to create a timestamped LogMessage custom object.
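
The sketch below condenses such a transform; the parseLogEntry helper and the
LogMessage accessor are hypothetical stand-ins for the sample's own parsing
code and regular expressions:

    // Parse each raw log line into a timestamped LogMessage element.
    // parseLogEntry is a hypothetical regex-based helper, and getTimestamp
    // is an assumed accessor returning a Joda-Time Instant.
    PCollection<LogMessage> logMessages = allLogs.apply(
        ParDo.of(new DoFn<String, LogMessage>() {
          @Override
          public void processElement(ProcessContext c) {
            LogMessage msg = parseLogEntry(c.element());
            if (msg != null) {
              // Attach the log's own timestamp so daily windowing can use it.
              c.outputWithTimestamp(msg, msg.getTimestamp());
            }
          }
        }));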

Aggregating the data by days

Recall that the goal is to process the elements on a daily basis to generate
aggregated metrics based on logs for each day. Achieving this aggregation
requires a windowing function that subdivides the data by day, which is possible
because each LogMessage in the PCollection has a timestamp. After Cloud
Dataflow partitions the PCollection along daily boundaries, operations that
support windowed PCollections will honor the windowing scheme.
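
Using the Dataflow SDK for Java, the daily windowing can be sketched like this
(Duration is the Joda-Time class used by the SDK; the variable names are
illustrative):

    // Subdivide the LogMessage elements into fixed one-day windows, using the
    // timestamps attached when the elements were created.
    PCollection<LogMessage> dailyLogMessages = logMessages.apply(
        Window.<LogMessage>into(FixedWindows.of(Duration.standardDays(1))));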

First, a transform takes LogMessage objects as input and then outputs a
PCollection of key-value pairs that map the destination endpoints as keys to
response time values. Using that PCollection, you can compute two aggregate
metrics: maximum response time per destination and average response time per
destination. Because the PCollection is still partitioned by day, the output of
each computation will represent a single day’s log data. This means that you
will have two final PCollections: one containing maximum response time per
destination per day and one containing average response time per destination per
day.
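
Assuming illustrative getDestination and getResponseTime accessors on
LogMessage, the key-value mapping and the two aggregations can be sketched with
the SDK's built-in combiners:

    // Map each LogMessage to a (destination endpoint, response time) pair.
    // getDestination and getResponseTime are illustrative accessor names.
    PCollection<KV<String, Double>> destResponseTimes = dailyLogMessages.apply(
        ParDo.of(new DoFn<LogMessage, KV<String, Double>>() {
          @Override
          public void processElement(ProcessContext c) {
            c.output(KV.of(c.element().getDestination(),
                           c.element().getResponseTime()));
          }
        }));

    // Because the PCollection is windowed by day, each aggregate covers one day.
    PCollection<KV<String, Double>> maxResponseTimes =
        destResponseTimes.apply(Max.<String>doublesPerKey());
    PCollection<KV<String, Double>> meanResponseTimes =
        destResponseTimes.apply(Mean.<String, Double>perKey());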

Loading the data into BigQuery

The final step in the pipeline outputs the resulting PCollections to BigQuery
for downstream analysis and data warehousing.

First, the pipeline transforms the PCollection that contains LogMessage
objects for all log sources into a PCollection of BigQuery TableRow objects.
This step is required in order to use Cloud Dataflow's built-in support for
BigQuery as a pipeline sink.

BigQuery tables require defined schemas. For this solution, the schemas are
defined in LogAnalyticsPipelineOptions.java by using a default-value
annotation. For example, one such annotation defines the schema for the
maximum-response-time table.
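
The following is an illustrative sketch of such a default-value annotation; the
getter and setter names and the schema string are examples rather than the
sample's exact definitions:

    // Inside an interface that extends PipelineOptions.
    @Description("BigQuery schema for the maximum-response-time table")
    @Default.String("destination:STRING,aggResponseTime:FLOAT")
    String getMaxRespTimeTableSchema();
    void setMaxRespTimeTableSchema(String schema);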

An operation on the PCollections that contain the aggregated response-time
values converts them into PCollections of TableRow objects, applying the
appropriate schemas and creating the tables if they're missing.

This solution always appends new data to the existing data. This is an
appropriate choice because this pipeline runs on a periodic basis to analyze new
log data. However, it's possible to truncate the existing table data or only
write to the table if it's empty, if one of those options makes more sense in
a different scenario.
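
A sketch of the corresponding write step, with placeholder names for the
TableRow PCollection, the destination table, and the TableSchema object:

    // Write the aggregated rows to BigQuery, creating the table if needed and
    // appending to any existing data. maxRespTimeRows (a PCollection of
    // TableRow objects), the table reference, and maxRespTimeSchema are
    // placeholders.
    maxRespTimeRows.apply(BigQueryIO.Write
        .to("PROJECT:DATASET.max_response_times")
        .withSchema(maxRespTimeSchema)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));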

Querying the data

The BigQuery console lets you run queries against the output data and connect to
third-party business intelligence tools such as
Tableau
and
QlikView
for additional analysis.

Using a streaming pipeline

The sample includes support for running the pipeline in either batch or
streaming mode. It takes only a few steps to change the pipeline from batch to
streaming. First, the Stackdriver Logging setup exports logging information to Cloud
Pub/Sub instead of Cloud Storage. The next step is to change the input sources
in the Cloud Dataflow pipeline from Cloud Storage to Cloud Pub/Sub topic
subscriptions. You need one subscription for each input source.

You can see the SDK commands in use in logging.sh.
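
On the pipeline side, the read step changes to something along these lines,
assuming the same Pipeline object p and placeholder subscription names:

    // Read log entries from Cloud Pub/Sub subscriptions instead of Cloud Storage.
    PCollection<String> homeLogs = p.apply(
        PubsubIO.Read.subscription("projects/PROJECT/subscriptions/home-logs"));
    PCollection<String> browseLogs = p.apply(
        PubsubIO.Read.subscription("projects/PROJECT/subscriptions/browse-logs"));
    PCollection<String> locateLogs = p.apply(
        PubsubIO.Read.subscription("projects/PROJECT/subscriptions/locate-logs"));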

PCollections created from Cloud Pub/Sub input data use an unbounded global
window. However, the individual entries already include timestamps, so it's not
necessary to extract timestamp data from the Stackdriver Logging LogEntry
object; you use the existing log timestamps to create the custom LogMessage
objects.

The rest of the pipeline stays as-is, including downstream flatten,
transformation, aggregation, and output operations.

Running the pipeline

The steps to configure, build, and deploy this Cloud Dataflow pipeline, along
with the steps required to deploy the microservices and configure the
Stackdriver Logging exports, can be found in the tutorial. When you run
the Cloud Dataflow job, you can use the Google Cloud Platform Console to monitor progress
and view information about each stage in the pipeline.

The following image shows the console user interface while running an example
pipeline:

Extending the solution

The pipeline and set of operations described in this solution can be extended in
a variety of ways. The most obvious extensions would be to perform
additional aggregations across the LogMessage data. For example, if session or
anonymized user information were included in the log output, you could create
aggregations around user activity. You could also use the
ApproximateQuantiles
transform to generate a distribution of response times.
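
For example, a quantile computation over the illustrative per-destination
response-time pairs sketched earlier might look like this:

    // Compute an approximate five-point distribution (minimum, quartiles,
    // maximum) of response times for each destination endpoint, reusing the
    // illustrative destResponseTimes PCollection from the aggregation sketch.
    PCollection<KV<String, List<Double>>> responseTimeQuantiles =
        destResponseTimes.apply(ApproximateQuantiles.<String, Double>perKey(5));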

The cost of running this tutorial will vary depending on run time. Use the
pricing calculator to generate a cost estimate based
on your projected usage. New Cloud Platform users may be eligible for a
free trial.