Cookbook Examples

The documentation on this page applies only to the Dataflow SDK 1.x for
Java.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.
See the documentation for those SDKs.

There are a number of example pipelines available in the
Cookbook Examples
directory (master-1.x branch) on GitHub. These pipelines illustrate how to use the SDK classes to implement common data processing design
patterns. This document briefly describes each example and provides a link to the source code.

See End-to-End Examples for a list of complete examples that
illustrate common end-to-end pipeline patterns using sample scenarios.

Important Keywords

The following terms will appear throughout this document:

Pipeline - A Pipeline is the code you write to
represent your data processing job. Dataflow takes the pipeline code and uses it to build a
job.

PCollection - PCollection is a special class
provided by the Dataflow SDK that represents a typed data set of virtually any size.

Transform - In a Dataflow pipeline, a transform represents a step: a
processing operation that transforms data.
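
A minimal sketch ties these three terms together. It is written against the Dataflow SDK 1.x for Java, and the bucket paths are placeholders rather than real resources:

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.values.PCollection;

    public class MinimalPipeline {
      public static void main(String[] args) {
        // The Pipeline: the code that represents the data processing job.
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // A PCollection: a typed data set of virtually any size. TextIO.Read
        // is a transform: a step that produces the collection.
        PCollection<String> lines =
            p.apply(TextIO.Read.from("gs://your-bucket/input.txt"));

        // Another transform: a step that writes the collection back out.
        lines.apply(TextIO.Write.to("gs://your-bucket/output"));

        p.run();
      }
    }

The later sketches on this page assume a Pipeline p created this way, along with the standard SDK imports.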

BigQueryTornadoes

The BigQueryTornadoes
pipeline reads public samples of weather data from BigQuery and counts the number of tornadoes
that occur each month.

The pipeline applies a user-defined transform
that takes rows from a table and generates a table of counts. The user-defined transform in this
example is composed of several transforms, one of which is the SDK-provided
Count
transform. Count is used here to count the total number of tornadoes found in
each month. The pipeline then writes the results to BigQuery.
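
The core steps look roughly like the following sketch. The table name matches the published sample, but treat the field handling as illustrative:

    // Read the public weather samples as TableRow objects.
    PCollection<TableRow> rows =
        p.apply(BigQueryIO.Read.from("clouddataflow-readonly:samples.weather_stations"));

    // Emit the month for every row that recorded a tornado.
    PCollection<Integer> tornadoMonths = rows.apply(
        ParDo.of(new DoFn<TableRow, Integer>() {
          @Override
          public void processElement(ProcessContext c) {
            TableRow row = c.element();
            if ((Boolean) row.get("tornado")) {
              c.output(Integer.parseInt((String) row.get("month")));
            }
          }
        }));

    // Count.perElement yields one KV<month, count> pair per distinct month.
    PCollection<KV<Integer, Long>> tornadoCounts =
        tornadoMonths.apply(Count.<Integer>perElement());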

CombinePerKeyExamples

You'll often find that you need to combine values in your pipeline's data. The Dataflow SDK
provides a number of Combine
transforms that you can use to merge the values in your pipeline's PCollection
objects. Combine.perKey, for example, groups a collection of tuples by key and then
performs some combination of all the values that share that key. You can combine the values by
supplying Combine.perKey with one of the predefined
combine functions, or your own custom
combine function.
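
For example, a predefined combine function such as a sum can be applied per key. This minimal sketch uses made-up inline data:

    PCollection<KV<String, Integer>> scores = p.apply(
        Create.of(KV.of("alice", 3), KV.of("bob", 5), KV.of("alice", 4))
            .withCoder(KvCoder.of(StringUtf8Coder.of(), VarIntCoder.of())));

    // Sum.integersPerKey is one of the SDK's predefined per-key combines.
    PCollection<KV<String, Integer>> totals =
        scores.apply(Sum.<String>integersPerKey());  // alice -> 7, bob -> 5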

CombinePerKeyExamples
reads a set of Shakespeare's plays from Google BigQuery,
and for each word in the dataset that exceeds a given length, it generates a string containing the
list of play names in which that word appears. The pipeline demonstrates how to combine values in
a key-grouped collection: it applies Combine.perKey with a user-defined combine
function that builds a comma-separated string of all input items. That is, for a collection of
key-value pairs where both key and value are strings, the function concatenates all the string
values grouped under the same key into a single string.
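
A hedged sketch of that combine step: Combine.perKey takes a SerializableFunction that folds all the values for a key into one result. The word-to-play pairs here are invented placeholders:

    PCollection<KV<String, String>> wordToPlay = p.apply(
        Create.of(KV.of("profound", "hamlet"), KV.of("profound", "macbeth"))
            .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));

    // Build one comma-separated string of play names per word.
    PCollection<KV<String, String>> playsPerWord = wordToPlay.apply(
        Combine.<String, String>perKey(
            new SerializableFunction<Iterable<String>, String>() {
              @Override
              public String apply(Iterable<String> plays) {
                StringBuilder all = new StringBuilder();
                for (String play : plays) {
                  if (all.length() > 0) {
                    all.append(",");
                  }
                  all.append(play);
                }
                return all.toString();
              }
            }));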

In addition to combining values in your pipeline's data, CombinePerKeyExamples
demonstrates the use of Aggregators.
Aggregators are used to track certain information to display in the
Dataflow Monitoring UI.
The CombinePerKeyExamples pipeline uses an Aggregator to track the
number of words that are shorter than the given length (and are, therefore, not included in the
final result).
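
A minimal sketch of such an Aggregator inside a DoFn; the class name and the length threshold are invented for illustration:

    static class FilterShortWordsFn extends DoFn<KV<String, String>, KV<String, String>> {
      private static final int MIN_WORD_LENGTH = 9;  // illustrative threshold

      // The running total appears in the Dataflow Monitoring UI.
      private final Aggregator<Long, Long> shortWordCount =
          createAggregator("shortWords", new Sum.SumLongFn());

      @Override
      public void processElement(ProcessContext c) {
        if (c.element().getKey().length() < MIN_WORD_LENGTH) {
          shortWordCount.addValue(1L);  // tracked, but not emitted
        } else {
          c.output(c.element());
        }
      }
    }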

DatastoreWordCount

DatastoreWordCount consists of two examples. The first example,
writeDataToDatastore, creates a pipeline that reads input from a text file and
uses DatastoreIO to write the data to Datastore. The second example, readDataFromDatastore,
creates a pipeline that performs a DatastoreIO.Read to read Shakespeare's plays from Datastore,
counts the number of times each word occurs in the plays, and writes the results to a text file.
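
The DatastoreIO read and write calls vary across SDK 1.x releases, so this sketch shows only the counting stage, assuming the read has already produced text lines (created inline here):

    PCollection<String> lines =
        p.apply(Create.of("to be or not to be", "that is the question"));

    PCollection<KV<String, Long>> wordCounts = lines
        .apply(ParDo.of(new DoFn<String, String>() {
          @Override
          public void processElement(ProcessContext c) {
            // Split each line into words and emit them individually.
            for (String word : c.element().split("[^a-zA-Z']+")) {
              if (!word.isEmpty()) {
                c.output(word);
              }
            }
          }
        }))
        .apply(Count.<String>perElement());  // one KV<word, count> per word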

DeDupExample

DeDupExample
demonstrates how to remove duplicate records from a data set. In the example, the pipeline's
input is a set of Shakespeare's plays as plain text files.

The example shows how to use the SDK-provided
RemoveDuplicates transform to remove the duplicate records (in this case, text lines).
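
A minimal sketch of that step; the bucket paths are placeholders:

    PCollection<String> lines =
        p.apply(TextIO.Read.from("gs://your-bucket/shakespeare/*"));

    // Drop exact duplicate lines across the whole data set.
    PCollection<String> uniqueLines =
        lines.apply(RemoveDuplicates.<String>create());

    uniqueLines.apply(TextIO.Write.to("gs://your-bucket/deduped"));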

FilterExamples

FilterExamples
demonstrates different approaches to filtering. One approach is
projection in the relational database sense, which derives a desired subset of the set
of all columns found in a table. Another class of filtering outputs a collection of
elements based on whether some condition holds; the condition can be based on a statically-defined
value, a parameter defined at pipeline creation time, or a value derived by the pipeline itself.

The FilterExamples pipeline reads public samples of weather data from
BigQuery. It performs a projection on the data, finds the global mean of the temperature readings,
filters on readings for a single given month, and then outputs only the data (for that month)
whose mean temperature is below the derived global mean.

Often, it's useful for a filter condition to be based on a value that's derived by the
pipeline itself. The derived value can be used as a
side input. FilterExamples shows how to filter on a side input; it contains
a composite transform that reads weather data, extracts only the temperature value from each
element, and then uses that collection as the input to a global mean derivation. The result of
that derivation is then used to filter the collection, outputting only elements with a
temperature below that global mean.
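
A hedged sketch of the side-input pattern, reduced to raw readings created inline; the real pipeline works on BigQuery TableRow elements instead:

    PCollection<Double> readings = p.apply(
        Create.of(55.2, 72.8, 61.5, 48.9).withCoder(DoubleCoder.of()));

    // Derive the global mean and expose it as a singleton side input.
    final PCollectionView<Double> globalMean = readings
        .apply(Mean.<Double>globally())
        .apply(View.<Double>asSingleton());

    // Keep only the readings below the derived mean.
    PCollection<Double> belowMean = readings.apply(
        ParDo.withSideInputs(globalMean).of(new DoFn<Double, Double>() {
          @Override
          public void processElement(ProcessContext c) {
            if (c.element() < c.sideInput(globalMean)) {
              c.output(c.element());
            }
          }
        }));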

JoinExamples

JoinExamples
shows how to perform relational joins between
datasets by using Dataflow's CoGroupByKey transform. CoGroupByKey can
join values from multiple PCollections that share a common key.

JoinExamples uses sample data from two BigQuery tables, one with current events
indexed by country code, and the other mapping country codes to country names. The example
demonstrates joining the data from the two tables on the country code key.
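
A hedged sketch of the CoGroupByKey mechanics, with invented inline data standing in for the BigQuery reads (the join classes live in the SDK's transforms.join package):

    final TupleTag<String> eventTag = new TupleTag<String>();
    final TupleTag<String> nameTag = new TupleTag<String>();

    PCollection<KV<String, String>> events = p.apply(
        Create.of(KV.of("VN", "Elections held"), KV.of("BE", "Summit hosted"))
            .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));
    PCollection<KV<String, String>> countries = p.apply(
        Create.of(KV.of("VN", "Vietnam"), KV.of("BE", "Belgium"))
            .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));

    // Group both collections by their shared country-code key.
    PCollection<KV<String, CoGbkResult>> joined =
        KeyedPCollectionTuple.of(eventTag, events)
            .and(nameTag, countries)
            .apply(CoGroupByKey.<String>create());

    // Pair every event with the matching country name.
    PCollection<String> formatted = joined.apply(
        ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
          @Override
          public void processElement(ProcessContext c) {
            String name = c.element().getValue().getOnly(nameTag, "unknown");
            for (String event : c.element().getValue().getAll(eventTag)) {
              c.output(c.element().getKey() + ": " + name + ", " + event);
            }
          }
        }));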