Hadoop HDFS is typically adopted in situations where traditional storage and database systems are either reaching their limits or have already surpassed them. This usually implies that there are one or more large streams of events that need to be collected, such as log data streams. Flume NG was designed from the ground-up to tackle this problem in a straightforward, scalable, reliable way, and empirical results support the success of its approach.

At a high level, Flume NG has a simple well-designed architecture consisting of a set of agents with each agent running any number of sources, channels (event buffers), and sinks. Flume agents can easily be chained across the network to provide a configurable pipeline through which discrete events flow reliably from source (i.e., an application server) to destination (i.e., a Hadoop HDFS cluster). Flume can be configured to support arbitrary data flows, including fan-in (data aggregation) and fan-out (data replication) designs. Such designs are primarily an artifact of the generality of the agent-based architecture.

In this tutorial, a group of people closely involved with Flume walk participants through setting up a typical data collection infrastructure using Flume. We first describe the basic architecture of Flume including its design, the transactional semantics it supports for reliability, and the sources, channels, and sinks included with the Flume core. We then move on to a brief description of common data flow architectures, and choose a typical data collection scenario for which we use Flume to do the heavy lifting. Next we come to the main body of this tutorial session, which is a walkthrough of installing, configuring, and tuning a scalable, reliable, and fault-tolerant Flume-based data collection system for storing events into a Hadoop system in real time.

Throughout this presentation we also cover: (1) how to configure Flume to store data on a secure HDFS cluster, (2) configuration options used to trade off between performance and fault tolerance, (3) Avro support, (4) Flume extension points, plugins, and hooks, (5) Flume compatibility with various versions of Hadoop, (6) performance benchmarks, and (7) general best practices for using Flume NG effectively.

People planning to attend this session also want to see:

Hari Shreedharan

Cloudera Inc.

Hari is a Software Engineer at Cloudera, where he is working on building Apache Flume. Previously, Hari was a software engineer on Yahoo! Mail’s metadata indexing and query team. He holds a Masters from Cornell University in Computer Science.

Will McQueen

Cloudera Inc.

Arvind Prabhakar

Cloudera

Arvind is the PMC Chair for Apache Sqoop and a committer and PPMC member of Apache Flume. A seasoned enterprise software developer, Arvind has worked at Netscape, Sun Microsystems, Informatica and currently at Cloudera.