In a separate terminal, connect to the agent via netcat and enter a series of messages.

> nc localhost 11111
> testing
> 1
> 2
> 3

In the agent logs, an HDFS file will be created and committed every 30 seconds. If I print the contents of the files to standard out using HDFS cat, I'll find the messages from netcat are stored.

More Advanced Deployments

Regardless of source, direct writers to HDFS are too simple to be suitable for many deployments: Application servers may reside in the cloud while clusters are on-premise, many streams of data may need to be consolidated, or events may need to be filtered during transmission. Fortunately, Flume easily enables reliable multi-hop event transmission. Fan-in and fan-out patterns are readily supported via multiple sources and channel options. Additionally, Flume provides the notion of interceptors, which allow the decoration and filtering of events in flight.

Multi-Hop Topologies

Flume provides multi-hop deployments via Apache Avro-serialized RPC calls. For a given hop, the sending agent implements an Avro sink directed to a host and port where the receiving agent is listening. The receiver implements an Avro source bound to the designated host-port combination. Reliability is ensured by Flume's transaction model. The sink on the sending agent does not close its transaction until receipt is acknowledged by the receiver. Similarly, the receiver does not acknowledge receipt until the incoming event has been committed to its channel.

Fan-In and Fan-Out

Fan-in is a common case for Flume agents. Agents may be run on many data collectors (such as application servers) in a large deployment, while only one or two writers to a remote Hadoop cluster are required to handle the total event throughput. In this case, the Flume topology is simple to configure. Each agent at a data collector implements the appropriate source and an Avro sink. All Avro sinks point to the host and port of the Flume agent charged with writing to the Hadoop cluster. The agent at the Hadoop cluster simply configures an Avro source on the designated host and port. Incoming events are consolidated automatically and are written to the configured sink.

Fan-out topologies are enabled via Flume's source selectors. Selectors can be replicating  sending all events to multiple channels  or multiplexing. Multiplexed sources can be partitioned by mappings defined on events via interceptors. For example, a replicating selector may be appropriate when events need to be sent to HDFS and to flat log files or a database. Multiplexed selectors are useful when different mappings should be directed to different writers; for example, data destined for partitioned Hive tables may be best handled via multiplexing.

Interceptors

Flume provides a robust system of interceptors for in-flight modification of events. Some interceptors serve to decorate data with metadata useful in multiplexing the data or processing it after it has been written to the sink. Common decorations include timestamps, hostnames, and static headers. It's a great way to keep track of when your data arrived and from where it came.

More interestingly, interceptors can be used to selectively filter or decorate events. The Regex Filtering Interceptor allows events to be dropped if they match the provided regular expression. Similarly, the Regex Extractor Interceptor decorates event headers according to a regular expression. This is useful if incoming events require multiplexing, but static definitions are too inflexible.

Conclusion

There are lots of ways to acquire Big Data with which to fill up a Hadoop cluster, but many of those data sources arrive as fast-moving streams of data. Fortunately, the Hadoop ecosystem contains a component specifically designed for transporting and writing these streams: Apache Flume. Flume provides a robust, self-contained application which ensures reliable transportation of streaming data. Flume agents are easy to configure, requiring only a property file and an agent name. Moreover, Flume's simple source-channel-sink design allows us to build complicated flows using only a set of Flume agents. So, while we don't often address the process of acquiring Big Data for our Hadoop clusters, doing so is as easy and fun as taking a log ride.

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task.
However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Video

This month's Dr. Dobb's Journal

This month,
Dr. Dobb's Journal is devoted to mobile programming. We introduce you to Apple's new Swift programming language, discuss the perils of being the third-most-popular mobile platform, revisit SQLite on Android
, and much more!