Apache Kafka With Spark Streaming: Real-Time Analytics Redefined


Apache projects like Kafka and Spark continue to be popular when it comes to stream processing. Engineers have started integrating Kafka with Spark streaming to benefit from the advantages both of them offer. This webinar discusses the advantages of Kafka, different components and use cases along with Kafka-Spark integration.

The above video covers the following topics; you can check out the presentation at the end of the post:

What is Kafka?

Why do we need Kafka?

Kafka components

How does Kafka work?

Which companies are using Kafka?

Kafka and Spark integration

What is Kafka & why do we need it?

Kafka is an open-source message broker project developed by the Apache Software Foundation and is written in Scala. Kafka’s objective is to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka’s design is predominantly based on transaction logs.

Kafka is used in scenarios that call for scalability, data partitioning, low latency, and the ability to handle large volumes of data. In short, Kafka is a strong fit for data-integration use cases.

Kafka is adopted in the following use cases:

Transportation of logs

Activity stream in real-time

CPU/IO/Memory usage

Time taken to load a web page

Time taken by multiple services while building a web page

Number of requests

Number of hits on a particular page/URL

Why did LinkedIn build Kafka?

Kafka was originally developed at LinkedIn and became an Apache project in July 2011. It is used by LinkedIn for applications including log aggregation, queuing, and real-time monitoring and event processing.

Kafka benchmarks

LinkedIn relies heavily on the scalability and reliability of Kafka and a surrounding ecosystem of both open-source and internal components. In one benchmark, a Kafka cluster was set up on three machines, each with six drives directly mounted with no RAID (JBOD style).

How fast is Kafka?

Kafka can sustain up to two million writes per second on three inexpensive machines. This was measured using three producers on three different machines with 3x asynchronous replication; only one producer ran per machine because the NIC was already saturated. Throughput remained sustained as the volume of stored data grew.

Why is Kafka so fast?

Fast writes:

Apache Kafka persists all data to disk, but writes land first in the OS page cache (i.e., RAM); with the hardware and OS tuned appropriately, these sequential appends are fast.

Fast reads:

Kafka efficiently transfers data from the page cache directly to a network socket through the sendfile() system call on Linux, avoiding extra copies through user space.
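
The zero-copy idea can be sketched in a few lines of Python using os.sendfile(), which wraps the same Linux system call. This is a standalone illustration, not Kafka code: the temporary file stands in for a log segment and a Unix socket pair stands in for the broker-to-consumer connection (all names here are illustrative).

```python
import os
import socket
import tempfile

# Write some bytes to a temporary file standing in for a log segment.
payload = b"kafka-message\n" * 64
with tempfile.NamedTemporaryFile(delete=False) as segment:
    segment.write(payload)
    path = segment.name

# A connected socket pair stands in for the broker-to-consumer connection.
server_side, consumer_side = socket.socketpair()

# sendfile() pushes bytes from the file (served from the page cache)
# straight to the socket, without copying through a user-space buffer.
with open(path, "rb") as src:
    sent = os.sendfile(server_side.fileno(), src.fileno(), 0, len(payload))

# Drain the consumer side of the socket.
received = b""
while len(received) < sent:
    received += consumer_side.recv(4096)

print(sent, received == payload)  # 896 True
os.unlink(path)
server_side.close()
consumer_side.close()
```

In a real broker the saving comes from skipping the read-into-user-space / write-back-to-kernel round trip; the data moves kernel-side from cache to socket.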

The combination of fast writes and fast reads is what makes Kafka fast.

On a Kafka cluster where the consumers are mostly caught up, you will see no read activity on the disks, as the brokers serve data entirely from cache.

Example: According to loggly.com, which runs Kafka on Amazon AWS, “99.99999% of the time the data is coming from disk cache and RAM; only very rarely do we hit the disk.”

“One of our consumer groups (8 threads) which maps a log to a customer can process about 200,000 events per second draining from 192 partitions spread across three brokers.”

Kafka Components

In Kafka, a message stream is defined by a topic, divided into one or more partitions. Replication happens at the partition level and each partition has one or more replicas.

The replicas are assigned evenly to different servers (called brokers) in a Kafka cluster. Each replica maintains a log on disk. Published messages are appended sequentially in the log and each message is identified by a monotonically increasing offset within the log.

When a consumer subscribes to a topic, it keeps track of an offset in each partition for consumption and uses it to issue fetch requests to the broker.
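
The log-plus-offset model described above can be sketched as a toy in-memory version in Python. This is not Kafka's API; the class and method names are invented for illustration, and replication and brokers are left out:

```python
class PartitionLog:
    """Toy in-memory stand-in for one partition's on-disk log."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        # Messages are appended sequentially; the offset is just the
        # position in the log, so it increases monotonically.
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the appended message

    def fetch(self, offset, max_messages=10):
        # A fetch request returns messages starting at the given offset.
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Tracks one offset per partition and issues fetch requests."""

    def __init__(self, partitions):
        self.offsets = {p: 0 for p in partitions}

    def poll(self, logs):
        batch = {}
        for p, offset in self.offsets.items():
            messages = logs[p].fetch(offset)
            batch[p] = messages
            # Advance the offset past what was just consumed.
            self.offsets[p] = offset + len(messages)
        return batch


# One topic with two partitions; events alternate between them.
logs = {0: PartitionLog(), 1: PartitionLog()}
for i in range(3):
    logs[i % 2].append(f"event-{i}")

consumer = Consumer(partitions=[0, 1])
first = consumer.poll(logs)
print(first)             # {0: ['event-0', 'event-2'], 1: ['event-1']}
print(consumer.offsets)  # {0: 2, 1: 1}
```

Because the consumer, not the broker, owns the offsets, re-reading a partition is as simple as resetting an offset, which is one reason Kafka consumers are cheap for the broker to support.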

Kafka adoption and use cases

Here’s a list of companies that have adopted Kafka and their corresponding use cases:

LinkedIn – Activity streams, operational metrics, data bus

Netflix – Real-time monitoring and event processing

Twitter – Part of their Storm real-time data pipelines

Spotify – Log delivery (from 4h down to 10s), Hadoop

Loggly – Log collection and processing

Mozilla – Telemetry data

Other companies like Airbnb, Cisco, Gnip, InfoChimps, Ooyala, Square, Uber and many more have also adopted Kafka.

Webinar Presentation

Questions asked during the webinar

1. Does Kafka help to build lineage?

No, Kafka just publishes the data.

2. What is an example configuration of a scalable system?

It depends on the size of the messages being passed, plus the RAM available on the nodes where you are going to set up Kafka.

3. Is aggregation possible in Kafka?

Aggregation is not something Kafka was designed to do.

Got a question for us? Please mention it in the comments section and we will get back to you.