Apache Pulsar reaches 2.0

May 31, 2018

Matteo Merli

We here at Streamlio are very excited to see the announcement of the release of Apache Pulsar 2.0. This release is the culmination of several months of work that have brought huge improvements to Pulsar, including an array of new features and performance improvements, as well as easier onboarding for new users. In Pulsar 2.0 you’ll see:

Pulsar Functions

The Pulsar Functions feature provides the easiest possible way to implement application-specific processing logic of any complexity and execute it in a managed runtime. Developers only need to supply a “function” in their preferred language (Java and Python are supported in the 2.0 release, with more to come later) and Pulsar will handle all of the orchestration and execution. Pulsar Functions is similar to Lambda-style functions but is specifically designed to use Pulsar as a message bus.

Developers will love Pulsar Functions because they require minimal boilerplate and are easy to reason about, and operators will love them because they provide a processing engine inside of Pulsar, which frees them from the need to stand up a separate system.
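To give a sense of how little boilerplate is involved, here is a minimal sketch of a Python function in the language-native style, assuming the convention of a module that exposes a `process` function. The file name and the exclamation-mark logic are placeholders chosen for illustration, and deployment details (input and output topics, function name) are left out.

```python
# exclamation.py: sketch of a language-native Pulsar Function.
# Assumption: the runtime calls process() once per incoming message and
# publishes the return value to the function's output topic.

def process(input):
    """Append an exclamation mark to each incoming message."""
    return "{}!".format(input)


if __name__ == "__main__":
    # Local smoke test; the logic can be exercised without a Pulsar cluster.
    print(process("hello"))
```

Because the function is ordinary Python with no framework imports, it can be unit tested like any other code before it is handed to Pulsar.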

Schema registry

The introduction of native support for topic schemas in Pulsar means that you can declare how message data looks and have Pulsar enforce that producers can only publish valid data on the topics.

Unlike in other systems, the schema registry is tightly integrated into Pulsar. This opens up a new set of possibilities: consumers can auto-discover the schema of a topic, generic tools can visualize the data, and the building blocks are in place to ensure end-to-end type checking on Pulsar topics.

Another advantage of schemas is that the client library can now take care of the serialization/deserialization automatically, as the Pulsar client API has been extended to be “type safe.” With topic schemas you can simply publish objects and have Pulsar clients serialize them internally using the previously defined schema.
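The idea can be sketched in a few lines of self-contained Python. This is a conceptual model only and does not use the Pulsar client API: `SensorReading`, `TypedTopic`, and the JSON encoding are hypothetical stand-ins for a declared schema, a schema-aware topic, and the client's internal serialization.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class SensorReading:
    """Hypothetical message type standing in for a topic schema."""
    sensor_id: str
    temperature: float


class TypedTopic:
    """Toy in-memory topic that only accepts values matching its schema."""

    def __init__(self, schema):
        self.schema = schema
        self.stored = []  # serialized payloads, as a broker would hold bytes

    def publish(self, value):
        # Enforce the schema: reject anything that is not the declared type
        # or whose fields hold values of the wrong Python type.
        if not isinstance(value, self.schema):
            raise TypeError("expected " + self.schema.__name__)
        for f in fields(self.schema):
            if not isinstance(getattr(value, f.name), f.type):
                raise TypeError("field %s must be %s" % (f.name, f.type.__name__))
        # Serialization happens inside the "client", not in application code.
        self.stored.append(json.dumps(asdict(value)).encode("utf-8"))

    def consume(self):
        # Deserialize back into typed objects using the same schema.
        return [self.schema(**json.loads(raw)) for raw in self.stored]


if __name__ == "__main__":
    topic = TypedTopic(SensorReading)
    topic.publish(SensorReading("s-1", 21.5))
    print(topic.consume())
```

The application code only ever handles `SensorReading` objects; the byte-level encoding and the validity check live in one place, which is the property that topic schemas bring to producers and consumers.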

Topic compaction

The topic compaction feature allows consumers to request a “snapshot” version of a topic that contains only the latest message published for each key (rather than every message ever published with that key). A typical use case that can benefit from compaction is rebuilding the state of a key-value store from scratch using the data stored in the topic. Bootstrapping a Redis cache with the latest values and then keeping that cache up to date with all new messages, for example, would greatly benefit from compacted topics, as the consumer would need to rewind only through the “relevant” messages and could skip the “outdated” ones.

Topic compaction separates this “snapshot” version of a topic from the regular stream of messages, using BookKeeper ledgers to store the compacted snapshot. This is a distinctive feature of Pulsar, and it enables Pulsar to support two different types of consumers simultaneously:

One set of consumers that wants to receive the snapshot and subsequent updates in the topic (or stream). These consumers can read from the compacted topic.

Another set of consumers that wants to receive every message published on the topic. These consumers can read from the non-compacted topic.
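The difference between the two views can be illustrated with a small sketch. This models the semantics of compaction only (keep the latest value per key), not how Pulsar implements it; the `compact` helper and the sample keys are hypothetical.

```python
def compact(messages):
    """Return only the most recent (key, value) pair for each key.

    `messages` is an iterable of (key, value) tuples in publish order.
    Each surviving entry keeps the position of that key's latest message.
    """
    latest = {}
    for key, value in messages:
        # Drop any older entry so the key moves to its newest position.
        latest.pop(key, None)
        latest[key] = value
    return list(latest.items())


if __name__ == "__main__":
    topic = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
    snapshot = compact(topic)   # what a compacted read would start from
    full_stream = list(topic)   # what a non-compacted read still sees
    print(snapshot)
    print(full_stream)
```

A consumer rebuilding a cache would apply the snapshot and then continue with new messages, while an auditing consumer would read the full stream unchanged.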

Performance improvements

A lot of emphasis has been placed on ensuring that Pulsar (and BookKeeper) can take maximum advantage of hardware resources across multiple configurations and even under a wide range of traffic patterns.

We have done a lot of work to remove obstacles on the data path, including removing inter-thread contention, to ensure that Pulsar can achieve high throughput under different conditions (single partition or many partitions), even when the message payloads are very small and batching cannot be applied.

One example of a significant change is replacing the Prometheus latency metrics collector library due to its impact on allocation rate and mutex contention. In its place, we’ve implemented a custom collector that exports data in Prometheus format and collects the same latency metrics but with zero allocations and no contention between threads.
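As a rough illustration of the approach, the sketch below records latencies into counters that are allocated once up front and renders them in the Prometheus text exposition format. It is not Pulsar's collector: the class name, the metric name, the bucket bounds, and the choice of a histogram layout are all assumptions, and Python cannot demonstrate the zero-allocation or contention-free properties of the real implementation.

```python
class LatencyHistogram:
    """Fixed-bucket latency recorder with preallocated counters (illustrative)."""

    def __init__(self, name, bounds_ms):
        self.name = name
        self.bounds_ms = sorted(bounds_ms)
        # One counter per bucket plus one overflow slot, allocated once.
        self.counts = [0] * (len(self.bounds_ms) + 1)
        self.total_ms = 0.0

    def record(self, latency_ms):
        # Find the first bucket whose upper bound covers the sample;
        # samples above every bound land in the overflow slot.
        index = len(self.bounds_ms)
        for i, bound in enumerate(self.bounds_ms):
            if latency_ms <= bound:
                index = i
                break
        self.counts[index] += 1
        self.total_ms += latency_ms

    def export(self):
        """Render cumulative buckets in the Prometheus text format."""
        lines = ["# TYPE %s histogram" % self.name]
        running = 0
        for bound, count in zip(self.bounds_ms, self.counts):
            running += count
            lines.append('%s_bucket{le="%s"} %d' % (self.name, bound, running))
        running += self.counts[-1]
        lines.append('%s_bucket{le="+Inf"} %d' % (self.name, running))
        lines.append("%s_sum %s" % (self.name, self.total_ms))
        lines.append("%s_count %d" % (self.name, running))
        return "\n".join(lines)


if __name__ == "__main__":
    hist = LatencyHistogram("publish_latency_ms", [1, 5, 10])
    for sample in (0.5, 3, 3, 20):
        hist.record(sample)
    print(hist.export())
```

Recording a sample touches only existing counters, which is the property that keeps the hot path cheap; a multi-threaded version would typically keep one such recorder per thread and merge them at export time.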

Apache BookKeeper 4.7

Pulsar uses BookKeeper as its distributed storage system for message data. Prior to 2.0, Pulsar used a version of BookKeeper based on release 4.3.1 with numerous modifications from Yahoo, several of them geared toward improving performance. In the past year, we made a huge effort to merge all of these changes back into mainstream BookKeeper so that Pulsar now depends on the latest mainstream BookKeeper (version 4.7).

With this BookKeeper upgrade, Pulsar is much better poised to immediately take advantage of new improvements and exciting new features being worked on by the BookKeeper community.

Compatibility

Even though Pulsar 2.0 is a major version release, we have made sure that both forward and backward compatibility between clients and brokers are always respected.

Not only is it possible to do a live upgrade from Pulsar 1.x to 2.0 with no downtime, but it is also possible to downgrade from 2.0 to 1.x as a safety measure.

When deploying a release over multiple geo-replicated clusters, it’s critical to be able to have both:

1.x clients talking to 2.0 brokers

2.0 clients talking to 1.x brokers

This is something that was established in Pulsar’s protocol from day one, ensuring that any client version can successfully operate with newer brokers without requiring a forced upgrade in client applications.

Conclusion

Apache Pulsar 2.0 is a significant milestone, bearing lots of great new functionality that addresses numerous drawbacks in existing messaging systems. In addition to the aforementioned new features and improvements, we believe that the Pulsar 2.0 release lays the foundation for even more new functionality in subsequent releases. We look forward to seeing Pulsar develop further, bringing production-ready, best-of-breed messaging to unprecedented heights.
