Note: These release notes cover only the major changes. To learn about various bug fixes and changes, please refer to the change logs or check out the list of commits in the main Karafka repository on GitHub.

TL;DR

Note: Changes above don’t include Zeitwerk setup for your non-Rails projects. See this commit for details on how to replace Karafka::Loader with Zeitwerk.

Note: If you use the Sidekiq backend, keep in mind that before upgrading, you need to consume all of the messages that are already in Redis.

Note: This release is the last release with ruby-kafka under the hood. We’ve already started the process of moving to rdkafka-ruby.

Changes (features, incompatibilities, etc)

Auto-reload of code changes in development

Until now, you had to restart the Karafka process to see your code changes. That was cumbersome: for bigger and more complex Kafka clusters, a restart with reconnections and rebalancing could take a significant amount of time. Fortunately, those times are gone!

All you need to do is enable this part of the code before App.boot in your karafka.rb file:

# For non-Rails app with Zeitwerk loader
if Karafka::App.env.development?
  Karafka.monitor.subscribe(
    Karafka::CodeReloader.new(
      APP_LOADER
    )
  )
end

# Or for Ruby on Rails
if Karafka::App.env.development?
  Karafka.monitor.subscribe(
    Karafka::CodeReloader.new(
      *Rails.application.reloaders
    )
  )
end

and your code changes will be applied after each fetch of a message or a batch of messages.

Keep in mind, though, that it has a few limitations:

– Changes in the routing are NOT reflected. This would require reconnections and would drastically complicate reloading.
– Any background work that you run outside of the Karafka framework but still within the process might not be caught by the reloading.
– If you use in-memory consumer data buffering that spans multiple batches (or multiple messages in single message fetch mode), it WON’T work, as a code reload means re-initializing all of the consumer instances. In cases like that, you are better off not using the reload mode at all.

It is also worth pointing out that if you have code that should be re-initialized in any way during the reload phase, you can pass it to the Karafka::CodeReloader initializer:
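As a rough illustration of the idea, the sketch below uses a toy reloader that accepts reloader objects plus a block with extra re-initialization logic. SketchCodeReloader, RecordingReloader, and the on_fetch hook are hypothetical names made up for this example; they are not the real Karafka::CodeReloader and its internals may differ.

```ruby
# Toy stand-in for a code reloader; NOT the real Karafka::CodeReloader.
class SketchCodeReloader
  # @param reloaders [Array<#execute>] objects that know how to reload code
  # @param block [Proc] optional extra re-initialization logic
  def initialize(*reloaders, &block)
    @reloaders = reloaders
    @block = block
  end

  # Hypothetical hook that a framework would call before each fetch
  def on_fetch(_event = nil)
    @reloaders.each(&:execute)
    @block&.call
  end
end

# Hypothetical reloader that only records how many times it ran
class RecordingReloader
  attr_reader :runs

  def initialize
    @runs = 0
  end

  def execute
    @runs += 1
  end
end

code_reloader = RecordingReloader.new
cache = { users: [1, 2, 3] }

reloader = SketchCodeReloader.new(code_reloader) do
  # Anything that must be rebuilt after a reload goes here
  cache.clear
end

# Simulate one fetch: the code is reloaded and the block is executed
reloader.on_fetch
```

The block is the place for state that lives outside of the reloaded classes, such as caches or registries that hold references to old class definitions.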

Parsers are now Deserializers in the routing and accept the whole Karafka::Params::Params object

Parsers, as a concept responsible for both serialization and deserialization of data, violated the single responsibility principle (SRP, see details here). From now on, serializers and deserializers are separate entities that you can use independently. To upgrade, just rename parser to deserializer for each topic you’re using in the routes:

def consume
  params_batch.each do |params|
    puts "Hello #{params['login']}!\n"
  end
end

Karafka used to merge your data directly into the root scope of the Karafka::Params::Params object. That was convenient, but not flexible enough. Some metadata details in the root params scope could get overwritten, and if you sent something other than a JSON hash, let’s say an array, you would get an exception and would have to use a custom parser to work around it (see this FAQ question).

Because of that, and to better separate your incoming data from the rest of the message details (headers, metadata information, etc.), from now on all of your data will be available under the payload params key:
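The separation described above can be sketched with plain Ruby. In this minimal example a regular Hash plays the role of the params object, and SketchJsonDeserializer is a hypothetical deserializer; the assumption that the raw data sits under the 'payload' key before deserialization is made for illustration only.

```ruby
require 'json'

# Stand-in deserializer: an object responding to #call that receives the
# whole params object. A plain Hash plays the role of the params here.
class SketchJsonDeserializer
  def call(params)
    JSON.parse(params['payload'])
  end
end

# Hypothetical params: metadata in the root scope, raw data under 'payload'
params = {
  'topic' => 'users',
  'partition' => 0,
  'payload' => '[{"login":"alice"},{"login":"bob"}]'
}

# Replace the raw payload with the deserialized one
params['payload'] = SketchJsonDeserializer.new.call(params)

# The data is an array, yet it does not clash with the metadata keys
logins = params['payload'].map { |user| user['login'] }
puts "Hello #{logins.join(', ')}!"
```

Because the data lives under its own key, a non-hash payload such as an array can no longer overwrite or conflict with the metadata in the root scope.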

Usage

Once included in your RSpec setup, this library provides two methods that you can use in your specs:

– #karafka_consumer_for – this method will create a consumer instance for the desired topic. It needs to be set as the spec subject.
– #publish_for_karafka – this method will “send” a message to the consumer instance.

Note: Messages sent using the `#publish_for_karafka` method won’t be sent to Kafka. They will be “virtually” delegated to the created consumer instance so your specs can run without Kafka setup.

A new instrumentation listener called Karafka::Instrumentation::ProctitleListener has been added. Its purpose is to give the process a nicer, more descriptive title. To use it, put the following line in your karafka.rb boot file:
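To show what such a listener does conceptually, here is a minimal sketch built on Ruby’s Process.setproctitle. SketchProctitleListener, its hook names, and the title format are assumptions made for this example and do not describe the real listener’s implementation.

```ruby
# Hypothetical listener: builds a descriptive process title and applies it.
class SketchProctitleListener
  def initialize(app_name)
    @app_name = app_name
  end

  # Hypothetical hook names; the real listener's events may differ
  def on_initializing(_event = nil)
    apply_title('initializing')
  end

  def on_running(_event = nil)
    apply_title('running')
  end

  private

  # Sets the title visible in tools like ps and returns it
  def apply_title(status)
    title = "karafka #{@app_name} (#{status})"
    Process.setproctitle(title)
    title
  end
end

listener = SketchProctitleListener.new('example_app')
listener.on_initializing
title = listener.on_running
```

With a title like this, a quick look at the process list tells you which application a process belongs to and what state it is in.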

Single consumer class supports more than one topic

From now on, you can use the same consumer class for multiple topics:

App.consumer_groups.draw do
  consumer_group :default do
    topic :users do
      consumer UsersConsumer
    end

    topic :admins do
      consumer UsersConsumer
    end
  end
end

Note: you will still have a separate consumer instance for each topic partition.

Delayed re-connection upon critical failures

If a critical failure occurs (network disconnection or anything similar), Karafka will back off and wait for reconnect_timeout (defaults to 10s) before attempting to reconnect. This should prevent you from being flooded with errors and logs when serious problems occur.
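The back-off behavior can be sketched roughly as follows. SketchConnectionError, with_reconnect, max_attempts, and the injectable sleeper are all hypothetical and exist only for this example; only the reconnect_timeout name and its 10 second default come from the text above.

```ruby
# Hypothetical error class standing in for a critical connection failure
class SketchConnectionError < StandardError; end

# Runs the given block and, on a critical failure, waits reconnect_timeout
# seconds before trying again. Gives up after max_attempts tries
# (the attempt limit is an addition made for this sketch).
def with_reconnect(reconnect_timeout:, max_attempts:, sleeper: ->(s) { sleep(s) })
  attempts = 0
  begin
    attempts += 1
    yield(attempts)
  rescue SketchConnectionError
    raise if attempts >= max_attempts

    sleeper.call(reconnect_timeout)
    retry
  end
end

waits = []
result = with_reconnect(
  reconnect_timeout: 10,
  max_attempts: 5,
  sleeper: ->(seconds) { waits << seconds } # record the wait instead of sleeping
) do |attempt|
  # Simulate two failed connection attempts followed by a success
  raise SketchConnectionError, 'broker unreachable' if attempt < 3

  "connected on attempt #{attempt}"
end
```

Waiting between attempts means a broken connection produces one error per timeout period instead of a tight loop of failures.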

Support for Kafka 0.10 dropped in favor of native support for Kafka 0.11

Support for Kafka 0.10 has been dropped. Weird things may happen if you decide to use Kafka 0.10 with Karafka 1.3 so just upgrade.

Reorganized responders – multiple_usage constraint no longer available

multiple_usage has been removed. Responders won’t raise any exception if you decide to send multiple messages to the same topic without declaring that. This feature was a bad idea and was creating a lot of trouble when using responders in long-running, batch-like flows.

The following code would raise a Karafka::Errors::InvalidResponderUsageError error in Karafka 1.2 but will continue to run in Karafka 1.3:

While Karafka is processing, ruby-kafka prebuffers more data under the hood in a separate thread. If you have a big consumer lag, this can cause your Karafka process to prebuffer hundreds of megabytes of data or more upfront. Lowering the queue size makes Karafka more predictable by default.

Documentation

Our Wiki has been updated to reflect the 1.3 release. Please notify us if you find any incompatibilities.

Getting started with Karafka

If you want to get started with Kafka and Karafka as fast as possible, then the best idea is to just clone our example repository:

Note: These release notes cover only the major changes. To learn about various bug fixes and changes, please refer to the change logs or check out the list of commits in the main Karafka repository on GitHub.

Note: The 1.2 release is the last release that will require ActiveSupport to work.

Code quality

I will start with the same thing as with 1.1. We’re constantly working on having a better and easier code base. Despite many changes to our code-base stack, we were able to maintain a pretty decent offenses distribution and trends.

It’s worth pointing out that we now use many components of the Dry-Rb ecosystem more extensively, and we love it!

Performance

This release brings significant performance improvements allowing you to consume around 40-50k messages per second per single topic. We could do a bit more (around 5-10%) by using symbols as defaults for metadata params key names, but this would bring up a lot of complexity and confusion since JSON parsing returns string keys. It would also introduce some problematic incompatibilities when using additional backend engines that serialize the whole params_batch and deserialize it back.

Karafka is a complex piece of software and benchmarking it can be tricky. There are many use cases that need to be considered: some single-threaded, some multi-threaded, some that reject data without parsing it, and some that require interactions between multiple threads. That’s why it is really hard to design a single benchmark that compares multiple Kafka + Ruby frameworks in a fair way.

We’ve decided not to go that way, but rather to compare new releases with the previous ones. Here are the results of running the same logic with 1.1 and 1.2 multiple times (the more the better):

For some edge cases, Karafka 1.2 can be up to 3x faster than 1.1.

If you are looking for some cross-framework benchmark results, they are available here.

Features

Controllers are now Consumers

Initial versions of Karafka were built around the idea that we could ignore the transportation layer when working with data. Regardless of whether it was an HTTP request, a Kafka message, or anything else, as long as the data was in a compatible format, we should not have to adapt our business logic to it.

That was the primary reason why, prior to Karafka 1.2, you would put logic in controllers that inherited from ApplicationController or KarafkaController. And this was a mistake.

More and more companies use Karafka within a typical Ruby on Rails stack in which controllers are meant to be Rails controllers. Less experienced developers who encounter Karafka controllers within the Rails app/controllers namespace would often end up trying to use some Rails controller API specific magic without realizing that they are within a Karafka controller scope. To eliminate this problem and to match Kafka naming conventions, the processing units that are responsible for feeding you with Kafka data are being renamed to Consumers and from now on, there are no controllers in the Karafka ecosystem.

New instrumentation engine using Dry-Monitor

Note: Dry-Monitor usage requires a separate article. Here’s just a brief summary of what we did with it.

The old Karafka monitor was too magical. It would auto-detect the context in which it was invoked, automatically build notification scopes, and do a lot of other things. This was really cool, but it was:

– Slow
– Hard to maintain
– Bug sensitive
– Code change sensitive
– Not isolated from the rest of the system
– Hard to use with custom tools like NewRelic or Airbrake
– Limited when it comes to instrumenting with multiple tools at the same time
– Too custom to be easily replaced

We are proud to announce that from now on, Dry-Monitor is the instrumentation backbone of the whole Karafka ecosystem. Here’s a simple example of what you can achieve using it:

and to be honest, possibilities are endless. From simple logging, through in-production performance monitoring up to multi-target complex instrumentation. Please refer to the Monitoring and logging section of Karafka Wiki for more details.
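The general publish/subscribe idea behind this kind of instrumentation can be sketched with a toy monitor. SketchMonitor and the 'params.parse' event name are hypothetical and only illustrate the concept; this is not Dry-Monitor, which has its own API and requires events to be registered.

```ruby
# Toy publish/subscribe monitor; NOT Dry-Monitor, just the same general idea.
class SketchMonitor
  def initialize
    @subscribers = Hash.new { |hash, key| hash[key] = [] }
  end

  # Registers a block to be notified about the given event
  def subscribe(event_name, &block)
    @subscribers[event_name] << block
  end

  # Runs the block, measures it, and notifies every subscriber
  def instrument(event_name, payload = {})
    started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
    @subscribers[event_name].each do |subscriber|
      subscriber.call(payload.merge(time: elapsed))
    end
    result
  end
end

monitor = SketchMonitor.new
log = []

# Two independent tools can listen to the same event
monitor.subscribe('params.parse') { |event| log << "parsed #{event[:topic]}" }
monitor.subscribe('params.parse') { |event| log << format('took %.6fs', event[:time]) }

parsed = monitor.instrument('params.parse', topic: 'users') do
  { 'login' => 'alice' }
end
```

Because subscribers are decoupled from the instrumented code, a logger and a performance tracker can observe the same event without knowing about each other.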

Dynamic Karafka::Params::Params parent class

Karafka is designed to handle a lot of messages. Each incoming message is wrapped in a lazily evaluated hash-like object. Prior to 1.2, each params object was built on top of ActiveSupport::HashWithIndifferentAccess. Truth be told, it is not the fastest library ever (benchmark details here), especially when compared to a PORO Hash:

Now imagine that in some cases, we create 50 000 objects like that per second. This had to have a serious impact on the framework’s performance. As always, there needs to be a trade-off. Should we go with a Hash in the name of performance or should we use HashWithIndifferentAccess for the sake of “simplicity”? We will let you choose whatever you find more suitable.

For that reason, we’ve provided a config params_base_class setting that you can use to set up the base params class from which Karafka::Params::Params will inherit. By default, it is a plain Hash.

Keep in mind that you can use other base classes, for example a concurrent hash, to your advantage. The only requirement is that the class has the same API as a Ruby Hash.
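One way to picture a class with a configurable parent is Ruby’s Class.new, which accepts the superclass as an argument. In the sketch below, SketchConfig, build_params_class, and StringKeyHash are hypothetical names; only the params_base_class setting name comes from the text, and this is not how Karafka necessarily builds its params class.

```ruby
# Hypothetical config object; only the setting name comes from the release notes
SketchConfig = Struct.new(:params_base_class)

# Builds a params class whose parent is decided at boot time
def build_params_class(config)
  Class.new(config.params_base_class) do
    # Example helper available regardless of the chosen parent
    def topic
      self['topic']
    end
  end
end

# A Hash subclass with a different key policy, standing in for something like
# HashWithIndifferentAccess (which is NOT reimplemented here)
class StringKeyHash < Hash
  def [](key)
    super(key.to_s)
  end

  def []=(key, value)
    super(key.to_s, value)
  end
end

plain_class = build_params_class(SketchConfig.new(Hash))
flexible_class = build_params_class(SketchConfig.new(StringKeyHash))

plain = plain_class.new
plain['topic'] = 'users'

flexible = flexible_class.new
flexible[:topic] = 'users'
```

With the plain Hash parent, only the exact key type used on write works on read; with the custom parent, symbol and string keys resolve to the same entry at the cost of extra work on every access.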

System callbacks reorganization with multiple callbacks support

Note: This will be unified into one set of events that you will be able to hook into in 1.3 using Dry-Events.

Because some things in Karafka happen outside of the consumer scope, there are two types of callbacks available:

– Lifecycle callbacks – callbacks that are triggered during various moments in the Karafka framework lifecycle. They can be used to configure additional software dependent on Karafka settings or to do one-time stuff that needs to happen before consumers are created.
– Consumer callbacks – callbacks that are triggered during various stages of messages flow

You can read more about them and how to use them in the Callbacks wiki section.

This new callback will be executed once per consumer group per process before we start receiving messages. This is a great place to use Kafka’s #seek functionality if you need to reprocess already fetched messages.

Note: Keep in mind, that this is a per process configuration (not per consumer) so you need to check if a provided consumer_group (if you use multiple) is the one you want to seek against.

class App < Karafka::App
  # Setup and other things...

  # Moves the offset back to message 100, so we can reprocess messages again
  # @note If you use multiple consumer groups, make sure you execute #seek on a client of
  #   a proper consumer group, not on all of them
  before_fetch_loop do |consumer_group, client|
    topic = 'my_topic'
    partition = 0
    offset = 100

    if consumer_group.topics.map(&:name).include?(topic)
      client.seek(topic, partition, offset)
    end
  end
end

Rewritten NewRelic client

Thanks to NewRelic’s kindness, we were able to rewrite the whole listener, which can now collect various information about the Karafka data flow. It is super easy to use and extend. You can find it in the Monitoring and Logging wiki section.

All metadata keys are strings by default

Since now the default params class is a Hash, we had to pick either symbols or strings as key names for all the metadata attributes. We’ve decided to go with strings as they are more serialization friendly and cooperate with various backends used with Karafka.

Note: If you use HashWithIndifferentAccess, nothing really changes for you.

JSON parsing defaults now to string keys

Since there is no indifferent access by default, lazy parsing of the JSON Kafka data will default to string keys that will be merged into the params object. If you’re not planning to use HashWithIndifferentAccess, make sure that your code base is ready for this change.
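The practical effect of string keys can be checked with Ruby’s standard JSON library alone. The sample document below is made up for illustration.

```ruby
require 'json'

raw = '{"login":"alice","roles":["admin"]}'

# JSON.parse returns string keys by default
data = JSON.parse(raw)
by_string = data['login'] # => "alice"
by_symbol = data[:login]  # => nil, a plain Hash has no indifferent access

# Symbol keys are possible, but only when requested explicitly
symbolized = JSON.parse(raw, symbolize_names: true)
by_symbol_explicit = symbolized[:login] # => "alice"
```

Code that previously relied on symbol access through HashWithIndifferentAccess will silently get nil with a plain Hash, so it is worth searching the code base for symbol lookups on params before upgrading.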

Configurators removed in favor of the after_init block configuration

For additional setup and/or configuration tasks you can create custom configurators. Similar to Rails these are added to a config/initializers directory and run after app initialization.

Due to the changed lifecycle of the Karafka process, more things are being built dynamically upon boot. This means that in order to run initializers reliably, we would have to control the load order in a more granular way. That’s why this functionality has been replaced with an after_init callback declaration:

class App < Karafka::App
  # Setup and other things...

  # Once everything is loaded and done, assign Karafka app logger as a Sidekiq logger
  # @note This example does not use config details, but you can use all the config values
  #   to setup your external components
  after_init do |_config|
    Sidekiq::Logging.logger = Karafka::App.logger
  end
end

Note: you can have as many callbacks of any type as you want. They can also be objects, as long as they respond to a #call method.
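The idea that a callback can be either a block or any object responding to #call can be sketched with a toy registry. SketchCallbacks, AnnounceBoot, and the way callbacks are registered here are hypothetical; they illustrate the duck typing only and are not Karafka’s actual callbacks DSL.

```ruby
# Toy registry that accepts blocks and any object responding to #call
class SketchCallbacks
  def initialize
    @after_init = []
  end

  # Registers either a callable object or a block
  def after_init(callable = nil, &block)
    @after_init << (callable || block)
  end

  # Invokes every registered callback in registration order
  def run_after_init(config)
    @after_init.each { |callback| callback.call(config) }
  end
end

# Object-style callback: anything with a #call method works
class AnnounceBoot
  def initialize(output)
    @output = output
  end

  def call(config)
    @output << "booted #{config[:client_id]}"
  end
end

output = []
callbacks = SketchCallbacks.new
callbacks.after_init { |config| output << "logger level: #{config[:log_level]}" }
callbacks.after_init(AnnounceBoot.new(output))
callbacks.run_after_init(client_id: 'example_app', log_level: :info)
```

Object-style callbacks are handy when the setup logic needs its own state or deserves its own unit tests.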

Karafka ecosystem gems versioning convention

Karafka is composed of several independent libraries. The most important are:

Karafka – The main gem that is used to build Karafka applications that consume messages

Some Karafka users had problems using mismatched versions of those gems. From now on, they all will be released in sync up to the second version point. It means that if you decide to use Karafka 1.2 with other ecosystem libraries, you should match them to 1.2.* as well.

Note: This should be resolved automatically as we locked all the proper versions within gemspec, but still worth mentioning.
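To see what matching versions up to the second version point means in practice, you can check a pessimistic version constraint with the classes that ship with RubyGems. The version numbers below are examples chosen for illustration.

```ruby
require 'rubygems'

# '~> 1.2.0' accepts any 1.2.x release but rejects 1.3.0 and above
requirement = Gem::Requirement.new('~> 1.2.0')

%w[1.2.0 1.2.5 1.3.0].each do |version|
  matches = requirement.satisfied_by?(Gem::Version.new(version))
  puts "#{version}: #{matches ? 'in sync with 1.2' : 'mismatched'}"
end
```

The same constraint string can be used in a Gemfile for each ecosystem gem to keep them on the same minor release line.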

Documentation

Our Wiki has been updated to reflect the 1.2 release. You may want to look at the rewritten Monitoring and logging section and the new Testing guide that illustrates how you can test various Karafka ecosystem components.

Default monitor and logger update

Please refer to the Monitoring and logging Wiki section for details of the way both of those things work now. If you used the default monitoring and logging without any customization, all you need to do is add this to your karafka.rb file after the setup part:

Karafka.monitor.subscribe(Karafka::Instrumentation::Listener)

NewRelic client update

If you use our NewRelic example client, please take a look at the new one and upgrade accordingly.

Callbacks rename

class ExamplesConsumer < Karafka::BaseConsumer
  include Karafka::Consumers::Callbacks

  # Rename this
  after_fetched do
    # Some logic here
  end

  # To this
  after_fetch do
    # Some logic here
  end
end

Karafka params received_at renamed to receive_time

Again, just a name change: if you use ‘received_at’ params timestamp, you’ll enjoy it under the ‘receive_time’ key.

Getting started with Karafka

If you want to get started with Kafka and Karafka as fast as possible, then the best idea is to just clone our example repository: