Ruby Guilds are the new way concurrency will be handled in Ruby 3. There’s still a long way to go until we reach that point, but I believe that we can already start implementing some of the concepts that will make our lives easier when we reach Ruby 3.

Note: this is not an article explaining what are Guilds and how do they work. You can read excellent explanations on that here:

Note 2: everything here is based on some assumptions which means that in few years the concept might look totally different. However it doesn’t mean that you shouldn’t use good practices or some recommendations that I describe below.

Guilds basic concepts in a TL;DR version

Guilds have at least one thread (and a thread have a fiber)

Threads in different guilds can run in parallel

Threads in a same guild can not run in parallel because of GIL/GVL/GGL (Giant Guild Lock)

A guild can’t access the objects of other guilds

Guilds are allowed to communicate with each other using channels (Guild::Channel)

Objects can be copied between guilds (deep copy)

Objects can be moved between guilds

Immutable objects (deeply immutable) can be shared between guilds

It sounds simple (and Koichi said that the initial concept implementation has only 400 lines), but it creates many problems that will have to be solved. I will try to cover some of them as they might have an impact on the overall performance of our code.

Don’t try to unlearn locking and multi threading Ruby 2 approach

TL;DR: you will still have to know how to deal with multi threading and it’s problems the way it is handled in Ruby 2.

GVL is insufficient to guard against data races on Ruby2 and this won’t change inside single guild with multiple threads. Since Ruby core team aims to make Ruby 3 compatible with Ruby 2 software (so the community won’t split with incompatible Ruby versions), any Ruby 2 software will run in Ruby 3 in a single Guild. So all your synchronization and locking problems won’t go away without an effort.

Scaling with guilds won’t be linear so don’t think it will solve all your problems

Guilds won’t be silver bullets. They will give Ruby programmers a new, great set of tools but they will for sure create some problems. If you hope that memory usage will drastically go down and that performance will go up, without you doing anything you might be really disappointed when new Ruby appears.

Objects owners mean more checking

TL;DR: the less you share the smaller the transferring overhead will be.

Objects will have guild owners. It means that Ruby will ahve to hve references to which guild an object belongs. One of the slides from Koichi’s presentation stathes that an exception will be thrown when trying to access object from other guild. It means that Ruby will have to have some sort of checker that will run either on:

every object access

every object access for objects that were transfered at least once (flag or something?)

every object access of an object that is not frozen and references in other guilds

Either way, there will be way more access checking. Ruby already checks the class of each object on it’s access, so maybe this could be combined together.

What that means for us? The less objects we will share between guilds the faster they will run.

Transferring ownership – moving references vs copying

TL;DR: It might be faster (and for sure safer) to start using immutable structures if you plan to transfer a lot of data in between guilds.

Programmers will definitely want to transfer ownership not only for simple objects but also for more complex (and big) data structures. It means that Ruby not only will have to move main objects but all subobjects (arrays of arrays of objects, etc).

And here a question emerges: wouldn’t it be better to just copy the whole structure instead of updating all the references? Is there even a programming language that has a GC and allows moving mutable objects directly (without deep copying) between threads?

Method cache will remain global

TL;DR: OpenStruct will be a worse idea than it is right now. Try even harder not to invalidate method cache too much.

When we redefine a method (or add new), method lookup will have to be performed an cache invalidation needs to occur on all the guilds. This means that we will have to stop execution of all the guilds at once (since there shouldn’t be a case when one runs on an old method version and another already uses new one).

Operations that redefine things will be either impossible or slower. It will impact also access (mutable constants access between guilds).

Summary

There isn’t much to summarize since for each section there’s a TL;DR but it’s worth pointing out that we need to be more cautions about sharing our data and about doing a lot of meta-programming beyond good practices (like redefining dynamically built constants) and everything should be fine.

Note: For the case of simplicity I skipped some of the corner cases to make this article less complicated and more understandable.

When working with multi threaded and multi process systems that consume messages in parallel, you need to be able to enforce some limitations on the processing order.

Most of the time, in well designed systems, things should be based on atomic (in terms of not being dependent on any other worker job) jobs that can run whenever they need to. However, in some cases you need to make sure that jobs related to a given resource are not executed at the same time. This can be achieved with Karafka and Sidekiq in 3 ways:

Standalone Karafka mode – single Karafka process will consume messages one by one and process them inline (without using Sidekiq at all). Since Kafka gives us a per partition ordering, same applies to standalone Karafka mode.

Karafka default mode + single Sidekiq worker with a single thread. Since single threaded Sidekiq process uses FIFO, this can be achieved with this combination.

Karafka + Sidekiq + SidekiqUniqueJobs gem. This combination allows us to build complex multi threaded and multi process workers that will ensure that a single resource is being modified by maximum 1 worker at the same time (while executing uniqueness).

In this article I will focus on the last option, as it is the best in terms of performance and flexibility.

A bit about state machines

Having a single resource on which many actions can be performed is always a problem. To ensure (on an object level) that we won’t process things in parallel we can use state machines (aasm is a great gem for that) to achieve locking. However, if we only do that, we will have to implement a bit more logic when considering a possibility of many jobs doing same stuff at the same time:

Detecting and waiting if someone else is already using (whatever that means) a given resource.

Retrying in case of state machine failure (invalid transitions).

But hey! We use Sidekiq! It means that retrying is already built in and it will restart the job later on. It also means, that at some point, the state machine lock will be removed and we will be able to finish all the jobs.

That is true, however we should not rely on that because a case like that shouldn’t be considered an exception. Instead we should prevent situations like that from happening at all.

Problem definition

Let’s say we have a RemoteResource representation that needs to be refreshed periodically (every 5 seconds) and we need to make sure that we never overwrite it’s content with older data.

class RemoteResource < ActiveRecord::Base
include AASM aasm do
state :initial, initial: true
state :running
event :run do
transitions :from => :initial, :to => :running
end
event :finish do
transitions :from => :running, :to => :initial
end
end
def refresh!
# Some external API calls and other business stuff
# @note I know that this shoould be in a service but
# again, lets keep things simple
update!(content: RemoteData.fetch(id))
end
end

Our processing (including external API call) does not take more time than 5 seconds

Sidekiq workers are up and running (at least matching the speed of enqueuing)

But lets kill all the workers for 1 minute and run them again. We end up with a queue full of messages that will be processed at the same time. And then we might have jobs for the same object enqueued one after another, which means that they will be executed at the same time:

In a case like that, both workers will overlap processing the same resource. It means that we might end up with outdated data (network delay for an older resource) or a state machine exception. In more complex cases it could mean corrupted data.

Since Karafka passes messages into Sidekiq, this problem can occur there as well. The worse part is that it can get much worse when we perform many types of tasks on a single resource and they should never overlap.

Solution: SidekiqUniqueJobs

SidekiqUniqueJobs is a great gem that solves exactly this issue. It makes sure (among other options) that for a given set of arguments, only a single job can run at the same time. In our case unique arguments would be:

Worker name

RemoteResource#id

This is enough to competeour task! Jobs for different resources will be processed at the same time, but no more parallel processing on the same RemoteResource.

Implementation in Karafka application

To use this gem with Karafka application, you need to build a uniqueness scope before a job is enqueued. This can be done inside the #before_enqueue block:

Note, that we use SecureRandom.uuid as a fallback for jobs for which we don’t want to make uniqueness execution. Random uuid will ensure that all the jobs without a uniqueness_scope can perform at the same time. If we would leave nil, it would be treated as any other value so no jobs would run in parallel.