@kuujo @jhalterman thank you very much for the detailed info! I'm looking forward to experimenting more with it. For some background: I work at Elastic and am looking to see how we can use Atomix with Logstash to build a clustered version of it.

@electrical that's awesome to hear! We're seeing a ton of interest from a lot of awesome projects, and that's exactly why all effort is focused on stability for the foreseeable future.

The status of Copycat/Atomix is this: over the last six months, other contributors and I have been putting a ton of effort into getting the algorithms correct. Copycat implements every part of the Raft algorithm and more (I have a partially written paper on the "more"). I spent a lot of time with @madjam evaluating the log compaction, configuration change, and session event algorithms in particular, and we're confident they're correct (you can see some of these discussions in old GH issues). Atomix has been tested in Jepsen for linearizability via the DistributedValue resource (doing cas operations). @jhalterman is currently working on integrating configuration changes into the Jepsen tests (arbitrarily adding and removing nodes while writing to the cluster and compacting the logs).

So, that's all to say that until about a week ago, when I finished support for snapshots, all the effort went into algorithms, features, and handling various fault tolerance issues. But snapshots represented the end of that work, at least in Copycat. All we're doing now is testing Copycat/Atomix as much as possible, and it has been, and I expect will continue to become, more stable every day. We'll push a release candidate once Jepsen testing with configuration changes is complete, and a full release if there are no issues with the release candidate. But Atomix will continue to be informed by use cases; Copycat/Atomix makes it incredibly easy to add new tools like a work queue.

I'm finishing up the work queue now BTW... I fell asleep last night :-P

Good question. The need for a DistributedWorkQueue in Atomix is to have something that will push work to clients rather than requiring them to poll, and to guarantee that items are not lost if a node crashes. So, what the work queue state machine does is: when a task is committed and applied to the state machine, it pushes the task to a worker (client) if one is available. Clients/replicas acknowledge a task and get the next task in a single commit. In that sense it will be much faster than polling a queue, but two commits are still required to process one task (put the task on the queue, acknowledge the task). We could probably make a much more efficient version that acknowledges tasks in batches, though. As for threading, in AtomixReplica the user always receives messages in a different thread than the server, and always in the same thread. The reason for using the same thread is that session events can be linearizable, so handling a message on the client has to be synchronous to maintain linearizability. But we could make a work queue that allows a single node to process a lot of tasks concurrently and ack them in batches, since the task queue could be asynchronous anyway... if that makes sense. I guess there are several ways to go about it, but the scalability limitations of majorities in Raft will be there at least for writes to the queue.
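To make the push-to-worker and ack-and-get-next semantics concrete, here's a minimal, in-memory sketch of the state machine logic described above. The class and method names are illustrative only, not the actual Copycat/Atomix API; it just shows how submitting a task pushes it to an idle worker immediately, and how an ack returns the next task in a single operation:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical, simplified sketch of the work queue state machine described
// above. Names are illustrative, not the real Copycat/Atomix API.
class WorkQueueStateMachine<T> {
    private final Queue<T> tasks = new ArrayDeque<>();
    private final Queue<Consumer<T>> idleWorkers = new ArrayDeque<>();

    // Commit #1: submit a task; push it straight to a worker if one is waiting.
    void submit(T task) {
        Consumer<T> worker = idleWorkers.poll();
        if (worker != null) {
            worker.accept(task);   // push to the client rather than waiting for a poll
        } else {
            tasks.add(task);
        }
    }

    // A worker registers itself; if a task is queued it is delivered at once,
    // otherwise the worker waits to be pushed to.
    void register(Consumer<T> worker) {
        T task = tasks.poll();
        if (task != null) {
            worker.accept(task);
        } else {
            idleWorkers.add(worker);
        }
    }

    // Commit #2: acknowledge the previous task and receive the next one in a
    // single operation, halving round trips versus separate ack + poll.
    void ackAndNext(Consumer<T> worker) {
        register(worker);
    }
}
```

In a real replicated state machine each of these calls would be a Raft commit, which is where the two-commits-per-task cost comes from.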

Ultimately, the next version of Atomix will have sharding by running multiple Raft instances to get around that limitation.

Copycat does allow batches of messages to be pushed to a client. But it all depends on use cases. Does work need to be processed in order? Can the work itself be partitioned to enforce order only within a partition (send one partition to one worker)? Can a single worker process many tasks concurrently? Does the process submitting the work need to know when it's done? Copycat can facilitate all of that, but there are a lot of potential options for different use cases.
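The per-partition ordering option above can be sketched in a few lines. This is not an Atomix resource, just a plain-Java illustration of the idea: route tasks into per-partition queues by key, so ordering holds within a partition while different partitions can be drained by different workers concurrently:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Illustrative sketch, not part of Atomix: tasks with the same key always
// land in the same partition, preserving order only within that partition.
class PartitionedQueue {
    private final List<Queue<String>> partitions = new ArrayList<>();

    PartitionedQueue(int count) {
        for (int i = 0; i < count; i++) partitions.add(new ArrayDeque<>());
    }

    void submit(String key, String task) {
        int p = Math.floorMod(key.hashCode(), partitions.size());
        partitions.get(p).add(task); // same key, same partition, same order
    }

    Queue<String> partition(int p) {
        return partitions.get(p); // one worker drains one partition in order
    }
}
```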

Ultimately, though, comparatively speaking, all Atomix resources (or the algorithms that back them) are designed to ensure that everything happens eventually, and that everything happens at a specific point in logical “time,” but not necessarily that anything happens fast or efficiently. That’s why they’re best for things like leader elections or, in the case of a work queue, submitting work that is infrequent but important.

Gotcha. I think the intended goal of Atomix is to provide the tools to build something like a higher-throughput work queue. In fact, I'm currently building a Kafka clone purely from Atomix resources. But as Kafka does not require a quorum to commit a write, doing something like this still requires a lot of work on top of Atomix. In the case of my Kafka clone, that means using a DistributedMembershipGroup to do cluster management, allow clients to discover servers, and allow a leader to communicate with followers; a DistributedLeaderElection to coordinate and assign topics and partitions; and a DistributedMessageBus to send messages from clients to servers and synchronously replicate between servers according to the topic's replication factor. I expect that in the future use cases will inform the development of more complex Atomix resources to make things like that easier, but for now Atomix resources will remain low-level enough to allow people to build things that we can perhaps turn into resources later.

I would say this: in general, the algorithms underlying Atomix are designed to facilitate a large number of resources and ensure everything happens exactly as and when it's intended, even across resources. But for high-throughput systems, those resources probably have to be used to coordinate other communication mechanisms. For example, you can use a DistributedMembershipGroup to track the set of nodes that are alive; a node will be removed from the group if it crashes. Then, on each node, use the DistributedMessageBus to receive messages. Clients would use the same DistributedMembershipGroup to determine where to send messages, and individual messages could be acknowledged by periodically updating a DistributedValue or DistributedLong. This, I think, could be one of the common patterns that's turned into an Atomix resource to allow for faster reliable messaging. But the problem with high throughput inside the Atomix cluster - in a state machine - is that everything is limited by memory right now, so messages can't back up too far. If the client is reading from a persistent store and can replay messages then I guess that's not an issue... we'd just have to implement some sort of back-pressure mechanism. Actually, I guess it wouldn't be too hard to create a resource that does that - faster reliable messaging from a persistent store - if the process pushing messages is reading from disk or something.
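The membership-tracking half of that pattern can be simulated in a few lines. This is a toy, in-memory stand-in (not the DistributedMembershipGroup API): a set of live members, with clients picking a target deterministically from the current membership and rerouting when a node leaves the group:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Toy in-memory simulation of the pattern above, not the Atomix API:
// a membership group tracks live nodes; clients consult it to pick a
// message target, rerouting when a node leaves (e.g. after a crash).
class MembershipGroup {
    private final Set<String> members = new LinkedHashSet<>();

    void join(String node)  { members.add(node); }
    void leave(String node) { members.remove(node); } // crash detected

    // Client-side routing: pick a live member deterministically for a key.
    String route(String key) {
        if (members.isEmpty()) throw new IllegalStateException("no live members");
        int i = Math.floorMod(key.hashCode(), members.size());
        return members.stream().skip(i).findFirst().orElseThrow();
    }
}
```

In the real pattern, the membership changes would be driven by the consensus-backed group so every client agrees on who is alive, while the messages themselves travel over the (non-consensus) message bus.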

But I'm digressing a bit :-P

But I definitely get where you're coming from. The problem is that in Raft everything still has to go through a single node, so it will most likely be slower than a non-distributed system. You're trading performance for reliability. But that reliability can be used to coordinate a system that doesn't require writes to go through a single node.

But really, I think what's ultimately needed here is sharding. Copycat's abstractions are intentionally designed to make sharding relatively easy to add, but it hasn't been done because stability is the more essential short-term goal, and there are some questions to be answered about how resources should be partitioned and balanced in Atomix.

I'll finish up the work queue and do some performance tests there. I'm pretty interested myself.

Here's how Atomix, and any consensus-based system, should be used in a high-throughput system: Atomix should ensure that messages are sent to the correct places at the correct times, but not necessarily handle the actual sending of messages. For instance, a leader election might be used to elect a leader to receive messages for several partitions, but the messages themselves don't go through Raft. What Atomix will guarantee is: if a leader crashes, a single new leader will be elected, no two leaders will exist simultaneously, and the leader can be discovered by a client. That may not sound like much, but it's essential to ensuring that the processes that do receive messages are fault tolerant (when one leader crashes a new one is elected) and that messages are always sent to the correct process (so they're not lost by sending them to the wrong one). The DistributedMessageBus is unreliable insofar as it's the sender's and receiver's responsibility to coordinate to ensure messages aren't lost. It would actually be fairly easy to provide some reliability in the sense that messages would be automatically resent if lost, and it could even be feasible to do in-memory replication of messages by coordinating the replication through Atomix (DistributedMessageBus basically uses something like a membership group internally to coordinate messages already), but that wouldn't really make much sense to me since it's in-memory.
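The leader-election guarantee above can be sketched with a toy model (again, illustrative names only, not the DistributedLeaderElection API): at most one leader exists at a time, a single new leader is elected from the survivors when the current one crashes, and clients can always ask who leads:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Toy sketch of the guarantee described above, not Atomix's API:
// exactly one leader at a time, re-elected on crash, discoverable by clients.
class LeaderElection {
    private final Set<String> candidates = new LinkedHashSet<>();
    private String leader;

    void join(String node) {
        candidates.add(node);
        if (leader == null) leader = node; // first joiner becomes leader
    }

    void crash(String node) {
        candidates.remove(node);
        if (node.equals(leader)) {
            // elect a single new leader from the survivors, if any remain
            leader = candidates.stream().findFirst().orElse(null);
        }
    }

    String leader() { return leader; } // discovery: clients ask who leads
}
```

The hard part, of course, is that in a real system this state must itself be replicated via consensus so that all nodes agree on the single leader, which is exactly what Raft provides.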