I am trying to implement clustering for my database process, and hoping to leverage Atomix instead of ZooKeeper so that it's embedded within the database process itself. The approach I'm taking at the moment is to use Atomix to have a leader elected per database, and additionally to keep a set of all the databases currently online in the cluster. When a new database is created, a new DistributedGroup for it will be created, and all other nodes will join this group so that a leader can be elected for that database. All writes will be handled by the leader, which will also send replicated data to the replicas. My questions:

1. Is Atomix the right framework for this?
2. How do I get the IP addresses of members in a DistributedGroup? I would like to use custom RPC rather than Atomix messaging, which I understand goes through a WAL.
3. In my use case these databases have potentially thousands of tables. Would it make sense to have a leader per table, or would that cause chaos when a node goes down, since a leader election cycle would need to happen for each of those tables?
4. How can I leverage consistent hashing with Atomix?

@kuujo Would really appreciate your thoughts!

I think this is a perfect use case for Atomix. A lot of people come here asking how to use Atomix to build a database; you came here asking how to use it to coordinate a database. That's exactly what you should be doing.

I think you're doing the right thing here. Atomix internally uses a point-to-point messaging abstraction with a Netty implementation that will eventually be exposed to users, but it isn't well suited to general messaging. Anyways, you should just assign metadata to members when they join, e.g. `group.join(new MyMetadata())`, to associate information (like an RPC address) with a group member.
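A minimal sketch of what such metadata might look like (`MemberInfo` is a made-up class; the idea is that each node passes something like it to `group.join(...)` so other members can discover its custom RPC endpoint instead of using Atomix messaging):

```java
import java.io.Serializable;

// Hypothetical metadata a node attaches to its group membership on join.
// Other members would read this back from the member to learn the node's
// RPC endpoint, rather than routing traffic through Atomix's own messaging.
public class MemberInfo implements Serializable {
  private final String host;
  private final int rpcPort;

  public MemberInfo(String host, int rpcPort) {
    this.host = host;
    this.rpcPort = rpcPort;
  }

  public String host() { return host; }

  public int rpcPort() { return rpcPort; }

  // Convenience form "host:port" for dialing the member directly.
  public String rpcAddress() { return host + ":" + rpcPort; }
}
```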

I think there will be a lot less chaos than you expect when a node goes down, even if you have a thousand groups and a thousand leaders. The reason is that, from the underlying state machine's perspective, a node crashing is still a single event. It will appear to each group as a separate event, but a single node crashing amounts to ~1000 method calls in the state machine, which is trivial. The only real overhead will be a single commit to the Raft log for the expired session (which triggers leader changes in all the state machines) and servers sending leader change events to clients, but again, 1000 leader election events are fairly trivial since they're simply TCP messages.

In the past there were utilities for consistent hashing, but they've been removed in favor of simpler abstractions for now. I suppose the way to do consistent hashing would be to use DistributedGroup for cluster membership/failure detection and hash nodes/keys using the group. This is actually the one major distributed systems algorithm that I don't have a reference implementation for, but I have one planned.
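To make that concrete, here's a minimal sketch of a consistent-hash ring you could maintain locally on each node, adding and removing nodes as group membership events arrive (no Atomix APIs involved; the virtual-node count and node names are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// A minimal consistent-hash ring: each node is placed at several virtual
// points on the ring, and a key maps to the first node clockwise from the
// key's hash. Adding/removing a node only remaps keys near its points.
public class HashRing {
  private final TreeMap<Long, String> ring = new TreeMap<>();
  private final int virtualNodes;

  public HashRing(int virtualNodes) {
    this.virtualNodes = virtualNodes;
  }

  public void addNode(String nodeId) {
    for (int i = 0; i < virtualNodes; i++) {
      ring.put(hash(nodeId + "#" + i), nodeId);
    }
  }

  public void removeNode(String nodeId) {
    for (int i = 0; i < virtualNodes; i++) {
      ring.remove(hash(nodeId + "#" + i));
    }
  }

  // The owner of a key is the first ring entry at or after the key's hash,
  // wrapping around to the start of the ring if necessary.
  public String nodeFor(String key) {
    if (ring.isEmpty()) throw new IllegalStateException("empty ring");
    SortedMap<Long, String> tail = ring.tailMap(hash(key));
    return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
  }

  private static long hash(String s) {
    try {
      byte[] d = MessageDigest.getInstance("MD5")
          .digest(s.getBytes(StandardCharsets.UTF_8));
      long h = 0;
      for (int i = 0; i < 8; i++) {
        h = (h << 8) | (d[i] & 0xff); // first 8 digest bytes as a long
      }
      return h;
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
```

In the DistributedGroup approach described above, each node would call `addNode`/`removeNode` from its join/leave listeners, so every member converges on the same ring from the same membership events.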

Here's what I tell new users exploring ways to use Atomix: Atomix and its primitives can be powerful if used correctly, but they should be seen merely as a component of a solution and not a solution in and of itself. There are obviously limitations to consensus in terms of performance and fault tolerance. People are often too eager for a total solution, but I think you're thinking about this exactly the right way.

Thank you @kuujo for the explanation and your time; it makes complete sense. Regarding consistent hashing, what are your thoughts on (1) decentralized vs. (2) centralized consistent hashing? To elaborate: for (1), each machine keeps the consistent hashing ring locally and coordinates updates to it using Atomix broadcasts, where the leader sends messages about nodes joining and leaving the cluster. For (2), the leader computes shard locations using consistent hashing and stores the information in Atomix, but then there needs to be either a message-driven or poll-based mechanism to update the local routing tables. Approach 2 is what I see happening in projects like Apache Helix. Would really appreciate your thoughts on this!

I have a DistributedGroup representing the nodes in the cluster. Each node joins the group at startup, and builds an identical hash-ring from the list of member-ids upon joining. Then each node listens for onJoin and onLeave events, and updates its local view of the ring accordingly.

To ensure orderly rebalancing transitions, I also establish a DistributedLock for each resource ("partition") that is assigned via the ring. When a node takes ownership of a partition, it waits to obtain that lock before beginning processing. Relinquishing a partition involves releasing that lock after processing is halted/flushed and checkpoints have been captured.
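The handoff ordering described above can be modeled like this (a sketch only: a local `ReentrantLock` stands in for what would be a per-partition Atomix DistributedLock, and in the real cluster the blocked `lock()` call would be a different node waiting on the distributed lock):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Models the handoff protocol for one partition: a new owner blocks on the
// partition lock before processing, and the old owner releases the lock only
// after halting processing and capturing a checkpoint.
public class PartitionHandoff {
  private final Lock partitionLock = new ReentrantLock();
  private final List<String> events = new ArrayList<>();

  // New owner: wait for the lock, then begin processing the partition.
  public void takeOwnership() {
    partitionLock.lock();
    events.add("acquired-lock");
    events.add("start-processing");
  }

  // Old owner: halt and checkpoint first, then release the lock so the
  // next owner's pending acquire can proceed.
  public void relinquish() {
    events.add("halt-processing");
    events.add("checkpoint");
    partitionLock.unlock();
    events.add("released-lock");
  }

  public List<String> events() {
    return events;
  }
}
```

The key invariant is simply that "released-lock" can never precede "checkpoint", so the incoming owner never sees partially flushed state.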

No, I'm not managing persistence in this layer; I just need to distribute processing among a cluster of workers. I haven't thought much about how one might achieve orderly handoffs of replication responsibilities.

Got it, thanks for sharing this info @joshng! Another follow-up question about onLeave and onJoin: are there any message delivery guarantees for those listeners, or, like ZooKeeper, is one expected to poll for state changes?

offhand, I think I might attack replication topologies with some sort of epoch-rollover system, perhaps creating a new DistributedGroup for each version of the ring: nodes could signal their readiness for a given ring structure by leaving the old group and joining the next?

I do believe all group-members should observe onJoin and onLeave events in the same order, but @kuujo would probably be better qualified to confirm that