Roberto Ostinelli:
> the idea is quite simple: queuing all messages sent from a node to another
> node, and sending them by groups. the whole concept is therefore to have a
> gen_server, called 'qr', running on every node where message passing takes
> place. a single process from a node A sends a message to a process of node
> B, being relayed by the two 'qr':
>> process on node A => 'qr' on node A => 'qr' on node B => process on node B.
We've had something vaguely similar in production for a year, where
there are peaks of 100K messages per second between nodes.
One thing you need to watch out for is that sending a message to
another node can sometimes take a long time. I once deliberately ran
a server out of memory and CPU until the swapping made it more or
less unresponsive. I can't remember exactly how long it took to send
(or fail to send) a message to a node on that server from another
machine; I just remember that it hurt. It may even have been seconds.
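
If you want to repeat that kind of measurement, something along these
lines might do. It is only a rough sketch: the module and function
names are made up for illustration, and it times the send call itself,
not delivery on the other side.

```erlang
-module(send_timing).
-export([timed_send/3]).

%% Hypothetical helper: send Msg to the process registered as Name on
%% Node and return how long the send call took, in microseconds.
%% This says nothing about when (or whether) the message arrives.
timed_send(Node, Name, Msg) ->
    {Micros, _Result} = timer:tc(fun() -> {Name, Node} ! Msg end),
    Micros.
```

Calling send_timing:timed_send(Node, some_name, ping) from another
machine while the target is swapping should show whether the sender
gets held up.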
If that happens and you have a single 'qr' per node, one failing node
can slow down all your traffic (unless you only have two nodes, of
course). Running one 'qr' per remote node on each node, so that every
destination has its own relay, should protect you from that problem.
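
To make the "one 'qr' per destination" idea concrete, here is a minimal
sketch. It is not what we run in production; the module name qr_out,
the receiving name qr_in, the 10 ms flush interval and the message
shapes are all assumptions made up for the example. Each outbound relay
is a gen_server that queues messages for a single destination node and
ships them as one batch, so a slow send to one node only blocks the
relay for that node.

```erlang
-module(qr_out).
-behaviour(gen_server).
-export([start_link/1, send/3, name/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(FLUSH_MS, 10).  % hypothetical batching interval

%% One outbound relay per destination node, registered locally under a
%% name derived from that node (fine for a small, fixed set of nodes).
name(DestNode) ->
    list_to_atom("qr_out_" ++ atom_to_list(DestNode)).

start_link(DestNode) ->
    gen_server:start_link({local, name(DestNode)}, ?MODULE, DestNode, []).

%% Queue Msg for the process registered as ToName on DestNode.
send(DestNode, ToName, Msg) ->
    gen_server:cast(name(DestNode), {enqueue, ToName, Msg}).

init(DestNode) ->
    erlang:send_after(?FLUSH_MS, self(), flush),
    {ok, {DestNode, []}}.

handle_call(_Request, _From, State) ->
    {reply, {error, unsupported}, State}.

handle_cast({enqueue, ToName, Msg}, {DestNode, Queue}) ->
    {noreply, {DestNode, [{ToName, Msg} | Queue]}}.

%% Ship the queued messages, oldest first, to the (assumed) inbound
%% relay 'qr_in' on the destination. If this send is slow, only this
%% destination's relay is held up.
handle_info(flush, {DestNode, Queue}) ->
    case Queue of
        [] -> ok;
        _  -> {qr_in, DestNode} ! {batch, lists:reverse(Queue)}
    end,
    erlang:send_after(?FLUSH_MS, self(), flush),
    {noreply, {DestNode, []}};
handle_info(_Other, State) ->
    {noreply, State}.
```

On each node you would start one of these per remote node, e.g.
qr_out:start_link('b@host'), and have senders call
qr_out:send('b@host', some_proc, Msg). The inbound side, which unpacks
{batch, Msgs} and delivers each message locally, is left out here.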