When things are managed automatically with good tools, they work. When people manage them, they often work. Here we talk about automation, managing systems, monitoring, discovery, open source, and related topics.

29 October 2010

Really Big Clusters: A Scalable membership proposal

This blog entry is a bit different than previous entries - I'm proposing some enhanced capabilities to go with the LRM and friends from the Linux-HA project. I will update this entry on an ongoing basis to match my current thinking about this proposal.

This post outlines a proposed server liveness ("membership") design which is intended to scale up from a dozen or so servers to tens of thousands of servers to be managed as an entity.

Scalability depends on a lot of factors - processor overhead, network bandwidth, and network load. A highly scalable system will take all of these factors into account. From the perspective of the server software author (for example, me), one of the easiest to overlook is network load. Network load depends on a number of factors - the number of packets, the size of the packets, how many switches or routers each packet has to go through, and how many endpoints will receive it. To keep network load low, it is desirable that the majority of "normal" traffic be network topology aware. To scale up to very large collections of computers, it is also necessary that as much as possible be monitored as locally as possible. In addition, since switching gear is not optimized for multicast packets, and multicast packets consume significant resources when compared to unicast packets, it is desirable to avoid using multicast packets during normal operation.

The Basic Concept - network aware liveness

Although the LRM in Linux-HA is not network-enabled, it tries to minimize monitoring overhead by distributing the task of monitoring resources to the machine providing the resource, and only reporting failures upstream.

To extend this idea of local monitoring into system liveness monitoring, one might imagine a standard 48-port network switch with 48 servers attached to it. If one were to choose a server to act as proxy for monitoring the servers on that switch, then the other 47 nodes on the switch would send unicast heartbeats to that node. That node in turn would report on failure-to-receive-heartbeat events from the other 47 nodes on the switch.

Although this is simple to understand, it isn't ideal - the work of detecting failures within a switch is concentrated in a single node - and what happens when a switch fails? So, let's try a different idea...

What if, instead of having a hierarchy, the nodes within a switch were arranged in a sort of doubly linked circular list - where each node sent heartbeats to the nodes "ahead" of and "behind" it on the list - making a ring-like structure? This would spread the work of detecting failures across more machines. For the 48-port switch, each node would send a heartbeat to the node with the next lower port number, and to the node with the next higher port number. The first port would send heartbeats to port 2 and port 48, and port 48 would send heartbeats to ports 47 and 1. Since each node is monitored by two other nodes, monitoring has no single point of failure. This design is different from normal communication rings because there is no token being passed around the whole ring - just heartbeats going to a single node ahead and a single node behind. So, conventional thoughts about latency in rings don't apply here.
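The neighbor rule for the 48-port example can be written down in a few lines. This is only a sketch: the function name `ring_neighbors` and the use of 1-based port numbers as node identifiers are assumptions made for illustration, not part of any existing Linux-HA code.

```python
from typing import Tuple

def ring_neighbors(port: int, ports: int = 48) -> Tuple[int, int]:
    """Return the (lower, higher) ring neighbors of a 1-based switch port."""
    if not 1 <= port <= ports:
        raise ValueError(f"port must be in 1..{ports}")
    lower = (port - 2) % ports + 1   # port 1 wraps around to the last port
    higher = port % ports + 1        # the last port wraps around to port 1
    return lower, higher

print(ring_neighbors(1))    # (48, 2)
print(ring_neighbors(48))   # (47, 1)
```

Each node only ever needs to know its own two neighbors, which is why nothing has to travel around the whole ring.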

If one were to implement this in the context of a cloud-like infrastructure or a monitoring environment, it is likely to be desirable to also run the LRM (or something like it) on these same machines so one can perform more detailed monitoring and/or effect changes to the managed servers.

Some Tentative Terminology

In order to facilitate discussion, I'll use the following terms (which are, of course, subject to change):

Overlord - the policy-aware management entity at the top of the food chain. The Overlord layer might be a cloud management layer, a system management package like IBM Director or Nagios, a high-performance scientific cluster management entity, or some other entity interested in managing large numbers of computers. The term Overlord is used to refer to a management capability - not necessarily implemented in a single computer.

Minion - a monitoring node which is listening to heartbeats from other nodes and reporting failure-to-receive-heartbeat events to an Overlord. The minions at the lowest level in the system are only part-time minions - with the rest of their resources being spent on performing other tasks. Dedicated minion servers should be capable of monitoring thousands (or tens of thousands) of other minions. It is anticipated that the minion function would be performed by some software which also acts as a network proxy for an LRM client. Minions send heartbeats, listen for heartbeats, report problems up the line, and act as LRM proxies.

Datagram Protocols

I anticipate the need for two different datagram transmission protocols: UDP (User Datagram Protocol, which is unreliable) and RDS (a reliable datagram service). UDP would be used for heartbeats, and RDS for communication between the Overlords and the LRM proxies (regardless of role). It is expected that digital signatures are sufficient for heartbeats, but encryption may be desirable for some Overlord communication.

For this application, it is anticipated that the reliable datagram communication would not be high bandwidth, or likely need congestion control. Other options in place of RDS would be an application protocol over UDP, or something like Qpid (AMQP). I suspect any of these would work nicely.
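To make the idea of an authenticated UDP heartbeat concrete, here is a minimal sketch. Everything about it is an assumption made for illustration: the packet layout (an 8-byte sequence number followed by a tag), the function names, and the use of HMAC-SHA256. Note that an HMAC is a shared-key message authentication code, which stands in here for the digital signatures mentioned above; a real design might choose public-key signatures instead.

```python
import hashlib
import hmac
import socket
import struct

# Hypothetical wire format: 8-byte sequence number + 32-byte HMAC-SHA256 tag.
HEADER = struct.Struct("!Q")
TAG_LEN = hashlib.sha256().digest_size

def make_heartbeat(key: bytes, seq: int) -> bytes:
    """Build one authenticated heartbeat packet."""
    body = HEADER.pack(seq)
    tag = hmac.new(key, body, hashlib.sha256).digest()
    return body + tag

def check_heartbeat(key: bytes, packet: bytes):
    """Return the sequence number, or None if authentication fails."""
    if len(packet) != HEADER.size + TAG_LEN:
        return None
    body, tag = packet[:HEADER.size], packet[HEADER.size:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None
    return HEADER.unpack(body)[0]

def send_heartbeat(sock: socket.socket, peer, key: bytes, seq: int) -> None:
    """Send one unicast heartbeat; peer is an (address, port) tuple."""
    sock.sendto(make_heartbeat(key, seq), peer)
```

A receiver would call `check_heartbeat` on every datagram and silently drop anything that returns `None`. Tracking the sequence number to reject replayed packets is left out of this sketch.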

LRM Proxy Design Goals

Here are a few design goals which are good to keep in mind during detailed design and implementation:

Continue the policy-free approach that the LRM implements. Although this implementation is intended to be topology aware, the LRM proxy code should rely on directions from its Overlord to tell it where and how often to send heartbeats, which heartbeats to expect (and how often), and other configuration information. The Overlord is the entity that "understands" the overall heartbeat topology. Nevertheless, it is possible that the LRM proxy may gather some low-level network configuration for its Overlord.

Minimize resource consumption - CPU, memory, network bandwidth should be held to a minimum. Although this component is inherently lightweight, it will exist on every server in the collection - so resource waste is multiplied by thousands or tens of thousands of instances.

Minimize latency. Just because there is a hierarchy of watchers watching for failures doesn't mean that problem reports have to travel back up through that hierarchy.

Minimize membership workload on non-dedicated minions.

Try to design the API so that it is flexible enough and stable enough that version mismatches are unlikely to cause problems.

Keep it simple. The LRM proxy is a simple concept; the detailed design and implementation should be simple as well.

The LRM proxy API would include at least the following requests and messages:

Requests to expect heartbeats from a certain address at a certain rate

Requests to stop expecting heartbeats from a certain address

Requests to add an address of an Overlord

Requests to delete an address of an Overlord

Requests to register a new Overlord authentication key

Requests to drop an Overlord authentication key

Requests to change the outgoing LRM proxy authentication key

Requests to stop sending boot-up multicasts

Sending failure-to-receive-heartbeat messages to Overlords

Sending LRM return result messages back to the registered Overlords

As implied above, at startup time, each LRM proxy would send out a multicast or broadcast packet announcing that it is now up and ready to accept requests. It would continue to send these packets until it receives an authenticated message from an Overlord telling it to stop. From then on the normal proxy API would be employed.
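The listening side of the requests listed above (expect heartbeats, stop expecting them, report failures) might be sketched as follows. The class name `HeartbeatMonitor`, its method names, and the `grace` multiplier are all hypothetical; the proposal does not specify how many missed intervals count as a failure. Time is passed in explicitly so the logic is easy to follow and test.

```python
class HeartbeatMonitor:
    """Tracks the peers an Overlord has told this minion to listen for."""

    def __init__(self):
        self._expected = {}   # address -> (interval_seconds, last_seen)

    def expect(self, address, interval, now):
        """Overlord request: expect heartbeats from address at this interval."""
        self._expected[address] = (interval, now)

    def stop_expecting(self, address):
        """Overlord request: stop expecting heartbeats from address."""
        self._expected.pop(address, None)

    def heartbeat_received(self, address, now):
        """Record a heartbeat; heartbeats from unexpected peers are ignored."""
        if address in self._expected:
            interval, _ = self._expected[address]
            self._expected[address] = (interval, now)

    def overdue(self, now, grace=2.0):
        """Addresses whose last heartbeat is older than grace * interval."""
        return sorted(addr for addr, (interval, last) in self._expected.items()
                      if now - last > grace * interval)

mon = HeartbeatMonitor()
mon.expect("10.0.0.2", interval=1.0, now=0.0)
mon.expect("10.0.0.48", interval=1.0, now=0.0)
mon.heartbeat_received("10.0.0.2", now=2.5)
print(mon.overdue(now=3.0))   # ['10.0.0.48']
```

Each address returned by `overdue` would become a failure-to-receive-heartbeat message sent to the registered Overlords. Note that the monitor itself holds no policy: which peers to watch and how often are entirely the Overlord's decisions.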

Some Thoughts about Heartbeating Topologies

This design could also be extended to create multiple levels of bidirectional rings between all the switches on a subnet, and between all the subnets. The point is not that this is an optimal topology, but that the design of the LRM proxy component is topology-independent - it should support any unicast-based topology.

Below is an illustration of what a multi-level ring topology might look like.

For this smallish network, there are 54 servers. 12 of them send and receive heartbeats from 4 peers. The remaining 42 have two peers - averaging 2.4 connections per server. In a larger network, this number would be closer to 2.
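The averages quoted above are easy to check. In the sketch below, the 54-server figures come straight from the text. The second call is a hypothetical larger layout, assumed purely for illustration: a 48-server ring in which only one member also sits on a higher-level ring.

```python
def average_peers(four_peer_servers: int, two_peer_servers: int) -> float:
    """Average number of heartbeat peers per server."""
    total = four_peer_servers + two_peer_servers
    return (four_peer_servers * 4 + two_peer_servers * 2) / total

# The 54-server example: 12 servers with 4 peers, 42 servers with 2 peers.
print(round(average_peers(12, 42), 2))   # 2.44

# Hypothetical larger ring: 1 server with 4 peers, 47 servers with 2 peers.
print(round(average_peers(1, 47), 2))    # 2.04
```

The fewer servers that sit on more than one ring, the closer the average gets to 2.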

If one wanted a higher degree of redundancy, one could imagine two shifted rings - with each node talking to nodes n-2, n-1, n+1 and n+2. This would increase the number of packets sent and received per interval to 4. This could be used at all levels, or selectively at higher levels.
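The shifted-ring variant generalizes the neighbor rule to a configurable reach. This is a sketch only; the function name `shifted_ring_peers`, the `reach` parameter, and the 0-based indexing are assumptions made for illustration.

```python
def shifted_ring_peers(index, size, reach=2):
    """Peers of a 0-based ring member that talks to n-reach .. n+reach."""
    if size <= 2 * reach:
        raise ValueError("ring too small for distinct peers at this reach")
    offsets = [d for d in range(-reach, reach + 1) if d != 0]
    return [(index + d) % size for d in offsets]

print(shifted_ring_peers(0, 48))        # [46, 47, 1, 2]
print(len(shifted_ring_peers(0, 48)))   # 4
```

With `reach=1` this reduces to the simple ring, so an Overlord could use the same rule at every level and vary only the reach.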

There are no doubt other interesting topologies one could imagine as well.

Advantages to this LRM proxy design

Here are a few advantages to this design:

The LRM proxy design is simple and straightforward.

Once a few releases have gone out, it is likely that the LRM proxy code and API will be stable for a long time. This is a significant advantage for a component which is deployed on tens of thousands of computers. It also simplifies upgrades, since it would be a very rare occurrence that an LRM proxy update would have to be performed in order to update the Overlord software.

The LRM proxy code will not have to change in order to fine tune the network resource usage. Since where to send heartbeats and how often to send them are commands that come from its Overlord, the LRM proxy neither knows nor cares what the topology is.

Although this was designed with network topology-awareness as a goal, it is not necessary that an Overlord implement topology-awareness, either initially, correctly, or at all. The LRM proxy concept will function correctly without topology awareness; it will simply consume more network resources than necessary.

Potential problems with this design

Providing switch topology information to the Overlord(s) is a potential difficulty. There are some protocols implemented by some switches which provide information similar to what's required. Further investigation is required to determine how difficult discovery and auto-configuration of network topology will be. The Link Layer Discovery Protocol (LLDP) may be of help here.

The scope of the problem being addressed is very large. The addition of layers of monitoring will likely introduce difficulties in debugging problems. Infrastructure for determining the origin of anomalous behavior should be provided.

The flexibility in setting up monitoring topologies adds complexity to the system (even though it keeps it out of the LRM proxy).

There is the possibility that the Overlord will receive conflicting reports regarding the liveness of a particular node. This potential lack of consensus and timing issues will add complexity to the corner cases in the Overlord.
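To illustrate the conflicting-reports problem, here is one possible (and deliberately naive) way an Overlord could classify nodes. The function name `classify`, the three status labels, and the rule "dead only when every assigned watcher has reported a failure" are assumptions for the sake of the example, not part of the proposal.

```python
def classify(monitors, failure_reports):
    """Classify nodes from failure-to-receive-heartbeat reports.

    monitors: node -> set of nodes assigned to watch it.
    failure_reports: set of (reporter, node) pairs.
    Returns node -> "alive", "suspect", or "dead".
    """
    status = {}
    for node, watchers in monitors.items():
        accusers = {r for r, n in failure_reports if n == node and r in watchers}
        if not accusers:
            status[node] = "alive"
        elif accusers == watchers:
            status[node] = "dead"      # every watcher agrees
        else:
            status[node] = "suspect"   # the watchers disagree
    return status

monitors = {"b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}}
reports = {("a", "b"), ("c", "b"), ("b", "c")}
print(classify(monitors, reports))
# {'b': 'dead', 'c': 'suspect', 'd': 'alive'}
```

Even this tiny example shows a corner case: node c is accused only by node b, which is itself considered dead, so the report may be stale. Timing, and the question of whether a silent watcher has itself failed, are exactly the kind of complexity the Overlord has to absorb.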

Open Issues

There are many possible topologies for handling the upper layer Minion workload distribution. It is not yet clear what the set of good choices is, and what the tradeoffs are between those different topologies.

It is possible that it will prove desirable for this same infrastructure to collect statistical information for the Overlords. Conversations with experts in high-performance computing and examination of tools like Ganglia will likely prove helpful in better understanding this problem. What I have in mind is some amount of periodic piggybacking of data on the heartbeats, then aggregation by nodes in minion roles, and periodic uplinks of this data to the Overlords. The goal would be to reduce the number of uplink packets while having minimal impact on the timeliness of the data. Not sure if this is a good idea or not.

What about IPv6?

Concluding Thoughts

This proposal only provides liveness information and distributed control for a large collection of managed servers. In many respects, this is perhaps the easiest component to scale up. It is a long way from a complete cloud infrastructure, clustering software, or enterprise-scope system management package. Scaling the Overlord components above this layer to the same standard of size is a very interesting and challenging task. I suspect that this design is sufficiently scalable that other components in the system architecture are likely to be the limiting factors in system scaling.

An Update

I have now posted a follow-on article outlining some ideas for the "overlord" portion of the infrastructure.

That's an interesting (and thoughtful) comment. This doesn't look very much like Linux-HA (or Pacemaker) - and it's a long way from a complete solution. An OSPF network with 10K routers is a very big OSPF network (not 10K hosts, or 10K switches - they don't participate in OSPF).

In what specific way do you think it should look more like OSPF?

Here are my off-the-cuff thoughts on this question...

How is this problem like OSPF? - it is trying to manage liveness, it is trying to be local network topology aware and network efficient.

How is this problem different from OSPF? It's not trying to solve the "let's help independent fiefdoms work together" problem. At this level, all machines are "owned" by the same owner. It is not trying to provide anything more than liveness (it's trying to solve a simpler problem). There is no distributed control (at this level).

Thinking about this design - it seems to me it's more like DHCP than any other internet protocol that I can think of, since it's centrally managed and has other things in common. For example...

When we boot up, we send a multicast/broadcast packet asking for someone to tell us who we are and how we should be configured (analogous to getting DNS entries and so on from DHCP). Like DHCP clients, our machines "renew their leases" periodically - except we do it a *lot* more often. Instead of measuring lease renewal times in minutes, hours or days, we measure them in seconds (or potentially even in fractions of seconds). To compensate for this, we distribute our "DHCP server analog" throughout the network.