MetalLB in layer 2 mode

In layer 2 mode, one node assumes the responsibility of advertising a service to
the local network. From the network’s perspective, it simply looks like that
machine has multiple IP addresses assigned to its network interface.

Under the hood, MetalLB responds to ARP requests for IPv4 services, and NDP
requests for IPv6.
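As a rough illustration of the IPv4 case, the sketch below parses an ARP request and builds a reply only when the requested address is one of the service IPs the node currently announces. It is not MetalLB's actual code: the function names, the `owned_ips` set, and the example addresses are all assumptions made for illustration, and NDP for IPv6 is not covered.

```python
import socket
import struct

ARP_REQUEST = 1
ARP_REPLY = 2
# Fixed-size ARP payload for Ethernet/IPv4: 28 bytes in total.
ARP_FORMAT = "!HHBBH6s4s6s4s"


def build_arp_payload(op, sender_mac, sender_ip, target_mac, target_ip):
    """Pack an Ethernet/IPv4 ARP payload (without the Ethernet header)."""
    return struct.pack(
        ARP_FORMAT,
        1,        # hardware type: Ethernet
        0x0800,   # protocol type: IPv4
        6,        # hardware address length
        4,        # protocol address length
        op,
        sender_mac, socket.inet_aton(sender_ip),
        target_mac, socket.inet_aton(target_ip),
    )


def answer_arp(payload, owned_ips, my_mac):
    """Return an ARP reply for a request that targets an owned service IP.

    Returns None for anything else, so the node stays silent for
    addresses it is not responsible for. `owned_ips` is a hypothetical
    set of service IPs this node is currently announcing.
    """
    _, _, _, _, op, sender_mac, sender_ip, _, target_ip = struct.unpack(
        ARP_FORMAT, payload[:28]
    )
    if op != ARP_REQUEST:
        return None
    requested = socket.inet_ntoa(target_ip)
    if requested not in owned_ips:
        return None
    # Claim the service IP with this node's MAC, addressed to the asker.
    return build_arp_payload(
        ARP_REPLY, my_mac, requested, sender_mac, socket.inet_ntoa(sender_ip)
    )
```

This is why the network sees the machine as simply having extra IP addresses: the reply carries the node's own MAC address for every service IP it announces.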

The major advantage of layer 2 mode is its universality: it works on any
Ethernet network, with no special hardware required, not even fancy routers.

Load-balancing behavior

In layer 2 mode, all traffic for a service IP goes to one node. From there,
kube-proxy spreads the traffic to all the service’s pods.

In that sense, layer 2 mode does not implement a load balancer. Rather, it
implements a failover mechanism so that a different node can take over if the
current leader node fails.

Failover is automatic: when the leader node fails, its lease times out after
10 seconds, at which point another node becomes the leader and takes over
ownership of the service IP.
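The lease timeout can be pictured with a small simulation. This is a minimal sketch of a generic lease-based election, not MetalLB's implementation: the `Lease` class, its method name, and the node names are hypothetical, and only the 10-second timeout comes from the text above.

```python
LEASE_SECONDS = 10  # timeout quoted in the text above


class Lease:
    """A single shared lease deciding which node owns a service IP."""

    def __init__(self):
        self.holder = None
        self.renewed_at = None

    def try_acquire(self, node, now):
        """Renew the lease if `node` holds it, or take it over once expired.

        Returns True when `node` is the leader after the call.
        """
        expired = (
            self.renewed_at is None
            or now - self.renewed_at >= LEASE_SECONDS
        )
        if self.holder == node or expired:
            self.holder = node
            self.renewed_at = now
            return True
        return False


lease = Lease()
lease.try_acquire("node-a", now=0)   # node-a becomes the leader
lease.try_acquire("node-a", now=5)   # node-a renews while it is healthy
# node-a crashes after t=5 and stops renewing.
lease.try_acquire("node-b", now=8)   # too early, the lease is still valid
lease.try_acquire("node-b", now=15)  # lease expired, node-b takes over
```

In this model the service IP has no owner between the crash and the moment the lease expires, which is the gap the 10-second figure describes.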

Limitations

Layer 2 mode has two main limitations you should be aware of: single-node
bottlenecking, and potentially slow failover.

As explained above, in layer 2 mode a single leader-elected node receives all
traffic for a service IP. This means that your service’s ingress bandwidth is
limited to the bandwidth of a single node. This is a fundamental limitation of
using ARP and NDP to steer traffic.

In the current implementation, failover between nodes depends on cooperation
from the clients. When a failover occurs, MetalLB sends a number of gratuitous
layer 2 packets (a bit of a misnomer; they are better described as “unsolicited
layer 2 packets”) to notify clients that the MAC address associated with the
service IP has changed.
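For IPv4, such an unsolicited packet is typically a gratuitous ARP. The sketch below builds one as a raw Ethernet frame, assuming the common request-style announcement in which the sender and target IP are both the service IP. The exact form MetalLB emits may differ, and the function names and example addresses are hypothetical. Sending the frame would additionally need a raw socket and elevated privileges, which the sketch leaves out.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6
ETHERTYPE_ARP = 0x0806


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def gratuitous_arp_frame(service_ip, new_leader_mac):
    """Build a broadcast ARP announcement binding service_ip to the new MAC."""
    mac = mac_to_bytes(new_leader_mac)
    ip = socket.inet_aton(service_ip)
    # Ethernet header: destination, source, EtherType.
    ethernet_header = struct.pack("!6s6sH", BROADCAST_MAC, mac, ETHERTYPE_ARP)
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,             # hardware type: Ethernet
        0x0800,        # protocol type: IPv4
        6, 4,          # hardware and protocol address lengths
        1,             # opcode: request (announcement form, an assumption)
        mac, ip,       # sender: the new leader's MAC and the service IP
        b"\x00" * 6,   # target MAC is left empty in an announcement
        ip,            # target IP equals sender IP, marking it gratuitous
    )
    return ethernet_header + arp_payload


frame = gratuitous_arp_frame("192.168.1.240", "02:00:00:00:00:02")
```

Nobody asked for this packet, which is the sense in which it is "unsolicited": clients that honor it overwrite their cached MAC for the service IP as soon as it arrives.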

Most operating systems handle “gratuitous” packets correctly, and update their
neighbor caches promptly. In that case, failover happens within a few
seconds. However, some systems either don’t handle gratuitous packets at
all, or have buggy implementations that delay the cache update.

All modern versions of major OSes (Windows, macOS, Linux) implement layer 2
failover correctly, so issues should only arise with older or less common
OSes.

To minimize the impact of planned failover on buggy clients, you should keep the
old leader node up for a couple of minutes after flipping leadership, so that it
can continue forwarding traffic for old clients until their caches refresh.

During an unplanned failover, the service IPs will be unreachable for buggy
clients until they refresh their cache entries.
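The difference between well-behaved and buggy clients can be modeled with a toy neighbor cache. This is a deliberately simplified sketch: real ARP and NDP caches have more states than a single expiry time, and the class, its parameters, and the 60-second entry lifetime are assumptions chosen for illustration.

```python
class NeighborCache:
    """Toy model of a client's IP-to-MAC cache."""

    def __init__(self, honors_gratuitous, entry_ttl):
        self.honors_gratuitous = honors_gratuitous
        self.entry_ttl = entry_ttl
        self.entries = {}  # ip -> (mac, expires_at)

    def learn(self, ip, mac, now):
        """Record a mapping learned from a normal request/reply exchange."""
        self.entries[ip] = (mac, now + self.entry_ttl)

    def on_gratuitous(self, ip, mac, now):
        """Handle an unsolicited announcement; buggy clients ignore it."""
        if self.honors_gratuitous:
            self.learn(ip, mac, now)

    def lookup(self, ip, now):
        """Return the cached MAC, or None if the client must ask again."""
        entry = self.entries.get(ip)
        if entry is None or now >= entry[1]:
            return None
        return entry[0]


good = NeighborCache(honors_gratuitous=True, entry_ttl=60)
buggy = NeighborCache(honors_gratuitous=False, entry_ttl=60)
for client in (good, buggy):
    client.learn("192.168.1.240", "old-leader-mac", now=0)
    # Failover at t=10: the new leader announces itself.
    client.on_gratuitous("192.168.1.240", "new-leader-mac", now=10)

good.lookup("192.168.1.240", now=11)   # already points at the new leader
buggy.lookup("192.168.1.240", now=11)  # still points at the old leader
buggy.lookup("192.168.1.240", now=60)  # entry expired, client asks again
```

Under these assumptions the buggy client keeps sending traffic to the old leader's MAC until its entry expires, which is why keeping the old leader up after a planned failover helps and why an unplanned failover leaves such clients without service for a while.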

If you encounter a situation where layer 2 mode failover is slow (more than
about 10 seconds), please file a bug! We can help you investigate and determine
if the issue is with the client, or a bug in MetalLB.

Comparison to Keepalived

MetalLB’s layer 2 mode has a lot of similarities to Keepalived, so if you’re
familiar with Keepalived, this should all sound fairly familiar. However, there
are also a few differences worth mentioning. If you aren’t familiar with
Keepalived, you can skip this section.

Keepalived uses the Virtual Router Redundancy Protocol (VRRP). Instances of
Keepalived continuously exchange VRRP messages with each other, both to select a
leader and to notice when that leader goes away.

MetalLB, on the other hand, relies on Kubernetes to know when pods and nodes go
up and down. It doesn’t need to speak a separate protocol to select leaders.
Instead, it lets Kubernetes do most of the work of deciding which pods are
healthy, and which nodes are ready.

Keepalived and MetalLB “look” the same from the client’s perspective: the
service IP address seems to migrate from one machine to another when failovers
happen, and the rest of the time it just looks like machines have more than one
IP address.

Because it doesn’t use VRRP, MetalLB isn’t subject to some of the limitations of
that protocol. For example, the VRRP limit of 255 load-balancers per network
doesn’t exist in MetalLB. You can have as many load-balanced IPs as you want, as
long as there are free IPs in your network. MetalLB also requires less
configuration than VRRP: for example, there are no Virtual Router IDs.

On the flip side, because MetalLB relies on Kubernetes for information instead
of a standard network protocol, it cannot interoperate with third-party
VRRP-aware routers and infrastructure. This is working as intended: MetalLB is
specifically designed to provide load balancing and failover within a
Kubernetes cluster, and in that scenario interoperability with third-party LB
software is out of scope.