There are various ways to achieve resilience through link
redundancy, and these include somewhat explicit technologies
like VRRP and CARP at the routing level or SMLT at the
network level, and usually one has to choose among
resilience, load balancing and fast switchover. But there
is an alternative that uses nothing more than OSPF and basic
routing to achieve most of these objectives, and it seems
fairly obscure, so perhaps it is useful to document it here,
as I have been using it with excellent results for a while,
recently in a medium-to-high performance network (it is an
idea that has probably been discovered many times).

The basic idea is that OSPF automagically creates a global
routing state from status exchanges among adjacent routers,
that multiple routes to the same destination will offer load
balancing (on a per-connection basis) if ECMP is available,
and that OSPF will also happily propagate host routes to
virtual, canonical IP addresses for hosts.

The basic setup is very simple (a minimal configuration
sketch follows the list):

Ensure that each host or router for which load balanced
resilience is desired is multihomed on different backbones.

Assign an IP address to each such host or router from a
reserved subnet, on a virtual interface, that is, one
not connected to any actual network; it is convenient to
make this address the OSPF router id.

Configure OSPF on each such host or router, and instruct
it to publish not just routes to the networks it is connected
to, but also a host (/32 prefix) route to that IP address.

Enable ECMP on each such host or router.
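
A minimal sketch of the per-host part of these steps under
GNU/Linux with the Quagga routing suite (just one possible
implementation; the interface name, the example address
10.0.3.70 and the exact commands are illustrative
assumptions, and other OSPF implementations have
equivalents):

    # create the virtual interface and bind the canonical
    # address (taken from the reserved subnet) to it
    modprobe dummy
    ip addr add 10.0.3.70/32 dev dummy0
    ip link set dummy0 up

The OSPF part of the configuration is sketched further down
with the example site; ECMP itself depends on the kernel
being built with multipath routing support
(CONFIG_IP_ROUTE_MULTIPATH) and on the routing daemon being
built and configured to install multipath routes.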

That does not take a lot of work. The effects are however
quite interesting, if one refers to each such host or router
not by the address of any of its network interfaces, but by
that of the virtual interface. The reason is that OSPF will
propagate not just routes to each physical network, but also
to each virtual interface, via all the networks its node is
connected to.

If ECMP is enabled and some of these routes have the same
cost, load balancing will occur across all the routes that
share that cost; if any of the routes becomes invalid,
traffic to the virtual IP address will be instantly
rerouted via whichever routes are still available.
Instantly, because when an interface fails the routes
through it are withdrawn, and any equal (or higher) cost
route will then immediately be used for the next packet.
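
On a Linux router the result is an ordinary multipath route
to the canonical address. Purely to illustrate its shape
(the next hops are made up, and normally the OSPF daemon
installs such routes rather than the administrator), the
equivalent static route would be:

    # ECMP route to a canonical address via two equal-cost
    # next hops; if either link fails, the other carries
    # all the traffic
    ip route add 10.0.3.70/32 \
        nexthop via 10.0.1.70 dev eth0 \
        nexthop via 10.0.2.70 dev eth1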

It is also very easy to use
anycast addresses
or other forms of host routes to distribute services in a
resilient way; the canonical addresses of routers and
important servers are similar to anycast addresses.
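
For example (an illustrative sketch with a made-up service
address), two name servers in different parts of the network
could each add the same address to their virtual interface
and advertise it as a host route; clients then reach
whichever instance is cheapest by OSPF cost, and the
surviving one if the other goes away:

    # on each of the servers offering the service
    ip addr add 10.0.3.253/32 dev dummy0
    # and in each server's ospfd.conf advertise it as a
    # host route, e.g. with:
    #   network 10.0.3.253/32 area 0.0.0.0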

As long as connections are from one virtual IP address to
another virtual IP address, packets will keep arriving as
OSPF creates and reshapes the set of routes across the
various networks.

This technique has some drawbacks, mainly:

It works only for IP with OSPF and ECMP. It is not
multiprotocol, and both OSPF and ECMP must be available.
This does not seem to me a big issue; for GNU/Linux and
various BSD flavours there are fairly good implementations,
and many if not most routers support OSPF and ECMP
(sometimes for a hefty extra fee, as for the Nortel routers
I like).

It creates an extra (host) route for each node to which
load balanced resilient access is desired. This is unlikely to
be limiting, given that an OSPF network usually should not grow
beyond a few hundred routers, that routers can handle
thousands of routes, and that not all nodes need to be given
resilient multihomed load balanced access. Most nodes will
only need to be single homed on a multihomed router. The
greatest drawback of this technique in this respect is that
listing routes becomes more verbose. But then this also
provides better information, as the detail of which routes
lead to a virtual address gives valuable information on the
state of links and connectivity of a network.

In order to achieve the full benefits of load balancing
and resilience, services need to bind to the virtual address,
and ideally only to the virtual address. This is
very rarely an issue, and indeed it is an advantage, as
binding services to physical network interface addresses
makes it more difficult to achieve resilience and establish
access control anyhow.
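
For example, an SSH daemon can be restricted to the
canonical address with a line like the following in its
configuration (the address is the made-up canonical address
used elsewhere in these examples):

    # /etc/ssh/sshd_config (fragment): listen only on the
    # canonical address
    ListenAddress 10.0.3.70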

As an example of a particular setup, imagine a site with:

Two 10Gb/s backbones, fibre based LANs each centred
on a high end router, on subnets 10.0.1.0/24 and 10.0.2.0/24,
each router having a connection to the rest of the Internet
and publishing its default route via OSPF. Backbone
routers would have 3 addresses; for example, for the second
backbone, 10.0.2.1 (on that backbone), 10.0.3.2 (as a
canonical address and router id in the
10.0.3.0/24 subnet), and an address on the Internet.

Every other LAN can be attached via a router with
connections to each backbone, the router configured with
OSPF and ECMP (a sketch of the OSPF configuration for such
a LAN router is given after this list). Each LAN router
will have four addresses, for example 10.0.1.70 (first
backbone), 10.0.2.70 (second backbone), 10.0.70.1 (its own
LAN) and 10.0.3.70 (its canonical address and router id).

Important servers can be connected directly to both
backbones; each will run an OSPF daemon and be a router
itself, with three addresses, two on the backbones and
one its canonical address in the router id subnet. If
referred to using this address it will be reachable no
matter how the topology of the network evolves, as long as
it can be reached at all.
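
As an illustration, the OSPF configuration of one of the
LAN routers above (the one with canonical address 10.0.3.70)
might look roughly like this under Quagga; the syntax is
Quagga's ospfd, the single-area layout is an assumption, and
other OSPF implementations have equivalents:

    ! ospfd.conf fragment for the example LAN router
    router ospf
     ospf router-id 10.0.3.70
     passive-interface dummy0
     ! the two backbone subnets it is multihomed on
     network 10.0.1.0/24 area 0.0.0.0
     network 10.0.2.0/24 area 0.0.0.0
     ! its own LAN
     network 10.0.70.0/24 area 0.0.0.0
     ! its canonical address, published as a /32 host route
     network 10.0.3.70/32 area 0.0.0.0

The backbone routers would be configured similarly, plus
something like default-information originate to publish
their default routes.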

In the above discussion a canonical
address is not indispensable, but very useful. The idea is that
a given router or server cannot be reliably referred to by any
of the addresses of its physical interfaces, as in most such
systems or routers when a link dies its associated interface
disappears and any address bound to it vanishes as well.
Therefore each system or router needs an
IP address bound to a virtual interface
(dummy under Linux, circuitless
for some Nortel routers, loopback in the
Cisco and many other cultures) in order to be always reachable
no matter which particular links and interfaces are active.
For each such system there will be a host route published
for its canonical address, but in most networks with dozens or
hundreds of subnet routes, a few dozen or hundred more host
routes are not a big deal, with most routers being able to
handle thousands or tens of thousands of routes.

This scheme is rather more reliable and simpler than the use
of floating router IP addresses with VRRP or CARP or other
load balancing or redundancy solutions, as it does not rely on
tricks with the mapping between IP and Ethernet addresses. It
can also be extended to fairly arbitrary topologies, and with
the use of BGP and careful publication of routes it can be
extended beyond a single AS.

It also has some interesting properties, for example:

In the topology described above there is no direct
communication between the two backbones, and this reduces
the chance of common modes of failure. Indeed this means
having a separate spanning tree per backbone (because of
loops), or even better, no spanning tree at all on the
backbone networks. An amusing detail is what happens when a
router with a link to just one backbone sends traffic to a
router on just the other backbone: dual homed non-backbone
routers will forward the traffic between the two backbones.

If one has two backbone routers, it is fairly easy to
configure BGP on them so that external connections are also
resilient and load balanced.

The file system used is the excellent JFS which is still my
favourite, even if the news that XFS will become part of
RHEL 5.4 may tilt the balance of opportunity towards it.
Anyhow, even if the files used by the test above are large
at 400MB and the filesystems used are fairly full, JFS
achieves transfer rates very close (80-90%) to the speed of
the underlying devices for both reading and writing.

As to Atom CPUs I have been
wondering how power efficient they are, given that they seem
to be around 3 times slower than an equivalent mainstream CPU.
So I found this article reporting a test of the amount of
energy (in watt-hours) consumed by some servers to run the
same benchmarks, and it turns out that among Intel CPUs
the Core2Duo consumes the least energy; while its power draw
is higher, it is faster, and this more than compensates for
the higher power draw. What this tells me is that the Atom
is better for mostly-idle systems, that is IO or network
bound ones, and the Core2Duo is better for mostly-busy ones,
that is CPU-bound ones.
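
To put made-up but plausible numbers on it: a CPU drawing
30W that finishes a job in 20 minutes uses 10Wh, while one
drawing 12W that needs a full hour uses 12Wh, so the faster
chip wins on energy despite the higher draw; but if both
spend most of the day idle at a few watts, the lower-power
one comes out ahead.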

Well, I often think that several mysterious computer issues
are due to PSU faults. Indeed recently one of my older PCs
stopped working reliably: it would boot, and often allow
installing some software or hardware, but it would also stop
working abruptly, seemingly because of a hard disk issue, as
IO to the disk would stop and the IO activity light would
stay locked on.

Having tried some alternative hard disks and
some alternative hard disk host adapter cards with the same
results, I reckoned that not all could be faulty, so I checked
the voltage on one of the Berg style connectors and I was
rather surprised to see that the 12V line was actually
delivering 13V and, more fatally, the 5V line was actually
delivering 4.5V, which is well outside the usual ±5%
tolerance and probably rather insufficient. I wonder why;
the PSU was not a super-cheap one (they can catch fire) but
a fairly high end Akasa one.

This is one of the stranger PSU failures that I have seen so
far, where the voltage on one rail actually rises and on the
other drops to just too low, without the PSU failing outright.

It had to happen, yet I was still a bit surprised to see an
Intel Atom CPU based rackmount server whose most notable
characteristic is that it is half depth. That sort of makes
sense, as one can then mount them without rails, one in the
front and one in the back of a rack.

It is also a bit disappointing to see that the hard disk
is not hot pluggable and is a 3.5" one. The funniest detail
however is that the motherboard chipset is actively cooled
but not the CPU.

The design logic seems to be for a disposable
unit for rent-a-box web server companies, where most such
servers are used for relatively low traffic and small sites,
and anyhow the main bottleneck is the 1Gb/s network interface,
and such servers are often bandwidth-limited to well less than
that, typically 10-100Mb/s.

At the other end of the range the same manufacturer has
announced another interesting idea, 1U and 2U server boxes
with the recent i7 class Xeon 5500 CPU, configured as a
1U/2U blade cabinet. That is mildly amusing, and seems to me
the logical extension of Supermicro putting two servers side
by side into a 1U box, as a fixed configuration.

That is amazing as it represents a really large change for
Red Hat's strategy, which was based on an in-place upgrade
from ext3 to ext4 in RHEL6, and on not introducing major new
functionality in 'stable' releases.
Some factors that I suppose might have influenced the decision:

ext4 has made it into the mainline kernel, but only in a
release that will be part of RHEL6, and RHEL6 has kept
slipping a lot, currently to sometime (late) next year.

A lot of Red Hat customers, especially large Oracle
users, use
XFS
already instead of ext3 and hate losing Oracle
certification because of that.

Red Hat hired Eric Sandeen, one of the main ex-SGI
developers of XFS, and SGI is now disappearing and the
sponsorship of XFS is up for grabs.

The Red Hat sponsorship of XFS shifts my preferences a bit;
I have been using JFS for a while as my default filesystem,
as it is very stable and covers a very wide range of system
sizes and usage patterns pretty well, but I might (with
regrets) move to XFS then.