Synchronizing Clocks In a Cassandra Cluster Pt. 2 – Solutions

This is the second part of a two part series. Before you read this, you should go back and read the original article, “Synchronizing Clocks In a Cassandra Cluster Pt. 1 – The Problem.” In it, I covered how important clocks are and how bad clocks can be in virtualized systems (like Amazon EC2) today. In today’s installment, I’m going to cover some disadvantages of off-the-shelf NTP installations, and how to overcome them.

Configuring NTP daemons

As stated in my last post, it’s the relative drift among clocks that matters most. Syncing independently to public sources will lead to sub-optimal results. Let’s have a look at the other options we have and how well they work. Desirable properties are:

Good relative synchronization; Required for synchronization in the cluster

Good absolute synchronization; Desirable or required if you communicate with external services or provide an API for customers

Easy to maintain; It should be easy to add/remove nodes from the cluster without a need to change configuration on all nodes

Netiquette; While NTP itself is very modest in network bandwidth use, that’s not the case for public NTP servers. You should reduce their load if feasible

Configure the whole cluster as a mesh

NTP uses tree-like topology, but allows you to connect a pool of peers for better synchronization on the same strand level. This is ideal for synchronizing clocks relative to each other. Peers are defined similarly to servers in /etc/ntp.conf; just use the “peer” keyword instead of “server” (you may combine servers and peers, but more about it later):

We define that nodes c0-c2 are peers on the same layer and will be synchronized with each other. The restrict statement enables peering for a local network, assuming your instances are protected by a firewall for external access, but enabled within the cluster. NTP communicates via UDP on port 123. Restart NTP daemon:

This setting is not ideal, however. Each node acts independently and you have no control over which nodes will be synchronized to. You may well end up in a situation of smaller pools inside the cluster synchronized with each other, but diverging globally.

A relatively new orphan mode solves this problem by electing a leader each node synchronizes to. Add this statement in /etc/ntp.conf on all nodes:

tos orphan 7

to enable orphan mode. The mode is enabled when no server stratum less than 7 is reachable.

This setup will eventually synchronize clocks perfectly to each other. You are in danger of clock run-away however, and thus absolute time synchronization is suboptimal. NTP daemon handles missing nodes gracefully and therefore high availability is satisfied.

Maintaining the list of peer servers in NTP configuration and updating it with every change in the cluster is not ideal from a maintenance perspective. Orphan mode allows you to use broadcast or manycast discovery. Broadcast may not be available in a virtualized network and, if it is, don’t forget to enable authentication. Manycast works at the expense of maintaining a manycast server and reducing resilience against node failure.

+ relative clocks (stable in orphan mode)

– absolute clocks (risk of run-away)

+ high reliability (- for manycast server)

– maintenance (+ in auto-discovery orphan mode)

+ low network load

Use external NTP server and configure the whole cluster as a pool

Given clock run-away as the main disadvantage in the previous option, what about enabling synchronization with external servers and improving relative clocks by setting up a pool across nodes?

As nice as it may look like, this actually does not work as well as the previous option. You will end up with synchronized absolute clocks but relative clocks will not be affected. That’s because the NTP algorithm will detect an external time source as more reliable than those in the pool and will not take them as authoritative.

– relative clocks

? absolute clocks (similar as if all nodes are synchronized independently)

+ high availability

– maintenance

– high network load

Configure centralized NTP daemon

The next option is to dedicate one NTP server (a core server, possibly running on a separate instance). This server is synchronized with external servers while the rest of the cluster will synchronize with this one.

Apart from enabling the firewall you don’t need any special configuration on the core server. On the client side you will need to specify the core instance name (let it be 0.ntp). The /etc/ntp.conf file must contain this line:

server 0.ntp iburst

All instances in the cluster will synchronize with just one core server and therefore one clock. This setup will achieve good relative and absolute clock synchronization. Given that there is only one core server, there is no higher availability in case of instance failure. Using a separate static instance gives you flexibility during cluster scale and repairs.

You can additionally set up the orphan mode among nodes in the cluster to keep relative clocks synchronized in case of core server failure.

+ relative clocks

+ absolute clocks

– high availability (improved in orphan mode)

– maintenance (in case the instance is part of the scalable cluster)

+ low network load

Configure dedicated NTP pool

This option is similar to a dedicated NTP daemon, but this time you use a pool of NTP servers (core servers). Consider three instances 0.ntp, 1.ntp, 2.ntp, each run in a different availability zone with an NTP daemon configured to synchronize with external servers as well as each other in a pool.

Clients are configured to use all core servers, i.e. 0.ntp-2.ntp. For example, the /etc/ntp.conf file contains these lines:

server 0.ntp iburst
server 1.ntp iburst
server 2.ntp iburst

By deploying a pool of core servers we achieve high availability for the server side (partial network outage) as well as for the client side (instance failure). It also eases maintenance of the cluster since the pool is independent from the scalable cluster. The disadvantage lays in running additional instances. You can avoid running additional instances by using instances already available outside the scalable cluster (i.e. static instances) such as a database or mail server.

Notice that core servers experience some clock differences as if each node is separately synchronized with external servers. Setting them as peers will help in network outages, but not so much in synchronizing clocks relatively to each other. Since you have no control over which core server the client will select as authoritative, this results in worsening relative clock synchronization between clients – although significantly lower than if all clients were synchronized externally.

One solution is to use the prefer modifier to alter NTP’s selection algorithm. Assume we would change the configuration on all clients:

server 0.ntp iburst prefer
server 1.ntp iburst
server 2.ntp iburst

Then all clients will synchronize to 0.ntp node and switch on another one only if 0.ntp is down. Another option is to explicitly set increasing stratum numbers for all core servers assuming that clients will gravitate towards servers with lower strata. That’s more of a hack, though.

+ relative clocks

+ absolute clocks

+ high availability

+ maintenance

+ low network load

– requires static instances

Summary

If you are running a computational cluster you should consider running your own NTP server. Letting all instances synchronize their clocks independently leads to poor relative clock synchronization. It is also not considered good netiquette since you unnecessarily increase load on public NTP servers.

For a bigger, scalable cluster, the best option is to run your own NTP server pool externally synchronized. It gives you perfect relative and absolute clock synchronization, high availability, and easy maintenance.

Our own deployment synchronizes clocks of all nodes to a millisecond precision.

Interesting point. NTP itself adjusts time slowly, so this is not an issue in real life. However, you can force NTP to adjust time immediately which may mess up things – don’t do that. If the time adjustments may cause inconsistencies, I would recommend changing how you store data. Append-only data model (silimar to event sourcing) is a good start.