Setting up Elixir cluster using Docker and Rancher

In the previous post
we went through putting Elixir app inside Docker image.
One downside of running Elixir inside Docker is that containers have their own network: even when running on the same physical host, two Elixir nodes in two separate containers can’t connect to each other by default.

What’s more, we are deploying Elixir (and all other) containers using Rancher, which
distributes containers across multiple physical machines. Because of this, we can’t simply expose ports from a Docker
container to the host - on every host there can be zero, one or more containers running the same application.
Fortunately, we can use Rancher’s built-in DNS service discovery to automatically discover new nodes and connect
to them.

How Rancher works

One of the basic building blocks of Rancher is the service.
Each service is defined as a Docker image plus configuration
(ENV, command, volumes, etc.) and can run multiple identical containers across multiple machines.

To make routing between these containers work, Rancher provides its own overlay network and DNS-based service discovery.

For example, given a service named “my-hello-service” with three containers, running nslookup from inside one
of the containers (or a linked container) will give something like this:
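A session might look roughly like the following (the service name and all IP addresses here are illustrative; Rancher hands out addresses from its 10.42.0.0/16 overlay range and answers DNS from its metadata/DNS service):

```
$ nslookup my-hello-service
Server:    169.254.169.250
Address:   169.254.169.250#53

Name:      my-hello-service
Address:   10.42.146.52
Name:      my-hello-service
Address:   10.42.240.66
Name:      my-hello-service
Address:   10.42.113.74
```

One A record is returned per running container of the service.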

Since all containers in the same logical service can ping each other via this overlay network, we only need to
make each node aware of the others.

Everything is dynamic

The IP addresses shown above are all dynamic - their lifetime is tied to the container’s lifetime.
They will change whenever a container is restarted or upgraded, or when the service is scaled (containers added or removed).
Because of that, we can’t use a static sys.config file
(described e.g. in this post).

Instead we will make our app aware of Rancher DNS and benefit from it as much as possible.

Configuring Elixir node

But before we get to node discovery, we first need to make sure the Elixir nodes can see each other.
When using mix_docker (or in fact distillery)
the default node name is set to appname@127.0.0.1.
If we want to connect to other nodes in the 10.42.x.x network, we need to change that.
This setting also needs to be dynamic and set from within the container when it starts -
only then can we be sure of its Rancher overlay network IP address.

I’ve spent quite some time figuring out how to do this and finally found a solution.
In a nutshell, we need to make vm.args aware of ENV variables and then set those variables on container boot.
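A minimal sketch could look like this, assuming a distillery-based release named `hello` installed under `/opt/app` (both names are placeholders for your own app). Distillery substitutes `${...}` in vm.args when `REPLACE_OS_VARS=true` is set:

```
## rel/vm.args
## Name of the node - RANCHER_IP is set by the boot script below
-name hello@${RANCHER_IP}

## Cookie for distributed Erlang
-setcookie my-secret-cookie
```

```sh
#!/bin/sh
# rel/rancher_boot.sh - used as the container entrypoint.
# Ask the Rancher metadata service for this container's overlay IP,
# then start the release with that IP baked into the node name.
set -e

export REPLACE_OS_VARS=true
export RANCHER_IP=$(wget -qO- http://rancher-metadata/latest/self/container/primary_ip)

exec /opt/app/bin/hello "$@"
```

The `rancher-metadata` endpoint is only resolvable from inside Rancher’s network, which is fine - that is exactly where the container runs.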

After building the image and upgrading the Rancher service to it, you can check whether the containers can connect to each other.
This can be done by SSH-ing into the remote machine and running docker exec like this:
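A session along these lines (host name, container id, and release path are placeholders for your own setup; `remote_console` is distillery’s standard command for attaching an iex shell to a running release):

```
$ ssh my-rancher-host-1
$ docker exec -it 7ba1f4ce0143 /opt/app/bin/hello remote_console
iex(hello@10.42.146.52)1>
```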

As you can see in the iex prompt, the node is set to use the Rancher overlay network IP.
Now start another container for the same service and get its IP (it will also be in the 10.42.x.x network).
To confirm that the nodes can see each other, you can try to connect from one to the other using Node.connect.
Assuming the IP address of the second container is 10.42.240.66:
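From the first node’s remote console (node name `hello` assumed, as above):

```elixir
iex(hello@10.42.146.52)1> Node.connect(:"hello@10.42.240.66")
true
iex(hello@10.42.146.52)2> Node.list
[:"hello@10.42.240.66"]
```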

Now you can do all the standard distributed Elixir/Erlang stuff using Node module.

DNS-based auto-discovery

In the previous sections we managed to set up a connection between two Elixir nodes, but it required providing the other node’s
IP address manually. This is definitely not acceptable for production, so let’s make use of the aforementioned DNS.

The Elixir equivalent of nslookup is :inet.gethostbyname, or more specifically :inet_tcp.getaddrs(name).
Running this function inside one of our containers will return the IP addresses of all the service’s containers:
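For example (service name as before, IP addresses illustrative; note that Erlang functions expect the hostname as a charlist, and IPs come back as tuples):

```elixir
iex(hello@10.42.146.52)1> :inet_tcp.getaddrs('my-hello-service')
{:ok, [{10, 42, 146, 52}, {10, 42, 240, 66}, {10, 42, 113, 74}]}
```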

All that is left is to spawn an Erlang process on every node that will periodically call this function and connect to the other nodes.
We can implement it as a simple GenServer that checks the DNS every 5 seconds.
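A sketch of such a GenServer, assuming the OTP app is named :hello and the node name prefix is `hello` (both placeholders for your own app):

```elixir
defmodule Hello.Rancher do
  use GenServer

  # How often to re-resolve the service DNS name
  @connect_interval 5_000

  def start_link do
    GenServer.start_link(__MODULE__, [], name: __MODULE__)
  end

  def init([]) do
    # Service name to resolve, taken from app config (see below)
    name = Application.fetch_env!(:hello, :rancher_service_name)
    send(self(), :connect)
    {:ok, to_charlist(name)}
  end

  def handle_info(:connect, name) do
    case :inet_tcp.getaddrs(name) do
      {:ok, ips} ->
        # Try to connect to every container of the service;
        # connecting to an already-connected node is a no-op
        for {a, b, c, d} <- ips do
          Node.connect(:"hello@#{a}.#{b}.#{c}.#{d}")
        end

      {:error, reason} ->
        IO.puts("Error resolving #{inspect(name)}: #{inspect(reason)}")
    end

    IO.puts("Connected nodes: #{inspect(Node.list())}")

    Process.send_after(self(), :connect, @connect_interval)
    {:noreply, name}
  end
end
```

Add `Hello.Rancher` to your application’s supervision tree so it starts with the node.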

We do not need to worry about nodes that disappear - they will be removed from the node list as soon as they disconnect.
Also, Node.connect does not mind connecting twice to the same node, so there is no need to check Node.list before
connecting.

One last thing that will make your life much easier - dynamically setting Rancher service name.

As you can see in the code above, the name of the service (and hence the DNS host) is taken from the application config key :rancher_service_name.
We could set this to a static value in config/prod.exs, but it is much better to make it auto-discoverable too.

We are again using Rancher’s internal metadata service to get the name of the current service.
This allows us to reuse the same container image for different services (with possibly different runtime configuration).
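One way to wire this up, assuming the same distillery setup as before (boot script names and the :hello app name are placeholders): export the service name in the boot script, then reference it from config via distillery’s `${...}` substitution.

```sh
# in rel/rancher_boot.sh, before starting the release:
export RANCHER_SERVICE_NAME=$(wget -qO- http://rancher-metadata/latest/self/service/name)
```

```elixir
# config/prod.exs - resolved at boot thanks to REPLACE_OS_VARS=true
config :hello, :rancher_service_name, "${RANCHER_SERVICE_NAME}"
```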

Show time!

In the short screencast below you can see it all in action.
On the left side there is the log stream from one of the containers,
and on the right side there is the list of containers running under the “my-hello-service” service.
As I add or remove containers by scaling the service, the logged container catches up with the changes
and updates its list of connected nodes.