Kubernetes, RabbitMQ and Celery provide a very natural way to create a reliable Python worker cluster. This post is based on my experience running Celery in production at Gorgias over the past 3 years. It covers mostly the dev-ops setup and a few small gotchas that could prove useful for people attempting the same type of deployment. At the end we’ll shut down a machine to see if our cluster is indeed as reliable as I claim.

Before diving too deep, I recommend refreshing your knowledge of the tools we are going to use. In particular I expect:

You have some familiarity with Kubernetes and in particular: pods, services, deployments, stateful sets and persistent volumes.

You have used or know about Celery: how it schedules its tasks and executes them.

Pieces falling into place: Why Kubernetes, RabbitMQ and Celery?

Kubernetes is a very reliable container orchestration system (it runs your Docker images). It is used by a vast number of companies in production environments. It’s proven tech and provides a good base for making failure-tolerant and scalable applications such as an async worker cluster, which is what we’re trying to build.

RabbitMQ is a popular open source broker that has a history of being resilient to failure, can be configured to be highly-available and can protect your environment from data-loss in case of a hardware failure.

Celery is probably the most popular Python async worker at this moment. It’s feature-rich, stable and actively maintained. Celery (or any other worker) is distributed by its nature and relies on the message broker (RabbitMQ in our case) for state synchronisation. It’s also what we use at Gorgias to run asynchronous tasks.

Given their properties, I hope that when we put all of the above components together you will have a robust worker cluster that is easy to scale and hard to break.

Kubernetes (k8s) and Helm setup

First we’ll need to run k8s, either via minikube on your local machine or using a k8s provider such as Google Kubernetes Engine (GKE). For this tutorial I’m going to use GKE, but feel free to use any Kubernetes environment.

Assuming you have gcloud set up, let’s create a new 3-node Kubernetes cluster. It will take a few minutes, so feel free to grab some refreshment:

Make sure you delete your cluster (so you don’t get charged) after you’re done like so:

gcloud container clusters delete ha-celery -z us-east1-c

Test that you have access to it:

kubectl cluster-info

Great. Now let’s install Helm. Helm is a package manager that will allow us to create a template for our RabbitMQ and Celery deployment. We can use it to run a development, staging or production environment without having to maintain separate configs for each one. It’s also easier to use than running kubectl commands when dealing with multiple k8s primitives at a time.

Above we have a custom RabbitMQ Dockerfile that inherits from the official RabbitMQ image and adds a few extra features. It installs the rabbitmq_management plugin, which I highly recommend if you want to understand what’s going on in production. It copies our rabbitmq.config and cmd.sh, which we’ll see later. Finally, it exposes the default RMQ ports.

The first part about loopback_users and listeners is pretty straightforward. The hipe_compile setting is false, but it can be set to true if your Erlang installation supports HiPE (High Performance Erlang).

Now what I believe to be an extremely important setting is cluster_partition_handling which has the default value set to ignore. It’s extremely important for it to be pause_minority instead in order to avoid split-world and data loss situations. Here’s why:

Imagine you have a 3-node RMQ cluster: rmq0, rmq1 and rmq2. All is running smoothly and then suddenly you have a network partition that separates rmq0 from the other 2 nodes. Note that all the nodes are still running, it’s just that they can’t communicate with each other. Now if you have cluster_partition_handling set to the default ignore, all the clients connected to rmq0 will still be able to read from it and, most importantly, write to it! Here’s a more concrete example:

Suppose we have 2 Celery workers (w0 and w1) that get tasks from the celery queue (the default one for Celery). w0 is connected to rmq0 and consumes the celery queue, and w1 is connected to rmq1 and does the same.

In the case of a network partition with the RMQ defaults, w0 will consume celery as if nothing happened. The problem is that w1 does the same on rmq1. In this situation the same task can be consumed 2 times! When the network partition is fixed, which of the rmq nodes in your cluster holds the truth? This is called a split-world or split-brain situation and it literally can cause your brain to split trying to untangle the mess. This happens more often than you might think, even on very reliable hardware and networks. In this situation you’re better off re-creating the queue from scratch, with some nasty data loss in the process.

Basically, setting cluster_partition_handling to ignore means that in the case of a network partition the RMQ nodes will just ignore the failure and continue running as if nothing happened, accepting regular consumer operations. Setting it to pause_minority will cause the nodes in the minority (in our case rmq0) to pause: they stop serving clients until the partition heals. Once the cluster is back, the paused nodes should sync with the others and get back on track. This IMHO is what the default should be because it avoids the split-world situations, which is what I believe most people really want.
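The majority rule behind pause_minority can be sketched in a few lines of Python. This is a toy model for illustration only, not RabbitMQ’s implementation: the function names should_pause and surviving_nodes are made up for this example, and it assumes a node considers itself in the minority when its side of the partition holds half of the cluster’s nodes or fewer.

```python
def should_pause(visible_nodes: int, cluster_size: int) -> bool:
    """Toy model of pause_minority: pause unless this side holds a strict majority."""
    # A side holding half of the cluster's nodes or fewer counts as a minority.
    return visible_nodes * 2 <= cluster_size


def surviving_nodes(partitions, cluster_size):
    """Return the nodes that keep serving clients after a network partition."""
    serving = []
    for group in partitions:
        if not should_pause(len(group), cluster_size):
            serving.extend(group)
    return serving


# rmq0 is cut off from rmq1 and rmq2 in a 3-node cluster.
partitions = [["rmq0"], ["rmq1", "rmq2"]]
print(surviving_nodes(partitions, cluster_size=3))  # ['rmq1', 'rmq2']
```

In this model only one side of the partition keeps accepting work, so w0 and w1 can never consume the same task from two diverging copies of the queue. Note that an even split (for example 1 and 1 in a 2-node cluster) pauses both sides, which is one reason to prefer an odd number of nodes.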

The above script will attempt to join a cluster if the $CLUSTERED and $CLUSTER_WITH vars are set, and then attempt to create a RMQ virtual host and a user. The part below sets the default HA policy for the vhost, which in our case is all, meaning that all queues will be replicated across all nodes:

This service defines a way for our application and the cluster to connect to the RMQ nodes. Note that clusterIP is set to None: we don’t want k8s to load-balance our RMQ cluster, we’ll do that at the Celery application level.

The StatefulSet will first claim a 10Gi persistent volume with the standard storage type for each of our nodes defined by replicaCount. Then it will start each rabbitmq pod and check if it’s alive and running.

Deploy the RabbitMQ cluster in Kubernetes

While still in the rabbitmq-statefulset directory we can deploy our chart to k8s using the helm command:

I’m not going to dig into this helm chart: it’s a simple k8s Deployment with 3 replicas that runs the celery worker command. Make sure that the workers are running using kubectl get pod, and then look at their logs using kubectl logs <name of the pod>

Schedule our tasks

Hopefully you have your workers running by now, but they are not doing anything yet. We need to use our scheduler to make them execute the count tasks:

kubectl create -f scheduler.yaml

Once the scheduler is started you should start seeing some activity in your worker logs:

Observe on which node rmq-0 is running. In the above example it’s gke-ha-celery-default-pool-5939d2ec-dc85 - this is the node we’re going to remove from the pool to cause a little havoc.

By default Celery connects to the first RMQ node in the list (see BROKER_URL in tasks.py). If the connection to the first broker fails, it moves on to the next one, and so on.
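As a rough illustration of that ordering, here is a minimal sketch of what such a broker list and a first-that-works walk over it could look like. The host names follow the pattern that appears in the worker logs later in this post; the password, the helper names broker_urls and first_reachable, and the is_up callback (a stand-in for a real connection attempt) are all assumptions made for this example. Celery handles the failover itself, so this only models the behaviour.

```python
def broker_urls(replicas, user="my_user", password="my_password", vhost="my_vhost"):
    """Build one AMQP URL per RabbitMQ pod of the StatefulSet, rmq-0 first."""
    return [
        f"amqp://{user}:{password}@rmq-{i}.rmq.default.svc.cluster.local:5672/{vhost}"
        for i in range(replicas)
    ]


def first_reachable(urls, is_up):
    """Walk the list in order and return the first broker that accepts a connection."""
    for url in urls:
        if is_up(url):
            return url
    raise ConnectionError("no broker in the list is reachable")


urls = broker_urls(3)
# Pretend rmq-0 went down with its node: the worker falls through to rmq-1.
chosen = first_reachable(urls, is_up=lambda url: "@rmq-0." not in url)
print(chosen)
```

Because every pod gets its own stable DNS name, the list can be written once and stays valid as pods are rescheduled onto other nodes.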

Before killing a k8s node to see what happens, let’s observe our cluster using these commands (run each one in its own terminal):

# same as above - but choose a pod that is NOT running on a node that you plan to kill
# the reason for this is to observe how the workers fail over to a different RMQ node.
kubectl logs -f counter-counter-3473528255-24z8m
# to see how rmq-0 dies
kubectl logs -f rmq-0
# to see how rmq-1 takes over the traffic from the workers
kubectl logs -f rmq-1
# to see how pods stop and then start again
watch "kubectl get pods -o wide"
# to see how the node gets unscheduled
watch "kubectl get nodes"

Next we’ll mark the k8s node that runs rmq-0 as unschedulable and then drain (kill) all pods on it. Choose the node that runs rmq-0; you can find it by running kubectl get pods -o wide:

# marks the node unschedulable (the one that runs rmq-0)
kubectl cordon gke-ha-celery-default-pool-5939d2ec-dc85
# kills all pods that run on it
kubectl drain --force --ignore-daemonsets gke-ha-celery-default-pool-5939d2ec-dc85

Now keep your eyes on your monitoring commands. You should see rmq-0 dying along with 1 or 2 Celery counters, since they might be running on the same node as rmq-0. If everything went according to plan you should see a log in your worker like:

consumer: Connection to broker lost. Trying to re-establish the connection...
...
Cannot connect to amqp://my_user:**@rmq-0.rmq.default.svc.cluster.local:5672/my_vhost: [Errno -2] Name or service not known.
...
Connected to amqp://my_user:**@rmq-1.rmq.default.svc.cluster.local:5672/my_vhost

After this period of failure, the workers should continue executing tasks just as before.

I also recommend looking at the rmq-1 logs to see how the clients start connecting to it and then how it accepts rmq-0 back into the cluster once it comes up again.

Conclusion

What we did so far:

Created a new k8s cluster.

Built and pushed the RabbitMQ and Celery images to the Google Container Registry.

Deployed the helm charts for the RabbitMQ and Celery clusters.

Removed a k8s node and observed the behavior of our workers and RMQ nodes.

Expect your workers to die at any moment and always code with that in mind.

IMHO this is the hardest part of all.

At Gorgias we’re sending tons of emails, chats and Facebook messages. Before sending them we also make HTTP requests to user-defined HTTP endpoints, which can fail with a timeout or an error. These are just a few of the questions that arise:

Should the emails be sent if other parts of the process have passed or not? If not how should we notify the customer?

What happens if the HTTP service we’re trying to reach times out? How many times should we retry before giving up? What happens then?

What if the worker is killed in the middle of the transaction with the mail server? Was the email sent or not? Should we retry? Should we notify the customer?

If the mail server is down do we have a retry mechanism and when should the retry switch to a different server?

The above are but a few of the questions that come to mind when thinking of our application. In reality there are many more and the answer is not always simple. The code required for failure handling makes the application much more verbose and harder to debug and maintain. We try to have fewer and simpler features precisely because so many things can go wrong.
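To make the retry question concrete, here is a minimal sketch of a bounded retry with exponential backoff. It is not the code we run at Gorgias: call_with_retries and flaky_endpoint are hypothetical names, and the defaults (3 attempts, 0.5 second base delay) are arbitrary. It also assumes the wrapped action is safe to run more than once, which is exactly the hard part when the action is "send an email".

```python
import time


def call_with_retries(action, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run action(); on failure wait base_delay * 2**attempt and retry, up to max_attempts."""
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what happens next
            sleep(base_delay * 2 ** attempt)


# Hypothetical endpoint call that times out twice before succeeding.
calls = {"count": 0}


def flaky_endpoint():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("endpoint timed out")
    return "ok"


# The sleep is stubbed out here so the example runs instantly.
print(call_with_retries(flaky_endpoint, sleep=lambda seconds: None))
```

Even this small helper leaves the real questions open: what to do once the last attempt fails, and whether a half-finished attempt already had side effects.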

Even though application-level failure handling is very hard, it’s still a million times better when you know that you can rely on Kubernetes and RabbitMQ to stay up and keep running your application code even if your VMs or physical machines go down. It’s a lot easier to build resilient and scalable applications than it was before Kubernetes, in my opinion, and I hope that this post illustrates just that.

This is a small tutorial on how to do incremental backups using pghoard for your PostgreSQL (I assume you’re running everything in Kubernetes). This is intended to help people to get started faster and not waste time finding the right dependencies, etc..

pghoard is a PostgreSQL backup daemon that incrementally backs up your files to an object store (S3, Google Cloud Storage, etc..). For this tutorial what we’re trying to achieve is to upload our PostgreSQL backups to S3.

The fascination with AGI has been mainstream for a long time, but it has gained even more momentum in recent years. Even Hollywood has become less naive with movies like Her and Ex Machina.

On the R&D side there is of course Deep Learning, which is a machine learning technique that uses neural networks with 1 hidden layer :P It has, I believe, changed forever the way people are doing research today. The hype is real because of the state-of-the-art results achieved with it and the way the skills translate across different fields of ML. AlphaGo beats the best player in the world, translation and image/voice recognition are becoming better, artistic style stealing, attention models, etc.. The best part is that it’s more or less the same RNN with different neuron architectures, backprop and gradient descent that works on a broad range of problems. Now people are looking for nails because they have a damn mighty hammer.

Of course hooking up a bunch of NVidia Pascals is not gonna give us AGI, and Moore’s law is not what it used to be. I could not agree more, but if we overcome the hardware issues (and I have high hopes that AR and VR are gonna push this) then it’s reasonable to assume that we’ll have the hardware to achieve at least weak AI soonish…

What about software? That may be a bigger problem. But.. I’m also optimistic here, because things like torch and more recently tensorflow are being given a ton of attention by some of the best minds in the AI world today. What’s really cool about these frameworks is that they are used every day in production on real products by startups and big corps alike. They are here to stay. It’s not enough, but I’m hopeful that things will improve.

Ok, so I want to say something that has been bugging me for a long time. Bear with me, I believe it’s important for the arguments that follow.

… is the intelligence of a (hypothetical) machine that could successfully perform any intellectual task that a human being can…

Now I have a problem with this definition because I would argue that, in a cosmic sense, we humans haven’t achieved what I would call general intelligence. We’re kind of good at surviving in the Earth’s atmosphere. We can do many things that are amazing and not accessible to most animals, but we’re still bound to our environment. I would argue we’re still narrow in our intelligence and can only grasp a small fraction of what’s out there.

There exists a true AGI, which is AIXI. It will seek to maximize its future reward in any computable environment (survive and expand), but there is this tiny little problem of requiring infinite memory and computing power in order for it to function. It’s useful just like the Turing machine is useful in the real world.

For any intelligent agent to be practical, it requires a favourable environment and a narrow specialisation for that environment. This is why I think that what we’re really after is strongish AI, which translates to being pretty cool in your neighbourhood.

Docker structure

The killer feature of Docker for us is that it allows us to make layered binary images of our app. What this means is that you can start with a minimal base image, then make a python image on top of that, then an app image on top of the python one, etc..

gorgias/app - This installs all the system dependencies: libpq, libxml, etc.. and then does pip install -r requirements.txt

gorgias/web - This sets up uWSGI and runs our Flask app

gorgias/worker - Celery worker

Piece of advice: if you used to run your app under supervisord, I would advise avoiding the temptation to do the same with Docker. Just let your container crash and let Kubernetes/Swarm/Mesos handle it.

If you know Stripe, Mailgun, or Zapier you might know what I’m talking about. They are all just a bunch of APIs. They are created to make running companies easier through automation. So we know that payments can be automated, billing, mail-delivery. But where is the limit?

What if there was a 100% software company that did client prospecting on its own, responded to clients on its own, resolved legal problems on its own and (blasphemy!) created a product on its own?

You get the picture.. everything on its own.

The people I talked to about this said I was crazy (and that I want to destroy humanity). Here’s what they say:

There is no way to get the accounting right (in France!!?! Crazy!!! Jail time!).

How would you even begin designing a product for users, have interviews with them, etc.. you would need a Hard AI! You totally 100% require a human for this.

They are right of course, but.. given that there are so many amazing tools that allow us to automate so many parts of our business then what remains unorganised, unstructured?

What if you don’t need human level intelligence if you just have better structured information? At least to make a stupid simple product.

I’m now going to borrow something from my art friends and say that I’m proposing an Art project. Look.. this is just an experiment, a joke, a way to show that building a business has nothing to do with having a human brain.

Of course, I’m not going to try to implement this Art project. What I’m really after is finding the remaining parts of a business that are difficult to automate and trying to make them automatic. Isn’t this what we are looking for? Look at all those SaaS companies trying to remove the pains, scale and automate stuff that wasn’t automated before. And they are so cheap too! Where is this all going to lead?

My prediction is that soon all we’re going to have is a bunch of cronjobs and message brokers loosely connecting the different APIs together, controlled by some reinforcement learning algorithm that looks to increase that Stripe balance. Think Zapier, but without you creating all the rules.

While this swarm-like AI is probably not technically feasible at the moment, I personally use it as a framework for thinking about products.

What hole is this product filling in my 100% software company?

Btw, if you’re looking to improve your customer support through automation, come check us out at Gorgias.