Upgrades in Nova: RPC APIs

This is a part of a series of posts on the details of how Nova supports live upgrades. It focuses on the very important task of doing proper versioning and handling compatibility in your RPC APIs, which is a baseline requirement for supporting environments with mixed versions. The details below are, of course, focused on Nova and should be applicable to other projects using oslo.messaging for their RPC layer.

If you’re not already familiar with RPC as it exists in many OpenStack projects, you might want to watch this video first.

Why We Need Versioning

It’s important to understand why we need to go to all the trouble that is described below. With a distributed system like Nova, you’ve got services running on many different machines communicating with each other over RPC. That means they’re sending messages with data which end up calling a function on a remote machine that does something and (usually) returns a result. The problem comes when one of those interfaces needs to change, which it inevitably will. Unless you take the entire deployment down, install the new code on everything at the same time, and then bring them back up together, you’re going to have some nodes running different versions of the code than others.

Versioning your RPC interfaces provides a mechanism to teach your newer nodes how to speak to older nodes when necessary, and defines some rules about how newer nodes can continue to honor requests that were valid in previous versions of the code. Both of these apply to some time period or restriction, allowing operators to upgrade from one release to the next, ensuring that everything has been upgraded before dropping compatibility with the old stuff. It’s this robustness that we as a project seek to provide with our RPC versioning strategy to make Nova operations easier.

Versioning the Interfaces

At the beginning of time, your RPC API is at version 1.0. As you evolve or expand it, you need to bump the version number in order to communicate the changes that were made. Whether you bump the major or minor depends on what you’re doing. Minor changes are for small additive revisions, or where the server can accept anything at a certain version back to a base level. The base level is a major version, and you bump to the next one when you need to drop compatibility with a bunch of minor revisions you’ve made. When you do that, of course, you need to support both major versions for a period of time so that deployers can upgrade all their clients to send the new major version before the servers drop support for the old one. I’ll focus on minor revisions below and save major bumps for a later post.

In Nova (and other projects), APIs are scoped to a given topic and each of those has a separate version number and implementation. The client side (typically rpcapi.py) and the server side (typically manager.py) both need to be concerned with the current and previous versions of the API and thus any change to the API will end up modifying both. Examples of named APIs from Nova are “compute”, “conductor”, and “scheduler”. Each has a client and a server piece, connected over the message bus by a topic.

First, an example of a minor version change from Nova’s compute API during the Juno cycle. We have an RPC call named rescue_instance() that needed to take a new parameter called rescue_image_ref. This change is described in detail below.

Server Side

The server side of the RPC API (usually manager.py) needs to accept the new parameter, but also tolerate the fact that older clients won’t be passing it. Before the change, our server code looked like this:

What you see here is that we’re currently at version 3.23 and rescue_instance() takes two parameters: instance and rescue_password (self and context are implied). In order to make the change, we bump the minor version of the API and add the parameter as an optional keyword argument:

Now, we have the new parameter, but if it’s not passed by an older client, it will have a default value (just like Python’s own method call semantics). If you change nothing else, this code will continue to work as it did before.

It’s important to note here that the target version that we changed doesn’t do anything other than tell the oslo.messaging code to allow calls that claim to be at version 3.24. It isn’t tied to the rescue_instance() method directly, nor do we get to know what version a client uses when they make a call. Our only indication is that rescue_image_ref could be non-None, but that’s all we should care about anyway. If we need to be able to pass None as a valid value for the parameter, we should use a different sentinel to indicate that the client didn’t pass any value.

Now comes the (potentially) tricky part. The server code needs to tolerate calls made with and without the rescue_image_ref parameter. In this case, it’s not very complicated: we just check to see if the parameter is None, and if so, we look up a default image and carry on. The actual code in nova has a little more indirection, but it’s basically this:

def rescue_instance(self, context, instance, rescue_password,
rescue_image_ref=None):
if rescue_image_ref is None:
# NOTE(danms): Client is old, so mimic the old behavior
# and use the default image
# FIXME(danms): Remove this in v4.0 of the RPC API
rescue_image_ref = get_default_rescue_image()
....

Now, the rest of the code below can assume the presence of rescue_image_ref and we’ll be tolerant of older clients that expected the default image, as well as newer clients that provided a different one. We made a NOTE indicating why we’re doing this, and left a FIXME to remove the check in v4.0. Since we can’t remove or change parameters in a minor version, we have to wait to actually make rescue_image_ref mandatory until v4.0. More about that later.

Client Side

There is more work to do before this change is useful: we need to make the client actually pass the parameter. The client part is typically in rpcapi.py and is where we also (conventionally) document each change that we make. Before this change, the client code for this call looked like this (with some irrelevant details removed for clarity):

While the actual method is a little more complicated because it has changed multiple times in the 3.x API, this is basically what it looks like ignoring that other change. We take just the instance and rescue_password parameters, declare that we’re using version 3.0 and make the cast which sends a message over the bus to the server side.

In order to make the change, we add the parameter to the method, but we only include it in the actual RPC call if we’re “allowed” to send the newer version. If we’re not, then we drop that parameter and make the call at the 3.0 level, compatible with what it was at that time. Again, with distractions removed, the new implementation looks like this:

As you can see, we now check to see if version 3.24 is allowed. If so, we include the new parameter in the dict of parameters we’re going to use for the call. If not, we don’t. In either case, we send the version number that lines up with the call as we’re making it. Of course, if we were to make multiple changes to this call in a single major version, we would have to support more than two possible outbound versions (like this). The details of how client_can_send_version() knows what versions are okay will be explained later.

Another important part of this change is documenting what we did for later. The convention is that we do so in a big docstring at the top of the client class. Including as much detail as possible will definitely be appreciated later, so don’t be too terse. This change added a new line like this:

* 3.24 - Update rescue_instance() to take optional
rescue_image_ref

In this case, this is enough information to determine later what was changed. If multiple things were changed (multiple new arguments, changes to multiple calls, etc) they should all be listed here for posterity.

So, with this change, we have a server that can tolerate calls from older clients that don’t provide the new parameter, and a client that can make the older version of the call, if necessary. This was a pretty simple case, of course, and so there may be other changes required on either side to properly handle the fact that a parameter can’t be passed, or that some piece of data isn’t received. Here it was easy for the server to look up a suitable value for the missing parameter, but it may not always be that easy.

Gotchas and special cases

There are many categories of changes that may need to be made to an RPC API, and of course I cheated by choosing the easiest to illustrate above. In reality, the corner cases are most likely to break upgrades, so they deserve careful handling.

The first and most important is a change that alters the format of a parameter. Since the server side doesn’t receive the client’s version, it may have a very hard time determining which format something is in. Even worse, such a change may occur deep in the DB layer and not be reflected in the RPC API at all, which could result in a client sending a complex structure in a too-old or too-new format for the server to understand, and no version bump was made at all to indicate to either side that something has changed. This case is the reason we started working on what is now oslo.versionedobjects — more on that later.

Another change that must be handled carefully is the renaming or removal of a parameter. When a call is dispatched on the server side as a result of a received message, it is done so by keyword, even if the method’s arguments are positional. This means that if you change the name of a positional parameter, the server will fail to make the call to your method as if you passed a keyword argument to a python method that it wasn’t expecting. The same goes for a removed parameter of course.

In Nova, we typically handle these by not renaming things unless it’s absolutely necessary, and never removing any parameters until major version bumps. If we do rename a parameter, we continue to accept both and honor them in order in the actual implementation, the newer taking precedence if both are provided.

Version Pins

Above, I waved my hands over the can_send_version() call, which magically knew whether we could send the newer version or not. In Nova, we (currently) handle this by allowing versions for each service to be pinned in the config file. We honor that pin on the client side in the initialization of the RPC API class like this:

What this does is initialize our base version to 3.0, and then calculate the version_cap, if necessary that our client should obey. To make it easier on the operators, we define some aliases, allowing them to use release names in the config file instead of actual version numbers. So, we get the version_cap, which is either the alias based on the config, or the actual value from the config if there is no alias, or None if they didn’t set it. When we initialize the client, it gets the version that matches their alias, the version they specified, or None (i.e. no limit) if not. This is what makes the can_send_version() method able to tell us whether a given version is okay to use (i.e. if it’s below the version_cap, if one is set).

What services/APIs should be pinned, when, and to what value will depend on the architecture of the project. In Nova, during an upgrade, we require the operators to upgrade the control services before the compute nodes. This means that when they’ve upgraded from, say Juno to Kilo, the control nodes running Kilo will have their compute versions pinned to the Juno level until all the computes are upgraded. Once that happens, we know that it’s okay to send the newer version of all the calls, so the version pin is removed.

Aside from the process of bumping the major version of the RPC API to drop compatibility with older nodes, this is pretty much all you have to do in order to make your RPC API tolerate mixed versions in a single deployment. However, as described above, there is a lot more work required to make these interfaces really clean, and not leak version-specific structures over the network to nodes that potentially can’t handle them.