Imagine that two colleagues, Alice and Bob, issue a command to launch a new virtual machine at approximately the same moment in time. Both Alice’s and Bob’s virtual machines must be given an IP address within the range of IP addresses granted to their project. Let’s say that range is 192.168.20.0/28, which would allow for a total of 16 IP addresses for virtual machines [1]. At some point during the launch sequence of these instances, Nova must assign one of those addresses to each virtual machine.
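As a quick sanity check of that address arithmetic, here is a minimal sketch using Python's standard ipaddress module. The variable names are mine, chosen purely for illustration.

```python
import ipaddress

# The example project range from the paragraph above.
network = ipaddress.ip_network("192.168.20.0/28")

# A /28 leaves 32 - 28 = 4 host bits, so 2 ** 4 addresses in total.
total_addresses = network.num_addresses

# First and last addresses in the block.
first, last = network[0], network[-1]

print(total_addresses, first, last)
```

Note that this counts every address in the block, including the ones that end up reserved rather than handed out to virtual machines.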

How do we prevent Nova from assigning the same IP address to both virtual machines?

In this blog post, I’ll try to answer the above question and shed some light on issues that have surfaced in the way OpenStack projects currently solve (and sometimes fail to solve) this problem.

Demonstrating the problem

figure A

Dramatically simplified, the launch sequence of Nova looks like figure A. Of course, I’m leaving out hugely important steps, like the provisioning and handling of block devices, but the figure demonstrates the important steps in the launch sequence for the purposes of our discussion here. The specific step in which we find our IP address reservation problem is the determine networking details step.

figure B

Now, within the determine networking details step, we have a set of tasks that looks like figure B. All of the tasks except the last revolve around interacting with the Nova database [2]. The tasks are all pretty straightforward: we grab a record for a “free” IP address from the database and mark it “assigned” by setting the IP address record’s instance ID to the ID of the instance being launched, and the host field to the ID of the compute node that was selected during the determine host machine step in figure A. We then save the updated record to the database.

OK, so back to our problem situation. Imagine if Alice and Bob’s launch requests were made at essentially the same moment in time, and that both requests arrived at the start of the determine networking details step at the same point in time, but that the tasks from figure B are executed in an interleaved fashion between Alice and Bob’s requests like figure C shows.

figure C

If you step through the numbered actions in both Alice and Bob’s request process, you will notice a problem. Actions #7 and #9 will both return the same IP address information to their callers. Worse, the database record for that single IP address will show the IP address is assigned to Alice’s instance, even though Bob’s instance was (very briefly) assigned to the IP address because the database update in action #5 occurred (and succeeded) before the database update in action #8 occurred (and also succeeded). In the words of Mr. Mackey, “this is bad, m’kay”.
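To make the race concrete, here is a small sketch that replays one possible interleaving by hand against an in-memory SQLite database. Everything in it is an assumption for illustration: the three-column fixed_ips schema, the helper names, and the IDs used for Bob's instance and host are mine, and this is not Nova's actual code.

```python
import sqlite3

def make_db():
    # Assumed, simplified schema: only the columns this post talks about.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fixed_ips ("
        "id INTEGER PRIMARY KEY, host INTEGER, instance_id INTEGER)")
    conn.executemany("INSERT INTO fixed_ips (id) VALUES (?)",
                     [(1,), (2,), (3,)])
    conn.commit()
    return conn

def grab_free_ip(conn):
    # Read the "first" free record, with no locking of any kind.
    row = conn.execute(
        "SELECT id FROM fixed_ips "
        "WHERE host IS NULL AND instance_id IS NULL "
        "ORDER BY id LIMIT 1").fetchone()
    return row[0]

def assign_ip(conn, ip_id, host, instance_id):
    # Blindly overwrite the record, whatever its current state.
    conn.execute(
        "UPDATE fixed_ips SET host = ?, instance_id = ? WHERE id = ?",
        (host, instance_id, ip_id))
    conn.commit()

conn = make_db()

# Both requests read before either one writes.
alice_ip = grab_free_ip(conn)
bob_ip = grab_free_ip(conn)

# Bob's update lands first, then Alice's overwrites it.
assign_ip(conn, bob_ip, host=98, instance_id=43)
assign_ip(conn, alice_ip, host=99, instance_id=42)

final_owner = conn.execute(
    "SELECT instance_id FROM fixed_ips WHERE id = ?",
    (alice_ip,)).fetchone()[0]
print(alice_ip, bob_ip, final_owner)
```

Both callers walk away believing they own the same address, and the database only remembers the last writer.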

There are a number of ways to solve this problem. Nova happens to employ a traditional solution: database-level write-intent locks.

Database-level Locking

At its core, any locking solution is intended to protect some critical piece of data from simultaneous changes. Write-intent locks in traditional database systems are no different. One thread announces that it intends to change one or more records that it is reading from the database. The database server will mark the records in question as locked by the thread, and return the records to the thread. While these locks are held, any other thread that attempts to either read the same records with the intent to write, or write changes to those records, will get what is called a lock wait.

Only once the thread indicates that it is finished making changes to the records in question — by issuing a COMMIT statement — will the database release the locks on the records. What this lock strategy accomplishes is prevention of two threads simultaneously reading the same piece of data that they intend to change. One thread will wait for the other thread to finish reading and changing the data before its read succeeds. This means that using a write-intent lock on the database system results in the following order of events:

figure D

For MySQL and PostgreSQL, the SQL construct used to tell the database server that the calling thread intends to change the records it is asking for is SELECT ... FOR UPDATE.

Using a couple of MySQL command-line client sessions, I’ll show you what effect this SELECT FOR UPDATE construct has on a normal MySQL database server (though the effect is identical for PostgreSQL). I created a test database table called fixed_ips that looks like the following:

I’ve highlighted in red above the important things to note about the interplay between session A and session B. The 42.03 seconds is important: it shows the amount of time the SELECT ... FOR UPDATE statement waited on the write-intent locks held by session A. Secondly, the 3 returned by session B’s SELECT ... FOR UPDATE statement indicates that a different row was returned for the same query that session A issued. In other words, MySQL waited until session A issued a COMMIT before executing session B’s SELECT ... FOR UPDATE statement.

In this way, the write-intent locks constructed with the SELECT ... FOR UPDATE statement prevent the collision of threads changing the same record at the same time.
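If you want to play with the idea without a database server, here is a rough in-process analogy. It is only an analogy: a Python threading.Lock stands in for the write-intent lock, a dict stands in for the fixed_ips table, and all names are made up for this sketch. It does not issue SELECT ... FOR UPDATE against any database.

```python
import threading

# Stand-in for the fixed_ips table: IP record ID -> owning instance ID.
fixed_ips = {1: None, 2: None, 3: None}

# Stand-in for the write-intent lock held from the read until the COMMIT.
row_lock = threading.Lock()

# Which IP record each instance ended up with.
assigned = {}

def launch(instance_id):
    # Only one thread at a time may read-then-write; the other one waits,
    # much like a session stuck in a lock wait.
    with row_lock:
        free_id = min(ip for ip, owner in fixed_ips.items() if owner is None)
        fixed_ips[free_id] = instance_id
    assigned[instance_id] = free_id

threads = [threading.Thread(target=launch, args=(i,)) for i in (42, 43)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(assigned)
```

Whichever thread gets the lock first takes record 1, and the thread that waited sees the updated state and takes record 2, so the two instances never share an address.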

How locks “fail” with MySQL Galera Cluster

At the Atlanta design summit, I co-led an Ops Meetup session on databases and was actually surprised by my poll of who was using which database server for their OpenStack deployments. Out of approximately 220 people in the room, MySQL Galera Cluster was by far the most popular way of deploying MySQL for use by OpenStack services, with around 200 or so operators raising their hands that they used it. Standard MySQL was next, and there was one person using PostgreSQL.

MySQL Galera Cluster is a system that wraps the standard MySQL row-level binary replication log transmission with something called write-set replication (which I call working-set replication in this post), enabling synchronous replication between many nodes running the MySQL database server. Now, that’s a lot of fancy words to really say that Galera Cluster allows you to run a cluster of database nodes that do not suffer from replication slave lag. You are guaranteed that the data on disk on each of the nodes in a Galera Cluster is exactly the same.

One interesting thing about MySQL Galera Cluster is that it can efficiently handle writes to any node in the cluster. This is different from standard MySQL replication, which generally relies on a single master database server that handles writes and real-time reads, and one or more slave database servers that serve read requests from applications that can tolerate some level of lag between the master and slave. Many people refer to this setup as multi-master mode, but that is actually a misnomer, because with Galera Cluster, there is no such thing as a master and a slave. Every node in a cluster is the same. Each can apply writes coming to the node directly from a MySQL client. For this reason, I like to refer to such a setup as multi-writer mode.

This ability to have writes be directed to and processed by any node in the Galera Cluster is actually pretty awesome. You can direct a load balancer to spread read and write load across all nodes in the cluster, allowing you to scale writes as well as reads. This multi-writer mode is ideal for WAN-replicated environments, believe it or not, as long as the amount of data being written to is not crazy-huge (think: Ceilometer), because you can have application servers send writes to the closest database server in the cluster, and let Galera handle the efficiency of transmitting writesets across the WAN.

However, there’s a catch. Peter Boros, a principal architect at Percona, a company that makes a specialized version of Galera Cluster called Percona XtraDB Cluster, was actually the first to inform the OpenStack community about this catch — in the aforementioned Ops Meetup session. The problem with MySQL Galera Cluster is that it does not replicate the write-intent locks for SELECT ... FOR UPDATE statements. There’s actually a really good reason for this. Galera does not have any idea about the write-intent locks, because those locks are constructions of the underlying InnoDB storage engine, not the MySQL database server itself. So, there’s no good way for InnoDB to communicate to the MySQL row-based replication stream that write-intent locks are being held inside of InnoDB for a particular thread’s SELECT ... FOR UPDATE statement [3].

Figure E

The ramifications of this catch are interesting, indeed. If two application server threads issue the same SELECT ... FOR UPDATE request to a load balancer at the same time, and the load balancer directs each thread to a different Galera Cluster node, both threads will be handed the exact same record(s) with no lock waits[4]. Figure E illustrates this phenomenon, with the circled 1, 2, and 3 events representing things occurring at exactly the same time (due to no locks being acquired/held).

One might be tempted to say that Galera Cluster, due to its lack of support for SELECT ... FOR UPDATE write-intent locks, is no longer ACID-compliant, since now two threads can simultaneously select the same record with the intent of changing it. And while it is indeed true that two threads can select the same record with the intent of changing it, it is extremely important to point out that Galera Cluster is still ACID-compliant.

The reason is that even though two threads can simultaneously read the same record with the intent of changing it (which is the identical behaviour that would be seen if the FOR UPDATE was left off the SELECT statement), if both threads attempt to write a change to the same record via an UPDATE statement, either one or none of the threads would succeed in updating the record, but not both. This follows from the way that Galera Cluster certifies a working set (the set of changes to data). If node 1 writes an update to disk, it must certify with a quorum of nodes in the cluster that its update does not conflict with updates on those nodes. If node 3 has begun changing the same row of data, but has not certified with the other nodes in the cluster for that working set, then it will fail to certify the original working set from node 1 and will send a certification failure back to node 1.

This certification failure manifests itself as a MySQL deadlock error, specifically error 1213, which will look like this:

ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

All nodes other than the one that first “won” — i.e. successfully committed and certified its transaction — will return this deadlock error to any other thread that attempted to change the same record(s) at the same time as the thread that “won”. Need a visual of all this interplay? Check out figure F, which I scraped together for the graphically-inclined.

Figure F

If you ever wondered why, in the Nova codebase, we make prodigious use of a decorator called @_retry_on_deadlock in the SQLAlchemy API module, it is partly because of this issue. These deadlock errors can be consistently triggered by running load tests or things like Tempest that can put a load on the database that forces “hot spots” in the data to occur. This decorator does exactly what you’d think it would do: it retries the transaction if a deadlock error is returned from the database server.
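For readers who have not seen such a decorator, here is a minimal sketch of the general retry-on-deadlock pattern. To be clear, this is not Nova's actual implementation: DeadlockError is a hypothetical stand-in for whatever exception your database driver raises for error 1213, and the function names and parameters are invented for the example.

```python
import functools
import time

class DeadlockError(Exception):
    """Hypothetical stand-in for a driver-level MySQL error 1213."""

def retry_on_deadlock(max_attempts=5, delay=0.0):
    """Re-run the wrapped function when it raises DeadlockError."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except DeadlockError:
                    # Give up once the attempt budget is exhausted.
                    if attempt == max_attempts:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator

attempts = []

@retry_on_deadlock(max_attempts=3)
def flaky_transaction():
    # Pretend the first two attempts lose the certification race.
    attempts.append(1)
    if len(attempts) < 3:
        raise DeadlockError("Deadlock found when trying to get lock")
    return "committed"

result = flaky_transaction()
print(result, len(attempts))
```

The important detail is that the whole transaction is re-run from the top, which is exactly why this approach is reactive rather than proactive.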

So, given what we know about MySQL Galera Cluster, one thing we are trying to do is entirely remove any use of SELECT ... FOR UPDATE from the Nova code base. Since we know it doesn’t work the way people think it works on Galera Cluster, we might as well stop using this construct in our code. However, the retry-on-deadlock mechanism is actually not the most effective or efficient mechanism we could use to solve the concurrent update problems in the Nova code base. There is another technique, which I’ll call compare and swap, which offers a variety of benefits over the retry-on-deadlock technique.

Compare and swap

One of the drawbacks to the retry-on-deadlock method of handling concurrency problems is that it is reactive by nature. We essentially wrap calls that may tend to deadlock with a decorator that catches the deadlock error if it arises and retries the entire database transaction. The problem with this is that the deadlock error that manifests itself from the Galera Cluster working set certification failure (see Figure F above) takes a not insignificant amount of time to occur.

Think about it. A thread manages to start a write transaction on a Galera Cluster node. It writes the transaction on the local node and gets all the way up to the point of doing the COMMIT. At that point, the node sends out a certification request to each node in the cluster (in parallel). It must wait until a quorum of those nodes respond with a successful certification. If another node has an active working set that changes the same modified rows, then a deadlock will occur, and that deadlock will eventually bubble its way back to the caller, who will retry the exact same database transaction. All of these things, while individually very quick in Galera Cluster, do take some amount of time.

Is there a technique that would allow us to structure our SQL statements in such a way that we can avoid the roundtrips from one Galera Cluster node to the other nodes? Well, there is.

Consider the following SQL statements, taken from the above CLI examples:

Now, we know that the locks taken for the FOR UPDATE statement won’t actually be considered by any other nodes in a Galera Cluster, so we need to get rid of the use of SELECT ... FOR UPDATE. But how can we structure things so that the SQL code sent to any node in the Galera Cluster will guarantee that we will not stumble into a deadlock error, and that the cluster node we end up executing our statements on will not need to contact any other node to determine whether another thread has updated the same record between the time that we SELECT’d our record and the time we go to UPDATE it?

The answer lies in constructing an UPDATE statement that contains a WHERE clause that contains all the fields from the previously SELECT‘ed record, like so:

/* Grab the "first" unassigned IP address */
SELECT id FROM fixed_ips
WHERE host IS NULL AND instance_id IS NULL
ORDER BY id
LIMIT 1;

/* Let's assume that the above query returned the
   fixed_ip with ID of 1.
   We now "assign" the IP address to instance #42
   and on host #99, but specify that the host and
   instance_id fields must match our original view
   of that record -- i.e., they must both be NULL
*/
UPDATE fixed_ips
SET host = 99, instance_id = 42
WHERE id = 1 AND host IS NULL AND instance_id IS NULL;

If we structure our application code so that it is executing the above SQL statements, each statement can be executed on any node in the cluster, without waiting for certification failures to occur before “knowing” if the UPDATE would succeed. Remember that working set certification in Galera only happens once the local node (i.e. the node originally receiving the SQL statement) is ready to COMMIT the changes. Well, if thread B managed to update the fixed_ip record with id = 1 in between the time when thread A does its SELECT and the time thread A does its UPDATE, then the WHERE condition:

WHERE id = 1 AND host IS NULL AND instance_id IS NULL;

will fail to select any rows in the database to update, since host IS NULL AND instance_id IS NULL will no longer be true if another thread updated the record. We can catch this failure to update any rows in the database more efficiently than the certification timeout, since the thread that sent the UPDATE ... WHERE ... host IS NULL AND instance_id IS NULL statement will receive notification about no rows updated before any certification traffic would ever be generated (since there’s no certification needed if nothing was updated).

Do we still need a retry mechanism? Yes, of course we do, in order to retry the SELECT, then UPDATE ... WHERE statements when a previous UPDATE ... WHERE statement returned zero rows affected. The difference between this compare-and-swap approach and the brute-force retry-on-deadlock approach is that we’re no longer reacting to an exception being emitted after some timeout of certification, but instead being proactive and just structuring our UPDATE statement to pass in our previous view of the record we want to change, allowing for a tighter retry loop to occur (no timeout waits needed, simply detect whether rows_affected is greater than zero).
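The retry loop described above can be sketched in a few lines of Python. This is a sketch under stated assumptions, not Nova code: it runs against an in-memory SQLite database with an assumed three-column fixed_ips table, the function name and its parameters are hypothetical, and the before_update hook exists only so the example can inject a competing writer between the SELECT and the UPDATE.

```python
import sqlite3

def assign_fixed_ip(conn, host, instance_id, max_attempts=5,
                    before_update=None):
    """Compare-and-swap assignment: retry while the UPDATE matches no rows."""
    for _ in range(max_attempts):
        row = conn.execute(
            "SELECT id FROM fixed_ips "
            "WHERE host IS NULL AND instance_id IS NULL "
            "ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return None  # no free addresses left
        if before_update is not None:
            before_update(row[0])  # test hook: simulate a concurrent writer
        cur = conn.execute(
            "UPDATE fixed_ips SET host = ?, instance_id = ? "
            "WHERE id = ? AND host IS NULL AND instance_id IS NULL",
            (host, instance_id, row[0]))
        conn.commit()
        if cur.rowcount > 0:
            return row[0]  # our view of the record was still current
        # Zero rows affected: someone else changed the record, so re-SELECT.
    raise RuntimeError("could not assign an IP address")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fixed_ips ("
    "id INTEGER PRIMARY KEY, host INTEGER, instance_id INTEGER)")
conn.executemany("INSERT INTO fixed_ips (id) VALUES (?)", [(1,), (2,), (3,)])
conn.commit()

stolen = []

def rival(ip_id):
    # The first time through, another "thread" grabs the record we just read.
    if not stolen:
        conn.execute(
            "UPDATE fixed_ips SET host = 98, instance_id = 43 WHERE id = ?",
            (ip_id,))
        conn.commit()
        stolen.append(ip_id)

alice_ip = assign_fixed_ip(conn, host=99, instance_id=42, before_update=rival)
rival_owner = conn.execute(
    "SELECT instance_id FROM fixed_ips WHERE id = 1").fetchone()[0]
print(alice_ip, stolen, rival_owner)
```

The first pass loses record 1 to the rival and sees zero rows affected, so the loop simply selects again and takes the next free record. No exception handling is involved, only a check of the affected row count.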

This compare and swap mechanism is what I describe in the lock-free-quota-management Nova blueprint specification. There have been a number of mailing list threads and IRC conversations about this particular issue, so I figured I would write a bit and create some pretty graphics to illustrate the sequencing of events that occurs. Hope this has been helpful. Let me know if you have thoughts on the topic or see any errors in my work. Always happy for feedback.

[1] This is just for example purposes. Technically, such a CIDR would result in 13 available addresses in Nova, since addresses for the gateway, cloudpipe VPN, and broadcast addresses are reserved for use by Nova.

[2] We are not using Neutron in our example here, but the same general problem resides in Neutron’s IPAM code as is described in this post.

[3] Technically, there are trade-offs between pessimistic locking (which InnoDB uses locally) and optimistic locking (which Galera uses in its working-set certification). For an excellent read on the topic, check out Jay Janssen’s blog article on multi-node writing and deadlocks in Galera.

[4] If both threads happened to hit the same Galera Cluster node, then the last thread to execute the SELECT ... FOR UPDATE would end up waiting for the locks (in InnoDB) on that particular cluster node.

In a conversation on Twitter, Lydia Leong stated something that I’ve heard from a number of industry folks and OpenStack insiders alike:

The core needs to be small, rock-solid stable, and readily extensible.

I responded:

Depends on what you mean by “core” I think that term has been abused.

It’s probably worth writing down a response that spans more than 140 characters, so I decided to write a post about why the term “core” is, indeed, abused, and some of my thoughts about Lydia’s pronouncement.

First, on specificity

In my years working in the cloud space, it has dawned on me that there really is no single way of looking at the cloud development and deployment space. As soon as one person (myself included) tries to describe with any detail what cloud systems are, invariably someone else will say “no, it’s not (just) that… it’s this as well.”

For instance, if I say that cloud is on-demand computing that gives application developers tools to drive their own deployment, someone will correctly point out that cloud is also hardware that has been virtualized to fit budgetary and technological needs of an IT department.

Similarly, if I say that cloud is all about treating VMs as cattle, someone will rightly come along and say that legacy applications and “pet VMs” have as much of a right to benefit from virtualized infrastructure as those hipster-scale folks.

One man’s young dame is another’s old woman. (http://bit.ly/to-each-his-own)

And, then John Dickinson will appropriately say, “hey, Jay, cloud isn’t all about compute, you know.” And of course, he’d be totally correct.

My point is that, well, cloud means something different to everyone. And there’s really nothing wrong with that. It just means that when you express an idea about the cloud space, you should qualify exactly what it is you are applying that idea to, and conversely, what you are not applying your idea to.

And, of course, Twitter, being limited in its conversational envelope size, is hardly an ideal medium to express grand ideas about a space such as “the cloud” that already suffers from a dearth of crisp definition.

On golf balls, layercake, and taxonomy

Forrester, Gartner and other companies obsessively attempt to categorize and rank companies and products in ways that they feel are helpful to their CIO/tech buyer audience. And that’s fine; everybody’s got to make a living out here.

But lately, it seems the OpenStack developer community has gotten gung-ho about categorizing various OpenStack projects. I want to summarize here some of the existing thoughts on the matter, before delving into my own personal opinions.

Late last year, Dean Troyer originally posted his ideas about categorizing projects within the OpenStack arena using a set of layers. His ideas were centered around finding a way to describe new projects in a technical (as opposed to trademark or political) sense, in order to more easily identify where the project fit in relation to other projects. The impetus for Dean’s “OpenStack Layers” approach was his work in DevStack, in trying to detect the boundaries and dependencies between components that DevStack configures.

Sean Dague more recently expanded on Dean’s ideas, attempting to further clarify where newer (since Dean’s post) incubated and integrated projects lie in the OpenStack layers. Monty Taylor and Robert Collins followed up Sean’s post with posts of their own, each attempting to further provide ways in which OpenStack projects may be grouped together.

Sean’s model extended Dean’s, adding in a couple of the projects like Barbican that had not really been around at the time of Dean’s post, and calling the turtle layer “Consumption Services”:

Monty’s model grouped OpenStack projects in yet a different manner:

Layer #1 The Only Layer — Any project that is needed to spin up a VM running WordPress on it. His list here is Keystone, Glance, Nova, Cinder, Neutron and Designate. Monty believes this is where the layer model falls apart, and all other terms should be simple “tags” that describe some aspect of a project

Tag: Cloud Native — Any project “that provide(s) features for end user applications which could be provided by services run within VMs instead.” Examples here are Trove and Swift. Trove provides managed databases in VMs instead of databases running on bare metal. Swift provides object storage instead of the user having to run their own bare-metal machines with lots of disk space.

Tag: Operations — Any project that bridges functional gaps between various components for operators of OpenStack clouds. Examples here include Ceilometer and Ironic.

Tag: User Interface — Any project that enhances the user interface to other OpenStack services. Examples here are Horizon, Heat, and the openstack-sdk

http://bit.ly/flaming-golf-ball

Thierry Carrez redefined Monty’s “Layer #1” as “Ring 0”. His depiction of OpenStack is like the construction of a golf ball, with a small rubber core and a larger plastic covering[1], rather than the layercake of Sean and Dean. In Thierry’s model, the OpenStack projects that would live in “Ring 0” would be the projects that would need to be “tightly integrated” and “limited-by-design”. All other projects would just live outside this Ring 0, but still be “in OpenStack”. Thierry also brings up the question of what to do about the concept of Programs, which to this date, are the official way in which the OpenStack community blesses teams of people working towards common missions. At least, that’s what Programs are supposed to be. In practice, they tend to be viewed as equal to the main project that the Program began life as.

Finally, Robert Collins’ take on the layers discussion brought up a couple interesting points about the importance of APIs vs implementation (which I will cover shortly) and that the “core” of OpenStack really is smaller than Monty envisioned, but Robert ended up with a taxonomy of OpenStack projects that was broken up by functional categories. There would be a set of teams that would select which projects implemented an API that belonged to one of the following functional categories:

IaaS product: selects components from the tent to make OpenStack/IaaS

PaaS product: selects components from the tent to make OpenStack/PaaS

CaaS product: (containers)

SaaS product: (storage)

NaaS product: (networking – but things like NFV, not the basic Neutron we love today). Things where the thing you get is useful in its own right, not just as plumbing for a VM.

So why do we insist on categorizing these projects?

All of the above OpenStack leaders try to get at the fundamental question of what is the core of OpenStack? But, really, what is this obsession we have with shoving OpenStack projects into these broad categories? Why do we need to constantly define what this core thing is? And what does Lydia refer to when she says that the “core needs to be small, rock-solid stable, and readily extensible”?

I am going to posit that labeling some set of projects “the OpenStack Core” is actually not useful to any of our users[2], and that replacing overloaded terms such as “core” or “integrated” or “incubated” with a set of informative tags for each project will lead to a more enlightened OpenStack user community.

Monty started his blog post with a discussion about the different types of users that we serve in the OpenStack community. I think when we answer questions about OpenStack projects, we always need to keep in mind what information is actually useful to these different user groups, and why that information is useful. When we understand the characteristics that make a certain piece of information useful to a group of users, we can emphasize those characteristics in the language we use to describe OpenStack.

Operators

Let’s take the group of OpenStack users that Monty calls “deployers”, that I like to call “operators”. These are folks that have deployed OpenStack and are running an OpenStack cloud for one or more users. What kinds of information do these users crave? I can think of a number of questions that this user group frequently asks:

Should I deploy OpenStack using an OpenStack distribution like RDO, or should I deploy OpenStack myself using something like DevStack or maybe the Chef cookbooks on Stackforge?

Is the Icehouse version of Nova more stable than Havana?

Does Ceilometer’s SQL driver handle 2000 or more VMs?

What notification queues can my Nagios NRPE plugin monitor to get an early warning sign that something is degraded?

Can my old Grizzly nova-network deployment be upgraded to Icehouse Neutron with no downtime to VM networking?

What is the best way to diagnose data plane network connectivity issues with Havana Neutron deployments that use OpenVSwitch 1.11 and the older OpenVSwitch agent with ML2?

Is Heat capable of supporting 100 concurrent users?

For operators, questions about OpenStack projects generally revolve around stability, performance & scalability and deployment & diagnostics. They want to know practical information on their options with regards to how they can deploy OpenStack, keep it up and running smoothly, and maintain it over time.

What does defining a set of “core OpenStack projects” give the operator user group? Nothing.

What might we be able to tag OpenStack projects with that would be of use to operators? Well, I think the answer to this question comes in the form of answers to the questions that these operators frequently ask. How’s this for a set of tags that might be useful to operators?

included-in-$distribution-$version: Indicates a project has been packaged for inclusion in some OpenStack distribution. Examples: included-in-rdo-icehouse, included-in-uca-trusty, included-in-mos-5.1

stability-$rating: Indicates the operator community’s viewpoint on the stability of a project. Examples: stability-experimental, stability-improved, stability-mature

driver-$driver-experimental: Indicates the developers of driver $driver consider the code to be experimental. Examples: driver-sql-experimental, driver-docker-experimental

puppet-tested: Indicates that Puppet modules exist in the openstack/ code namespace that are functionally tested to install and configure the service. Similar tags could exist for chef-tested or ansible-tested, etc

operator-docs[-$topic]: Indicates that there is Operator-specific documentation for the project, optionally with a $topic suffix. Examples: operator-docs-nagios, operator-docs-monitoring

rally-verified-sla: Indicates the project has one or more Rally SLA definitions included in its gate testing platform

upgrade-from-$version-(in-place|downtime-needed): Indicates a project can be upgraded with or without downtime from some previous $version. Examples: upgrade-from-icehouse-in-place, upgrade-from-juno-downtime-needed

I personally feel all of the above tags contain more useful information for operators than having a set of “core OpenStack projects”.

Application developers (end users and DevOps folks)

Monty calls this group of users “end users”. The end users of clouds are application developers and DevOps people who support the operation of an application on a cloud environment. Again, let’s take a look at what this group of users typically cares about, in the form of frequent questions that this user group poses:

cloud-native: Indicates the project implements a service designed for applications built for the cloud

user-docs[-$topic]: Indicates there is documentation specifically for application developers around this project, optionally for a smaller $topic. Examples: user-docs-nagios, user-docs-fog, user-docs-wordpress

compare-to-$subject: Indicates that the service provides similar functionality to $subject. Examples: compare-to-s3, compare-to-cloud-foundry

compat-with-$api: Indicates the project exposes functionality that allows $api to be used to perform some native action. Examples: compat-with-ec2-2013.10.9, compat-with-glance-v2. Optionally consider a -with-docs suffix to indicate documentation on the compatibility exists

You will probably note that many of the tags for application developers involve the existence of documentation about compatibility with, or comparison to, other APIs or services. This is because docs really matter to application developers. API docs, tutorials, usage examples, SDK documentation. All of this stuff is critical to driving strong adoption of OpenStack with cloud app developers. The move to a documentation focus can and should start with a simple set of user-focused tags that inform our application developer users about the existence of official documentation about the things they care about.

Packagers

The next group of OpenStack users are the packagers of the OpenStack projects. Monty calls this group of users the “distributors”. Folks who work on operating system packages of OpenStack projects and libraries, folks who work on the OpenStack distributions themselves, and folks who work on configuration management tool modules that install and configure an OpenStack project are all members of this user group. This group of users cares about a number of things, such as:

Does the Juno Nova release have support for the version of libvirt on Ubuntu Trusty Tahr?

Does version X of the python-novaclient support version Y of the Nova REST API?

Can Havana Nova speak with an Icehouse Glance server?

Which Keystone token driver should I enable by default?

Have database migrations for Icehouse Neutron been tested with PostgreSQL 9.3?

Has the qpid message queue driver been tested with Juno Ceilometer?

What does defining a set of “core OpenStack projects” give the packager user group? Nothing.

You’ll notice that many of the concerns that packagers have revolve around two things: documentation around version dependencies and testing of various optional drivers or settings. What does a designation that a project is “in core OpenStack” answer for the packager? Really, nothing at all. A finer-grained source of information is what is desired. A set of tags would be much more useful:

gated-with-$thing[-$version]: Indicates that patches to the project are gated on successful functional integration testing with $thing at an optional $version. Examples: gated-with-neutron, gated-with-postgresql, gated-with-glanceclient-1.1

tested-with-$thing[-$version]: Indicates that functional integration tests are run post-merge against the project. Examples: tested-with-ceph, tested-with-postgresql-9.3

driver-$driver-(recommended|default|gated|tested): Indicates a $driver is recommended for use, is the default, or is tested with some gate or post-merge integration tests. Examples: driver-sql-recommended, driver-mongodb-gated
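
To show how tags of this shape could be consumed by tooling, here is a minimal sketch of a parser for the three tag forms listed above. The `Tag` type and `parse_tag` function are hypothetical names invented for this illustration, and the rule that a version is a trailing segment starting with a digit is an assumption, not part of any agreed tag specification.

```python
import re
from typing import NamedTuple, Optional


class Tag(NamedTuple):
    kind: str                 # "gated-with", "tested-with" or "driver"
    subject: str              # the $thing or the $driver
    qualifier: Optional[str]  # version for *-with tags, status for driver tags


# Assumption: a version is an optional trailing "-<segment>" starting with a digit.
_WITH_RE = re.compile(
    r"(?P<kind>gated-with|tested-with)-(?P<thing>.+?)(?:-(?P<version>\d[\w.]*))?"
)
_DRIVER_RE = re.compile(
    r"driver-(?P<driver>.+)-(?P<status>recommended|default|gated|tested)"
)


def parse_tag(tag: str) -> Tag:
    """Split a tag string into its kind, subject and optional qualifier."""
    match = _WITH_RE.fullmatch(tag)
    if match:
        return Tag(match["kind"], match["thing"], match["version"])
    match = _DRIVER_RE.fullmatch(tag)
    if match:
        return Tag("driver", match["driver"], match["status"])
    raise ValueError(f"unrecognized tag: {tag!r}")
```

A packager could then ask questions such as "which projects carry a tested-with tag whose subject is postgresql?" by filtering on the parsed fields rather than matching strings by hand.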

OpenStack Developers

OK, so the final group of OpenStack users is the developers of the OpenStack projects themselves. This is the group that does the most squealing about categorizing projects in and out of the OpenStack tent. We’ve created a governance structure and organizational model that, as Zane Bitter said, is like the reverse of Conway’s law. We tend to box ourselves in because we focus so intently on creating these categories by which we group the OpenStack projects.

We created the terms “incubated” and “integrated” to ostensibly inform ourselves of which projects have aligned with the OpenStack governance model and infrastructure tooling processes. And then we created the gate as a model of those categories, without first asking ourselves whether the terms “incubated” and “integrated” were really serving a technically sound purpose. We group all the integrated and incubated projects together, saying that all these projects must integrate with each other in one giant, complex gate testing platform.

Zane thinks that is madness, and I tend to agree.

Personally, I feel we need to stop thinking in terms of “core” or “incubated” or “integrated” and instead stick to thinking about the projects that live in the OpenStack tent[3] in terms of the soft and hard dependencies that each project has to another. In graphical representation, the current set of OpenStack projects (that aren’t software libraries) might look something like this:

The YAML above is nothing more than a set of tags that could be applied to OpenStack projects to inform developers about their relative dependency on other projects. This dependency graph information can then be used to construct a testing platform that should be more efficient than the current system that looks the way it does because we’ve insisted on using this single “integrated” category to describe the relationship of an OpenStack project with another.
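
As a rough sketch of how dependency tags might drive test selection, the snippet below walks a small dependency mapping to find which projects would need their tests run when one project changes. The mapping and the `dependents_of` helper are purely illustrative assumptions; the edges shown are not the real dependency graph and the function is not part of any OpenStack tool.

```python
from collections import defaultdict, deque

# Illustrative dependency tags only: project -> projects it depends on.
DEPENDENCIES = {
    "keystone": {"hard": [], "soft": []},
    "swift": {"hard": [], "soft": ["keystone"]},
    "glance": {"hard": ["keystone"], "soft": ["swift"]},
    "neutron": {"hard": ["keystone"], "soft": []},
    "nova": {"hard": ["keystone", "glance"], "soft": ["neutron"]},
}


def dependents_of(project, deps, kinds=("hard",)):
    """Return every project that depends, directly or transitively, on `project`."""
    # Invert the graph: dependency -> set of projects that declare it.
    reverse = defaultdict(set)
    for name, tags in deps.items():
        for kind in kinds:
            for target in tags.get(kind, []):
                reverse[target].add(name)

    # Breadth-first walk over the inverted edges.
    seen = set()
    queue = deque([project])
    while queue:
        current = queue.popleft()
        for dependent in reverse[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    seen.discard(project)
    return seen
```

Because the edges only point one way, a patch to nova in this toy graph triggers no tests in keystone or glance, while a patch to keystone triggers tests in everything that hard-depends on it.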

So, my answer to Lydia

Seems I’ve yet again exceeded my 140 character limit. And I haven’t addressed Lydia’s point about the “core needs to be small, rock-solid stable, and readily extensible.” I think you can probably tell by now that I don’t think there is a useful way of denoting “a core OpenStack”. At least, not useful to our users. Does this mean that I disagree with Lydia’s sentiment about what is important for the OpenStack contributor community to focus on? Absolutely not.

So, in the spirit of specificity and useful taxonomic tagging, I will go further and make the following statements about the OpenStack community and code using Lydia’s words.

Our public REST APIs need to be designed to solve small problems

Right now, the public REST APIs of a number of OpenStack projects suffer from what I like to call “extension-itis”. Some of the REST APIs honestly feel like they were stitched together by a group of kindergartners, each trying to purposefully do something different than the last kid who attached their little piece of the quilt. APIs need to be small, simple, and clear. Why? Because they are our first impression, and OpenStack will live and die by the grace of the application developers that use our APIs.

We should have a working group composed of members of the Technical Committee, interested operators and end users, and API standards experts that controls the public face of OpenStack: its HTTP REST APIs. This working group should have teeth, and be able to enforce a set of rules across OpenStack projects about API consistency, resource naming, clarity, discoverability and documentation.

Our deployment tools need to be as rock-solid stable as any other part of our code

I’m not just talking about the official OpenStack Deployment Program here (Triple-O). I’m talking about the combination of Chef cookbooks, Puppet modules, Ansible playbooks, Salt state files, DevStack scripts, Mirantis FUEL deployment tooling, and the myriad other things that deploy and configure OpenStack services. They need to be tested in our continuous integration platform with the same zeal as everything else. As a community, we need to dedicate resources to make this a reality.

Our governance model needs to be readily extensible

Unfortunately, I believe our governance model and organizational structure has a tendency to reinforce the status quo and not adapt to changing times as quickly as it should. Examples of this rigidity are plenty. See, for example, the inability of the Technical Committee to come up with a single set of criteria for Zaqar’s graduation from incubation status. As a member of the Technical Committee, I can say I was pretty embarrassed at the way we treated the Zaqar contributor community. I’d love for the role of the Technical Committee to change from one of Judge to one of Advisor. The current Supreme OpenStack Court vision just reinforces the impression that our community cannot find a meaningful balance between progress and stability, and I for one refuse to admit that such a balance cannot be reached.

The current organization of official OpenStack Programs is something I also believe is no longer useful, and discourages our community from extending itself in the ways we should be extending. For example, we should be encouraging the sort of competition and pluralism that having multiple overlapping services like Ceilometer, Monasca, and Stacktach all under the OpenStack tent would bring.

There is no core

Finally, in case it’s not obvious, I don’t believe there is any OpenStack core. Or at least, that there’s any use in spending the effort required to concoct what it might be.

OK, I think that’s just about enough for one night of writing. Thanks very much if you made it all the way through. I look forward to your comments.

[1]OK, yes, golf balls have more than just a core and a larger plastic covering…just go with me on this and stop being so argumentative.

[2]Denoting some subset of OpenStack projects as “core” may be useful to industry pundits and marketing folks, but those aren’t OpenStack users.

Underneath these discussions, however, is an even more pivotal question that remains to be answered. Chris Dent put it well in his recent response on the Zaqar thread:

http://bit.ly/big-top-tent

Which leads inevitably to the existential questions of What is OpenStack in the business of? and How do we stay sane while being in that business?

I have a few thoughts on these questions. A definitive opinion, if you will. Please permit me to wax philosophical for a bit. To be fair, my thoughts on this topic have changed pretty dramatically over the last few years, as my role in the OpenStack world has evolved from developer to operator/deployer to developer again, and from being on the Project Policy Board (remember that?), to being on the Technical Committee, off the TC, and then on it again.

What is OpenStack in the business of?

I believe OpenStack is in the business of providing an extensible cloud platform built on open standards and collaboration.

In other words, we should be a Big Tent. Our policies and our governing structure should all work to support people that collaborate with each other to enhance the experience of users of this extensible cloud platform.

What would being a Big Tent mean for the OpenStack community?

To me, being a Big Tent means welcoming projects and communities that wish to:

“speak” the OpenStack APIs

extend OpenStack’s reach, breadth, and interplay with other cloud systems

enhance the OpenStack user experience

Note that I am deliberately using the generic term “user” in the last bullet point above. Users include application developers writing against one or more public OpenStack APIs, developers of the OpenStack projects, deployers of OpenStack, operators of OpenStack, and anyone else who interfaces with OpenStack in any way.

What would being under the OpenStack tent mean for a project?

What would “being under the tent” mean for a project and its community? I think there’s a pretty simple answer to this as well.

A project included in the OpenStack Big Tent would mean that:

The project is recognized as being a piece of code that enhances the experience of OpenStack users

The project’s source code lives in the openstack/ code namespace

Clearly, I’m talking here only about projects that wish to be included in the OpenStack tent. I’m not suggesting that the OpenStack community go out and start annexing parts of other communities and bringing their source code into the OpenStack tent willy-nilly!

…and what would being under the tent not mean for a project?

Likewise, being under the OpenStack Big Tent should not mean any of the following:

That the project is the only way to provide a specific function to OpenStack users

That the project is the best way to provide a specific function to OpenStack users

That the project receives some allotted portion of shared technical or management resources

Note that the first two bullet points above go towards my opinion that the OpenStack community should embrace competition within its ecosystem, as well as competition in the external community, by identifying ways to increase interoperability. Specifically, I don’t have a problem with having projects under the OpenStack tent that share some common functionality or purpose.

What requirements must projects meet?

Finally, there is the question of what requirements a project that wishes to be under the OpenStack tent must meet. I think minimal is better here, as it lowers the barriers to entry for interested parties. I think the below items keep the bar high enough to ensure that applicants aren’t just trying to self-promote without contributing to the common good of OpenStack.

Source code is licensed under the Apache License or a compatible FLOSS license and code is submitted according to a Developer Certificate of Origin

There should be liaisons for release management and cross-project coordination

There should be documentation that clearly describes the function of the project, its intended audience, any public API it exposes, and how contribution to the project is handled

And that’s it. There would be no further requirements for a project to exist under the OpenStack tent.

Predicting a few questions

I’m going to go out on a limb and try to predict some questions that might be thrown my way about my answer to the existential question of OpenStack. Not only do I believe this is a good exercise for anyone who plans to defend their thoughts in the court of public opinion, but I think that having concrete answers to these questions will help some folks recognize how this change in overall OpenStack direction would affect specific decisions and policies that have come to light in recent months.

What about the gate? How will the gate possibly keep up with all these projects?

The gate is the test platform that guards the branches of the OpenStack projects that are in the Integrated OpenStack release against bugs that result from merging flawed code in a dependent project. This test platform consists of hundreds of virtual machines that run a set of integration tests against OpenStack environments created using the Devstack tool.

http://bit.ly/pug-on-gate

The gate is also composed of a group of talented humans who have each demonstrated heroic characteristics in the past four years during periods in which the gate or its underlying components got “wedged” and needed to be “unstuck.”

These talented engineers, known collectively as the OpenStack Infra team, are a resource that is shared among projects in the OpenStack community. Currently, “in the OpenStack community” means the set of code that lives in the openstack/ namespace AND the stackforge/ namespace. However, while the OpenStack Infra team is shared among all projects in the OpenStack community, the gate platform is NOT shared by all projects in the community. Instead, the gate platform (the integrated queue) is only available to the projects in the OpenStack community that are in the incubated and integrated project statuses.

Now, it is 100% true that given the current structure of our gate test platform, for each additional project that is included in the current OpenStack Integrated Release, there is an exponential impact on the number of test machines and the amount of test code that will be run in the gate platform. I will be the first to admit that.

However, I believe the gate test platform does not need to impose this limitation on our development and test processes. I believe the gate, in its current form, produces the kind of frustration that it does because of the following reasons:

We don’t use fail-fast policies in the gate

We don’t use hierarchical test construction properly in the gate

We assume bi-directional test dependence when all dependencies are uni-directional

We run tests against code that a patch does not touch

We only have one gate

Expectations of developers trying to merge code into a tree are improperly set

By rethinking what the gate test platform is doing, and especially rethinking the policies that the gate enforces, I think we can test more projects, more effectively. I’m not going to go into minute detail for each of the above items, as I will do that in a followup post. The point here is to step back a bit and see if the policies we’ve given ourselves might just be the cause of our woes, and not the solution we think they might be.

Would there continue to be an incubation process or status?

Short answer: No.

http://bit.ly/incubate-my-chicken

The status of incubation within OpenStack has become the end goal for many projects over the last 3 years. The reason? Like it or not, the reason is that being incubated means the project gets to live in the openstack/ code namespace. And being in the openstack/ code namespace brings an air of official-ness to a project. It’s not rational, and it’s not sensible, but the reality is that many companies will simply not dedicate resources (money and people, which cost money) to the development of a project that does not live in the openstack/ code namespace. If our goal, as a community, is to increase the adoption of projects that enhance the user experience of OpenStack, then the incubation status, and the barrier to inclusion that comes with it, is not helping that goal.

The original aims of the incubation process and status were to push projects that wished to be called “OpenStack” to adopt a consistent release cadence, development workflow and code review system. It was assumed that by aligning these things with other projects under the OpenStack tent, these projects would become better integrated with each other, by nature of being under the same communal constraints and having to coordinate their development through a shared release manager. These are laudable goals, but in practice, incubation has become more of a status symbol and political campaign process than something that leads to increased integration with other OpenStack projects.

Do we still want to encourage projects that live under the OpenStack tent to work with each other where appropriate? Absolutely yes. However, I don’t think the existing incubation process, which features a graduation review (or three, or four) in front of the Technical Committee, is something that is useful to OpenStack users.

Application developers want to work with solid OpenStack APIs and easy-to-use command line and library interfaces. The incubation process doesn’t address these things specifically.

Deployers of an OpenStack cloud platform want to deploy various OpenStack projects using packages or source repositories (and sometimes both). The process of incubation doesn’t address the stability or cohesiveness of operating system packages for the OpenStack project. This is something that would be better handled with a working group composed of folks from the OpenStack and operating system distributions, not by a graduation review before the Technical Committee.

Operators of OpenStack cloud platforms want the ability to easily operate, maintain, and upgrade the OpenStack projects that they choose to utilize. Much of the operator tooling and utilities live outside of the openstack/ code namespace, either in the stackforge/ code namespace or in random repositories on GitHub. By not having OpenStack-specific operator tooling in the openstack/ code namespace, we de-legitimize these tools and make the business decision to use them harder. The incubation process, along with the Technical Committee “come before the court” bottleneck, doesn’t enable these worthy tools and utilities to be used as effectively as they could be, which ultimately means less value for our operator users.

Won’t the quality of OpenStack suffer if we let too many projects in?

No, I don’t believe it will. Or at least, I don’t believe that the quality of OpenStack as a whole is a function of how many projects we let live in the openstack/ code namespace. The quality of OpenStack projects should be judged separately, not as a whole. It’s neither fair to older, mature projects to group them with newcomers, nor is it reasonable to hold new projects to the standards of stability and quality that 5+ year old software has evolved to.

I think instead of having an OpenStack Integrated Release, we should instead have tags that describe the stability and quality of a particular OpenStack project, along with much more granular integration information.

In other words, as a deployer, I couldn’t care less whether Ceilometer is “in the integrated release.”

What I do care about, however, is whether the package of Neutron I’m installing emits notifications in a format that the package of Ceilometer I’m installing understands and tracks.

As a deployer, I don’t care at all that Swift is “in the integrated release.”

What I do care about is whether my installed Glance package is able to store images in and stream images from a Swift deployment I installed last May.

As a cloud user, I don’t care at all that Nova is “in the integrated release.”

What I do care about is whether the version of Nova installed at my cloud provider will work with the version of python-novaclient I happened to install on my laptop.

So, instead of going through the arduous and less-than-useful process of graduation and incubation, I would prefer we spend our limited resources working on very specific documentation that clearly tells OpenStack users about the tested integration points between our OpenStack client, server and library/dependency projects.

Would Zaqar (formerly Marconi) be in the integrated OpenStack release under this scheme?

No. I don’t believe we need an integrated release. Yes, Zaqar would be under the OpenStack tent, but no, there would be no integrated release.

Would Stacktach, Monasca and Ceilometer all live in the openstack/ code namespace?

If that is something the Stacktach and Monasca communities would like, and once they meet the requirements for inclusion under the OpenStack tent, then yes. And I think that’s perfectly fine. Having Stacktach under the same tent does not mean Ceilometer isn’t a viable option for OpenStack users. Nor does it mean that Stacktach is the best solution in the telemetry space for OpenStack users.

Our decisions should be based on what is best for our OpenStack users. And I think giving our users choices in the telemetry space is what is best for our users. What works for one user may not be the best choice for another user. But, if we focus on the documentation of our projects in the OpenStack tent (see the requirements for projects to be under the tent, above), we can let these competing visions co-exist peacefully while having our users’ best interests in our hearts.

Would Stacktach and Ceilometer both be called “OpenStack Telemetry” then?

No. I don’t believe the concept of Programs is useful. I’d like to get rid of them altogether and replace the current programs taxonomy with a looser tag-based system. Monty Taylor, in this blog post, has some good ideas on that particular topic, including using a specific tag for “production readiness”.

Would the Technical Committee still decide which projects are in OpenStack?

No. I believe the Technical Committee should no longer be this weird court before which prospective projects must plead their case for inclusion into OpenStack. The requirements for a project to be included in the OpenStack tent should be objective and finite, and anyone on the new project-config-core team should be able to vet applicants based on the simple list of requirements.

How would this affect whether some commercial entity can call its product OpenStack?

This would not affect trademark, copyright, or branding policies in any way. The DefCore effort, which I have pretty much steered as clear of as possible, is free to concoct whatever legal and foundation policies it sees fit with regard to what combinations of code and implementation can be called “powered by OpenStack”, “runs on OpenStack”, or whatever else business folks think gives them a competitive edge. I’ll follow up on the business/trademark side of OpenStack in a separate post, but I want to make it clear that my proposed broadening of the OpenStack tent has no relation to the business side of OpenStack. This is strictly a community approach.

What would happen to the stackforge/ code namespace?

Nothing. Projects that wish to continue to be in the OpenStack ecosystem but do not meet the requirements for being under the OpenStack tent could still live in the stackforge/ code namespace, same as they do today.

What benefits does this approach bring?

The benefits of this approach are as follows:

Clarity: due to the simple and objective requirements for inclusion into the OpenStack tent, there will be a clear way to make decisions on what should “be OpenStack”. No more ongoing questions about whether a project is “infrastructure-y enough” or aggravation about projects’ missions or functionality overlapping in certain areas.

Specificity: no more vague reference to an “integrated release” that frankly doesn’t convey the information that OpenStack users are really after. Use quality and integration tags to specify which projects integrate with which versions of other projects, and focus on quality, accurate documentation about integration points instead of the incubated and integrated project status symbols and graduation process.

Remove the issue of finite shared resources from the decision-making process: Being “under the OpenStack tent” does not mean that projects under the tent have any special access to or allocation of shared resources such as the OpenStack infrastructure team. Decisions about whether a project should be “in OpenStack” therefore need not be made based on resource constraints or conditions at a specific point in time. This means fewer decisions that look arbitrary to onlookers.

Convey an inclusive attitude: Clearer, simpler, more objective rules for inclusion into the OpenStack tent means that the OpenStack community will present an inclusive attitude. This is the opposite of our community’s reputation today, which is one where we are viewed as having meritocracy at a micro-level (within the contributor communities in a project), but a hegemony at a macro-level, with a Cabal of insiders judging whether programs meet subjective determinations of what is “cloudy enough” or “OpenStack enough”. The more inclusive we are as a community, the more we will attract the application developers to the OpenStack community, and it’s on the backs of application developers that OpenStack will live or die in the long run.

In this third article in the series, we discuss adding one or more Jenkins slave nodes to the external OpenStack testing platform that you (hopefully) set up in the second article in the series. The Jenkins slave nodes we create today will run Devstack and execute a set of Tempest integration tests against that Devstack environment.

Add a Credentials Record on the Jenkins Master

Before we can add a new slave node record on the Jenkins master, we need to create a set of credentials for the master to use when communicating with the slave nodes. Head over to the Jenkins web UI, which by default will be located at http://$MASTER_IP:8080/. Once there, follow these steps:

Click the Credentials link on the left side panel

Click the link for the Global domain:

Click the Add credentials link

Select SSH username with private key from the dropdown labeled “Kind”

Enter “jenkins” in the Username textbox

Select the “From a file on Jenkins master” radio button and enter /var/lib/jenkins/.ssh/id_rsa in the File textbox:

Click the OK button

Construct a Jenkins Slave Node

We will now install Puppet and the software necessary for running Devstack and Jenkins slave agents on a node.

Slave Node Requirements

On the host or virtual machine that you have selected to use as your Jenkins slave node, you will need to ensure, like the Jenkins master node, that the node has the following:

These basic packages are installed:

wget

openssl

ssl-cert

ca-certificates

Have the SSH keys you use with GitHub in ~/.ssh/. It also helps to bring over your ~/.ssh/known_hosts and ~/.ssh/config files as well.

Have at least 40G of available disk space
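
If you would like to sanity-check a node against part of this list before going further, a small preflight script along the following lines may help. It is only a sketch: the helper names are made up for this example, “40G” is read as 40 GiB, and only the wget and openssl commands are checked (ssl-cert and ca-certificates are packages rather than single commands, so they are left to your package manager).

```python
import shutil

# Assumptions for this sketch: only commands with an obvious binary are checked,
# and "40G" is interpreted as 40 GiB.
REQUIRED_COMMANDS = ("wget", "openssl")
MIN_FREE_BYTES = 40 * 1024 ** 3


def missing_commands(commands=REQUIRED_COMMANDS):
    """Return the commands from `commands` that are not found on PATH."""
    return [name for name in commands if shutil.which(name) is None]


def has_enough_disk(path="/", min_free=MIN_FREE_BYTES):
    """Return True if the filesystem holding `path` has at least `min_free` bytes free."""
    return shutil.disk_usage(path).free >= min_free
```

Calling `missing_commands()` and `has_enough_disk()` on the prospective slave node gives a quick yes/no answer before you start the much longer installation steps.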

IMPORTANT NOTE: If you were considering using LXC containers for your Jenkins slave nodes (as I originally struggled to do), don’t. Use a KVM or other non-shared-kernel virtual machine for the devstack-running Jenkins slaves. Bugs like the inability to run open-iscsi in an LXC container make it impossible to run devstack inside an LXC container.

Download Your Config Data Repository

In the second article in this series, we went over the need for a data repository and, if you followed along in that article, you created a Git repository and stored an SSH key pair in that repository for Jenkins to use. Let’s get that data repository onto the slave node:

Which indicates that Puppet is done and that a set of Nodepool scripts is running to cache upstream OpenStack Git repositories on the node and prepare Devstack. Part of the process of preparing Devstack involves downloading images that are used by Devstack for testing. Note that this step takes a long time! Go have a beer or other beverage and work on something else for a couple hours.

Adding a Slave Node on the Jenkins Master

In order to “register” our slave node with the Jenkins master, we need to create a new node record on the master. First, go to the Jenkins web UI, and then follow these steps:

Click the Manage Jenkins link on the left

Scroll down and click the Manage Nodes link

Click the New Node link on the left:

Enter “devstack_slave1” in the Node name textbox

Select the Dumb Slave radio button:

Click the OK button

Enter 2 in the Executors textbox

Enter “/home/jenkins/workspaces” in the Remote FS root textbox

Enter “devstack_slave” in the Labels textbox

Enter the IP Address of your slave host or VM in the Host textbox

Select jenkins from the Credentials dropdown:

Click the Save button

Click the Log link on the left. The log should show the master connecting to the slave, and at the end of the log should be: “Slave successfully connected and online”:

Test the dsvm-tempest-full Jenkins job

Now we are ready to have our Jenkins slave execute the long-running Jenkins job that uses Devstack to install an OpenStack environment on the Jenkins slave node, and run a set of Tempest tests against that environment. We want to test that the master can successfully run this long-running job before we set the job to be triggered by the upstream Gerrit event stream.

Go to the Jenkins web UI, click on the dsvm-tempest-full link in the jobs listing, and then click the Build Now link. You will notice an executor start up and a link to a newly-running job will appear in the Build History box on the left:

Build History panel in Jenkins

Click on the link to the new job, then click Console Output in the left panel. You should see the job executing, with Bash output showing up on the right:

Manually running the dsvm-tempest-full Jenkins job

Troubleshooting

If you see errors pop up, you will need to address those issues. In my testing, issues generally were around:

Firewall/networking issues: Make sure that the Jenkins master node can properly communicate over SSH port 22 to the slave nodes. If you are using virtual machines to run the master or slave nodes, make sure you don’t have any iptables rules that are preventing traffic from master to slave.

Missing files like “No file found: /opt/nodepool-scripts/…”: Make sure that the install_slave.sh Bash script completed successfully. This script takes a long time to execute, as it pulls down a bunch of images for Devstack caching.

LXC: See above about why you cannot currently use LXC containers for Jenkins slaves that run Devstack

Zuul processes borked: In order to have jobs triggered from upstream, both the zuul-server and zuul-merger processes need to be running, connecting to Gearman, and firing job events properly. First, make sure the right processes are running:
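
Two of the checks in this list, SSH reachability and the presence of the Zuul processes, are easy to script. The helpers below are a minimal sketch and not part of Zuul or Jenkins: the function names are hypothetical, and the assumption is that the process command lines contain the strings zuul-server and zuul-merger.

```python
import socket
import subprocess


def can_reach(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def zuul_process_status(ps_output, expected=("zuul-server", "zuul-merger")):
    """Map each expected process name to whether it appears in the `ps` output."""
    lines = ps_output.splitlines()
    return {name: any(name in line for line in lines) for name in expected}


def current_process_list():
    """Return the command lines of all running processes, one per line."""
    result = subprocess.run(
        ["ps", "-eo", "args"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On the master node, `can_reach(slave_ip)` tells you whether port 22 on the slave is open from where Jenkins runs, and `zuul_process_status(current_process_list())` reports which of the two Zuul processes were found.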

Enabling the dsvm-tempest-full Job in the Zuul Pipelines

Once you’ve successfully run the dsvm-tempest-full job manually, you should now enable this job in the appropriate Zuul pipelines. To do so, on the Jenkins master node, you will want to edit the etc/zuul/layout.yaml file in your data repository (don’t forget to git commit your changes after you’ve made them and push the changes to your data repository’s canonical location).

If you used the example layout.yaml from my data repository and you’ve been following along this tutorial series, the projects section of your file will look like this:

projects:
  - name: openstack-dev/sandbox
    check:
      # Remove this after successfully verifying communication with upstream
      # and seeing a posted successful review.
      - noop-check-communication
      # Uncomment this job when you have a jenkins slave running and want to
      # test a full Tempest run within devstack.
      #- dsvm-tempest-full
    gate:
      # Remove this after successfully verifying communication with upstream
      # and seeing a posted successful review.
      - noop-check-communication
      # Uncomment this job when you have a jenkins slave running and want to
      # test a full Tempest run within devstack.
      #- dsvm-tempest-full

To enable the dsvm-tempest-full Jenkins job to run in the check pipeline when a patch is received (or recheck comment added) to the openstack-dev/sandbox project, simply uncomment the line:

#- dsvm-tempest-full

And then reload Zuul and Zuul-merger:

sudo service zuul reload
sudo service zuul-merger reload
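
If you would rather script the uncommenting step than edit layout.yaml by hand, a naive line-based helper like the one below could do it. This is a sketch under stated assumptions: `enable_job` is a hypothetical function, it does not parse YAML, and it relies on the file looking like the example above, with each pipeline key on its own line and the disabled job written exactly as `#- <job>`.

```python
def enable_job(layout_text, pipeline, job):
    """Uncomment '#- <job>' lines that appear under the given pipeline key."""
    output = []
    current_key = None
    for line in layout_text.splitlines():
        stripped = line.strip()
        # Track the most recent "key:" line that is not itself a comment.
        if stripped.endswith(":") and not stripped.startswith("#"):
            current_key = stripped[:-1]
        if current_key == pipeline and stripped == f"#- {job}":
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}- {job}"
        output.append(line)
    return "\n".join(output) + "\n"
```

Calling `enable_job(text, "check", "dsvm-tempest-full")` only touches the check pipeline, so the gate pipeline keeps its commented-out entry until you decide to enable it too.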

From now on, new patches and recheck comments on the openstack-dev/sandbox project will fire the dsvm-tempest-full Jenkins job on your devstack slave node. If your test run was successful, you will see something like this in your Jenkins console for the job run:

\o/ Steve Holt!

And you will note that the patch that triggered your Jenkins job will show a successful comment, and a +1 Verified vote:

A comment showing external job successful runs

What Next?

From here, the changes you make to your Jenkins Job configuration files are up to you. The first place to look for ideas is the devstack-vm-gate.sh script. Look near the bottom of that script for a number of environment variables that you can set in order to tinker with what the script will execute.

If you are a Cinder storage vendor looking to test your hardware and associated Cinder driver against OpenStack, you will want to either make changes to the example dsvm-tempest-full job or create a copy of that example job definition and customize it to your needs. You will want to make sure that Cinder is configured to use your storage driver in the cinder.conf file. You may want to create some script that copies most of what the devstack-vm-gate.sh script does, calls the devstack iniset function to configure your storage driver, and then runs devstack and tempest.

Publishing Console and Devstack Logs

Finally, you will want to get the log files that are collected by both Jenkins and the devstack run published to some external site. Folks at Arista have used dropbox.com to do this. I’ll leave it as an exercise for the reader to set this up. Hint: you will want to set the PUBLISH_HOST variable in your data repository’s vars.sh to a host that you have SCP rights to, and uncomment the publishers section in the example dsvm-tempest-full job:

Final Thoughts

I hope this three-part article series has been helpful for you to understand the upstream OpenStack continuous integration platform, and instructional in helping you set up your own external testing platform using Jenkins, Zuul, Jenkins Job Builder, and Devstack-Gate. Please do let me know if you run into issues. I will post some updates to the Troubleshooting section above when I hear from you and (hopefully) help you resolve any problems.

This post is intended to walk someone through the process of establishing an external testing platform that is linked with the upstream OpenStack continuous integration platform. If you haven’t already, please do read the first article in this series that discusses the upstream OpenStack CI platform in detail. At the end of the article, you should have all the background information on the tools needed to establish your own linked external testing platform.

EXTREMELY IMPORTANT NOTE: The upstream Puppet modules used in this article have changed dramatically since writing this. I am in the process of updating this blog entry, but at this time, some important steps do not work properly!

What Does an External Test Platform Do?

In short, an external testing platform enables third parties to run tests (ostensibly against an OpenStack environment that is configured with that third party’s drivers or hardware) and report the results of those tests on the code review of a proposed patch. It’s easy to see the benefit of this real-time feedback by taking a look at a code review that shows a number of these external platforms providing feedback. In this screenshot, you can see a number of Verified +1 labels and one Verified -1 label added by external Neutron vendor test platforms on a proposed patch to Neutron:

Each of these systems, when adding a Verified label to a review, does so by adding a comment to the review. These comments contain links to artifacts from the external testing system’s test run for this proposed patch, as shown here:

Comments added to a review by the vendor testing platforms

The developer submitting a patch can use those links to investigate why their patch has caused test failures to occur for that external test platform.

Why Set Up an External Test Platform?

The benefits of external testing integration with upstream code review are numerous:

A tight feedback loop

The third party gets quick notifications that a proposed patch to the upstream code has caused a failure in their driver or configuration. The tighter the “feedback loop”, the faster fixes can be identified.

Better code coverage

Drivers and plugins that may not be used in the default configuration for a project can be tested with the same rigor and frequency as drivers that are enabled in the upstream devstack VM gate tests. This prevents bitrot and encourages developers to maintain code that is housed in the main source trees.

Increased consistency and standards

Determining a standard set of tests that prove a driver implements the full or partial API of a project means that drivers can be verified to work with a particular release of OpenStack. If you’ve ever had a conversation with a potential deployer of OpenStack who wonders how they know that their choice of storage or networking vendor, or underlying hypervisor, actually works with the version of OpenStack they plan to deploy, then you know why this is a critical thing!

Why might you be thinking about how to set up an external testing platform? Well, a number of OpenStack projects have had discussions already about requirements for vendors to complete integration of their testing platforms with the upstream OpenStack CI platform. The Neutron developer community is ahead of the game, with more than half a dozen vendors already providing linked testing that appears on Neutron code reviews.

The Cinder project also has had discussions around enforcing a policy that any driver that is in the Cinder source tree have tests run on each commit to validate the driver is working properly. Similarly, the Nova community has discussed the same policy for hypervisor drivers in that project’s source tree. So, while this may be old news for some teams, hopefully this post will help vendors that are new to the OpenStack contribution world get integrated quickly and smoothly.

The Tools You Will Need

The components involved in building a simple linked external testing system that can listen to and notify the upstream OpenStack continuous integration platform are as follows:

Jenkins CI

The server that is responsible for executing jobs that run tests for a project

Zuul

A system that configures and manages event pipelines that launch Jenkins jobs

A collection of scripts that constructs an OpenStack environment from source checkouts

I’ll be covering how to install and configure the above components to build your own testing platform using a set of scripts and Puppet modules. Of course, there are a number of ways that you can install and configure any of these components. You can manually install each one by following the install instructions in the component’s documentation. However, I do not recommend that. The problem with manual installation and configuration is two-fold:

If something goes wrong, you have to re-install everything from scratch. If you haven’t backed up your configuration somewhere, you will have to re-configure everything from memory.

You cannot launch a new configuration or instance of your testing platform easily, since you have to manually set everything up again.

A better solution is to use a configuration management system, such as Puppet, Chef, Ansible or SaltStack to manage the deployment of these components, along with a Git repository to store configuration data. In this article, I will show you how to install an external testing system on multiple hosts or virtual machines using a set of Bash scripts and Puppet modules I have collected into a source repository on GitHub. If you don’t like Puppet or would just prefer to use a different configuration management tool, that’s totally fine. You can look at the Puppet modules in this repo for inspiration (and eventually I will write some Ansible scripts in the OpenStack External Testing project, too).

Preparation

Before I go into the installation instructions, you will need to take care of a few things. Follow these detailed steps and you should be all good.

Getting an Upstream Service Account

In order for your testing platform to post review comments to Gerrit code reviews on openstack.org, you will need to have a service account registered with the OpenStack Infra team. See this link for instructions on getting this account.

Don’t have an SSH key pair for your Gerrit service account? You can create one like so:

ssh-keygen -t rsa -b 1024 -N '' -f gerrit_key

The above will produce the key pair: a pair of files called gerrit_key and gerrit_key.pub. Copy the text of the gerrit_key.pub into the email you send to the OpenStack Infra mailing list. Keep both the files handy for use in the next step.

Create a Git Repository to Store Configuration Data

When we install our external testing platform, the Puppet modules are fed a set of configuration options and files that are specific to your environment, including the SSH private key for the Gerrit service account. You will need a place to store this private configuration data, and the ideal place is a Git repository, since additions and changes to this data will be tracked just like changes to source code.

I created a source repository on GitHub that you can use as an example. Instead of forking the repository, as you normally would, I recommend just git clone’ing the repository to some local directory, and making it your own data repository:

Now you’ve got your own data repository to store your private configuration data and you can put it up in some private location somewhere — perhaps in a private organization in GitHub, perhaps on a Git server you have somewhere.

Put Your Gerrit Service Account Private Key Into the Data Repository

The next thing you will want to do is add to the data repository the SSH key pair that you used in the step above when registering your upstream Gerrit service account.

If you created a new key pair using the ssh-keygen command above, copy the gerrit_key file into your data repository.

If you did not create a new key pair (you used an existing key pair) or you created a key pair that wasn’t called gerrit_key, simply copy that key pair into the data repository, then open up the file called vars.sh, and change the following line in it:

export UPSTREAM_GERRIT_SSH_KEY_PATH=gerrit_key

And change gerrit_key to the name of your SSH private key.

Set Your Gerrit Account Username

Next, open up the file vars.sh in your data repository (if you haven’t already), and change the following line in it:

export UPSTREAM_GERRIT_USER=jaypipes-testing

And replace jaypipes-testing with your Gerrit service account username.

Set Your Vendor Name in the Test Jenkins Job

(Optional) Create Your Own Jenkins SSH Key Pair

I have a private/public SSH key pair (named jenkins_key[.pub]) in the example data repository. Because I’ve put the private key in there, it’s no longer useful as anything other than an example, so you may want to create your own. Do so like so:

The above should create an SSL self-signed certificate for Apache to run Jenkins UI with, and then install Jenkins, Jenkins Job Builder, Zuul, Nodepool Scripts, and a bunch of support packages.

Important Note: Since publishing this article, the upstream Zuul system underwent a bit of a refactoring, with the Zuul git-related activities being executed by a separate Zuul worker process called zuul-merger. I’ve updated the Puppet modules in the os-ext-testing repository accordingly, but if you had installed the Jenkins master with Zuul from the Puppet modules before Tuesday, February 18th, 2014, you will need to do the following on your master node to get everything reconfigured properly:

When Puppet completes, go ahead and open up the Jenkins web UI, which by default will be at http://$HOST_IP:8080. You will need to enable the Gearman workers that Zuul and Jenkins use to interact. To do this:

Click the `Manage Jenkins` link on the left

Click the `Configure System` link

Scroll down until you see “Gearman Plugin Config”. Check the “Enable Gearman” checkbox.

Note: Darragh O’Reilly noticed when he first did this on his machine, that the Gearman plugin was not actually enabled (though it was installed). He mentioned that simply restarting the Jenkins service fixed this problem, and the Gearman Plugin Config section then appeared on the Manage Jenkins -> Configure System page.

Once you are done with that, it’s time to load up your Jenkins jobs and start Zuul:

If you refresh the main Jenkins web UI front page, you should now see two jobs show up:

Jenkins Master Web UI Showing Sandbox Jenkins Jobs Created by JJB

Testing Communication Between Upstream and Your Master

Congratulations. You’ve successfully set up your Jenkins master. Let’s now test connectivity between upstream and our external testing platform using the simple sandbox-noop-check-communication job. By default, I set this Jenkins job to execute on the master node for the openstack-dev/sandbox project [1]. Here is the project configuration in the example data repository’s etc/jenkins_jobs/config/projects.yaml file:

Note that the node is master by default. The sandbox-dsvm-tempest-full Jenkins Job is configured to run on a node labeled devstack_slave, but we will cover that later when we bring up our Jenkins slave.

By default, the only job that is enabled is the sandbox-noop-check-communication Jenkins job, and it will get run whenever a patchset is created in the upstream openstack-dev/sandbox project, as well as any time someone adds a comment with the words “recheck no bug” or “recheck bug XXXXX”. So, let us create a sample patch to that project and check to see if the sandbox-noop-check-communication job fires properly.
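As an aside, the comment matching described above can be sketched with a regular expression. The pattern below is a simplified one of my own, not the pattern from the actual Zuul layout (which is longer and handles more variations); the names RECHECK and is_recheck are made up for this illustration.

```python
import re

# Simplified, hypothetical pattern: accepts "recheck no bug" or
# "recheck bug <number>" on a line of its own.
RECHECK = re.compile(
    r"^\s*recheck (no bug|bug \d+)\s*$",
    re.IGNORECASE | re.MULTILINE)


def is_recheck(comment):
    """Return True if any line of the comment asks for a recheck."""
    return RECHECK.search(comment) is not None


print(is_recheck("recheck no bug"))
print(is_recheck("Patch Set 2:\n\nrecheck bug 1234567"))
print(is_recheck("Looks good, no recheck needed"))
```

The first two comments are treated as recheck requests and the third is not, because the keyword has to start its own line.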

Before we do that, let’s go ahead and tail the Zuul debug log, grepping for the term “sandbox”. This will show messages if communication is working properly.

sudo tail -f /var/log/zuul/debug.log | grep sandbox

OK, now create a simple test patch in sandbox. Do this on your development workstation, not your Jenkins master:

If you go to the link to the code review in Gerrit (the link that was output after you ran git review), you will see your Gerrit testing account has added a +1 Verified vote in the code review:

Successful communication between upstream and our external system

Congratulations. You now have an external testing platform that is receiving events from the upstream Gerrit system, triggering Jenkins jobs on your master Jenkins server, and writing reviews back to the upstream Gerrit system. The next article goes over adding a Jenkins slave to your system, which is necessary to run real Jenkins jobs that run devstack-based gate tests. Please do let me know what you think of both this article and the source repository of scripts to set things up. I’m eager for feedback and critique.

[1] The OpenStack Sandbox project is a project that can be used for testing the integration of external testing systems with upstream. By creating a patch against this project, you can trigger the Jenkins jobs that are created during this tutorial.

This post describes in detail the upstream OpenStack continuous integration platform. In the process, I’ll be describing the code flow in the upstream system — from the time the contributor submits a patch to Gerrit, all the way through the creation of a devstack environment in a virtual machine, the running of the Tempest test suite against the devstack installation, and finally the reporting of test results and archival of test artifacts. Hopefully, with a good understanding of how the upstream tooling works, setting up your own linked external testing platform will be easier.

Some History and Concepts

Over the past four years, there has been a steady evolution in the way that the source code of OpenStack projects is tested and reviewed. I remember when we used Bazaar for source control and Launchpad merge proposals for code review. There was no automated or continuous testing to speak of in those early days, which put pressure on core reviewers to do testing of proposed patches locally. There was also no standardized integration test suite, so often a change in one project would inadvertently break another project.

Thanks to the work of many contributors, particularly those patient souls in the OpenStack Infrastructure team, today there is a robust platform supporting continuous integration testing for OpenStack and Stackforge projects. At the center of this platform are the Jenkins CI servers, the Gerrit git and patch review server, and the Zuul gating system.

The Code Review System

When a contributor submits a patch to one of the OpenStack projects, they push their code to the git server managed by Gerrit running on review.openstack.org. Typically, contributors use the git-review Git plugin, which simplifies submitting to a git server managed by Gerrit. Gerrit controls which users or groups are allowed to propose code, merge code, and administer code repositories under its management. When a contributor pushes code to review.openstack.org, Gerrit creates a Changeset representing the proposed code. The original submitter and any other contributors can push additional amendments to that Changeset, and Gerrit collects all of the changes into the Changeset record. Here is a shot of a Changeset under review. You can see a number of patches (changes) listed in the review screen. Each of those patches was an amendment to the original commit.

Individual patches amend the changeset

For each patch in Gerrit, there are three sets of “labels” that may be applied to the patch. Anyone can comment on a Changeset and/or review the code. A review is shown on the patch in the Code-Review column in the patch “labels matrix”:

The “label matrix” on a Gerrit patch

Non-core team members may give the patch a Code-Review label of +1 (Looks good to me), 0 (No strong opinion), or -1 (I would prefer you didn’t merge this). Core team members can give any of those values, plus +2 (Looks good to me, approved) and -2 (Do not submit).

The other columns in the label matrix are Verified and Approved. Only non-interactive users of Gerrit, such as Jenkins, are allowed to add a Verified label to a patch. The external testing platform you will set up is one of these non-interactive users. The value of the Verified label will be +1 (check pipeline tests passed), -1 (check pipeline tests failed), +2 (gate pipeline tests passed), or -2 (gate pipeline tests failed).
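The four Verified values above boil down to a small lookup on the pipeline and the test outcome. Here is a minimal sketch of that mapping; the names VERIFIED_VOTES and verified_vote are my own, and a real external system would normally let Zuul's pipeline configuration decide the vote rather than compute it by hand.

```python
# (pipeline, tests passed?) -> Verified label value, as described above.
VERIFIED_VOTES = {
    ("check", True): 1,
    ("check", False): -1,
    ("gate", True): 2,
    ("gate", False): -2,
}


def verified_vote(pipeline, tests_passed):
    """Return the Verified label value for one pipeline run."""
    key = (pipeline, bool(tests_passed))
    if key not in VERIFIED_VOTES:
        raise ValueError("unknown pipeline: %r" % (pipeline,))
    return VERIFIED_VOTES[key]


print(verified_vote("check", True))
print(verified_vote("gate", False))
```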

Only members of the OpenStack project’s core team can add an Approved label to a patch. It is either a +1 (Approved) value or not, appearing as a check mark in the Approved column of the label matrix:

An approved patch.

Continuous Integration Testing

Continuous integration (CI) testing is the act of running tests that validate a full application environment on a continual basis — i.e. when any change is proposed to the application. Typically, when talking about CI, we are referring to tests that are run against a full, real-world installation of the project. This type of testing, called integration testing, ensures that proposed changes to one component do not cause failures in other components. This is especially important for complex multi-project systems like OpenStack, with non-trivial dependencies between subsystems.

When code is pushed to Gerrit, a series of jobs are triggered that run a series of tests against the proposed code. Jenkins is the server that executes and manages these jobs. It is a Java application with an extensible architecture that supports plugins that add functionality to the base server.

Each job in Jenkins is configured separately. Behind the scenes, Jenkins stores this configuration information in an XML file in its data directory. You may manually edit a Jenkins job as an administrator in Jenkins. However, in a testing platform as large as the upstream OpenStack CI system, doing so manually would be virtually impossible and fraught with errors. Luckily, there is a helper tool called Jenkins Job Builder (JJB) that constructs these XML configuration files after reading a set of YAML files and job templating rules. We will describe JJB later in the article.

The “Gate”

When we talk about “the gate”, we are talking about the process by which code is kept out of a set of source code branches if certain conditions are not met.

OpenStack projects use a method of controlling merges into certain branches of their source trees called the Non-Human Gatekeeper model [1]. Gerrit (the non-human) is configured to allow merges by users in a group called “Non-Interactive Users” to the master and stable branches of git repositories under its control. The upstream main Jenkins CI server, as well as Jenkins CI systems running at third party locations, are the users in this group.

So, how do these non-interactive users actually decide whether to merge a proposed patch into the target branch? Well, there is a set of tests (different for each project) — unit, functional, integration, upgrade, style/linting — that is marked as “gating” that particular project’s source trees. For most of the OpenStack projects, there are unit tests (run in a variety of different supported versions of Python) and style checker tests for HACKING and PEP8 compliance. These unit and style tests are run in Python virtualenvs managed by the tox testing utility.

In addition to the Python unit and style tests, there are a number of integration tests that are executed against full installations of OpenStack. The integration tests are simply subsets of the Tempest integration test suite. Finally, many projects also include upgrade and schema migration tests in their gate tests.

How Upstream Testing Works

Graphically, the upstream continuous integration gate testing system works like this:

We step through this event flow in detail below, referencing the numbered steps in bold.

The Gerrit Event Stream and Zuul

After a contributor has pushed (1a) a new patch to a changeset or a core team member has reviewed the patch and added an Approved +1 label (1b), Gerrit pushes out a notification event to its event stream (2). This event stream can have a number of subscribers, including the Gerrit Jenkins plugin and Zuul. Zuul was developed to manage the many complex graphs of interdependent branch merge proposals in the upstream system. It monitors in-progress jobs for a set of related patches and will pre-emptively cancel any dependent test jobs that would not succeed due to a failure in a dependent patch [2].
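To make the pre-emptive cancellation idea concrete, here is a deliberately simplified sketch. It is not Zuul's actual algorithm: it assumes a single ordered queue in which every change is tested on top of all the changes ahead of it, and the function name jobs_to_cancel is made up for this example.

```python
def jobs_to_cancel(queue, failed_change):
    """Return the changes queued behind failed_change.

    queue is an ordered list of change IDs. Each change is assumed to be
    tested on top of the ones before it, so a failure invalidates the
    test runs of everything that follows.
    """
    if failed_change not in queue:
        return []
    position = queue.index(failed_change)
    return queue[position + 1:]


queue = ["change-A", "change-B", "change-C"]
print(jobs_to_cancel(queue, "change-A"))
```

When the change at the head of the queue fails, the runs for the two changes behind it are no longer meaningful, so there is no point in letting them finish.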

In addition to this dependency monitoring, Zuul is responsible for constructing the pipelines of jobs that should be executed on various events. One of these pipelines is called the “gate” pipeline, appropriately named for the set of jobs that must succeed in order for a proposed patch to be merged into a target branch.

Zuul listens to the Gerrit event stream (3), and matches the type of event to one or more pipelines (4). The matching conditions for the gate pipeline are configured in the trigger:gerrit: section of the YAML snippet above:

The above indicates that Zuul should fire the gate pipeline when it sees reviews with an Approved +1 label, and any comment to the review that contains “reverify” with or without a bug identifier. Note that there is a similar pipeline that is fired when a new patchset is created or when a review comment is made with the word “recheck”. This pipeline is called the check pipeline. Look in the layout.yaml file for the configuration of the check pipeline.
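The matching rules just described can be sketched as a small dispatch function. This is only an illustration of the logic, not how Zuul implements it. The event type names patchset-created and comment-added come from Gerrit's event stream, but the simplified dictionary layout (comment, approvals) and the function name pipelines_for_event are assumptions made for this example.

```python
def pipelines_for_event(event):
    """Return the names of the pipelines that an event should trigger."""
    kind = event.get("type")
    if kind == "patchset-created":
        # A new patchset always goes through the check pipeline.
        return ["check"]
    if kind == "comment-added":
        comment = event.get("comment", "").lower()
        matched = []
        if "recheck" in comment:
            matched.append("check")
        approved = event.get("approvals", {}).get("Approved") == 1
        if approved or "reverify" in comment:
            matched.append("gate")
        return matched
    return []


print(pipelines_for_event({"type": "patchset-created"}))
print(pipelines_for_event(
    {"type": "comment-added", "approvals": {"Approved": 1}}))
```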

Once the appropriate pipeline is matched, Zuul executes (5) that particular pipeline for the project that had a patch proposed.

“But wait, hold up…“, you may be asking yourself, “how does Zuul know which Jenkins jobs to execute for a particular project and pipeline?“. Great question!

Also in the layout.yaml file, there is a section that configures which Jenkins jobs should be run for each project. Let’s take a look at the configuration of the gate pipeline for the Cinder project:

Each of the lines in the gate: section indicates a specific Jenkins job that should be run in the gate pipeline for Cinder. In addition, there is the python-jobs item in the template: section. Project templates are a way that Zuul consolidates configuration of many similar jobs into a simple template configuration. The project template definition for python-jobs looks like this (still in layout.yaml):

So, on determining which Jenkins jobs should be executed for a particular pipeline, Zuul sees the python-jobs project template in the Cinder configuration and expands that to execute the following Jenkins jobs:

gate-cinder-docs

gate-cinder-pep8

gate-cinder-python26

gate-cinder-python27

Jenkins Job Creation and Configuration

I previously mentioned that the configuration of an individual Jenkins job is stored in a config.xml file in the Jenkins data directory. Now, at last count, the upstream OpenStack Jenkins CI system has just shy of 2,000 jobs. It would be virtually impossible to manage the configuration of so many jobs using human-based processes. To solve this dilemma, the Jenkins Job Builder (JJB) python tool was created. JJB consumes YAML files that describe both individual Jenkins jobs as well as templates for parameterized Jenkins jobs, and writes the config.xml files for all Jenkins jobs that are produced from those templates. Important: Note that Zuul does not construct Jenkins jobs. JJB does that. Zuul simply configures which Jenkins jobs should run for a project and a pipeline.

There is a master projects.yaml file in the same directory that lists the “top-level” definitions of jobs for all projects, and it is in this file that many of the variables that are used in job template instantiation are defined (including the {name} variable, which corresponds to the name of the project).

When JJB constructs the set of all Jenkins jobs, it reads the projects.yaml file, and for each project, it sees the “name” attribute of the project, and substitutes that name attribute value wherever it sees {name} in any of the jobs that are defined for that project. Let’s take a look at the Cinder project’s definition in the projects.yaml file here:

You will note one of the items in the jobs section is called python-jobs. This is not a single Jenkins job, but a job group. A job group definition is merely a list of jobs or job templates. Let’s take a look at the definition of the python-jobs job group:

Each of the items listed in the jobs section of the python-jobs job group definition above is a job template. Job templates are expanded in the same way as Zuul project templates and JJB job groups are expanded. Let’s take a look at one such job template in the list above, called gate-{name}-python27.
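The two-step expansion (job group to job templates, then {name} substitution) can be sketched in a few lines of Python. This only mimics the idea and is not JJB's code. The group contents below are abbreviated, the gate-{name}-docs entry in the project's job list is a made-up example, and the function name expand_jobs is my own.

```python
JOB_GROUPS = {
    # Hypothetical, abbreviated group; the real one lists more templates.
    "python-jobs": ["gate-{name}-pep8", "gate-{name}-python27"],
}

PROJECT = {
    "name": "cinder",
    # An entry is either the name of a job group or a job template.
    "jobs": ["python-jobs", "gate-{name}-docs"],
}


def expand_jobs(project, job_groups):
    """Resolve job groups, then substitute {name} in each job template."""
    names = []
    for entry in project["jobs"]:
        templates = job_groups.get(entry, [entry])
        for template in templates:
            names.append(template.format(name=project["name"]))
    return names


for job in expand_jobs(PROJECT, JOB_GROUPS):
    print(job)
```

Each resulting name corresponds to one concrete Jenkins job, which is why a handful of templates can fan out into a very large number of jobs across all projects.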

The python-jobs.yaml file in the modules/openstack_project/files/jenkins_job_builder/config directory contains the definition of common Python project Jenkins job templates. One of those job templates is gate-{name}-python27:

Looking through the above job template definition, you will see a section called “builders“. The builders section of a job template lists (in sequential order of expected execution) the executable sections or scripts of the Jenkins job. The first executable section in the gate-{name}-python27 job template is called “gerrit-git-prep“. This executable section is defined in macros.yaml, which contains a number of commonly-run scriptlets. Here’s the entire gerrit-git-prep macro definition:

So, gerrit-git-prep is simply executing a Bash script called “gerrit-git-prep.sh” that is stored in the /usr/local/jenkins/slave_scripts/ directory. Let’s take a look at that file. You can find it in the /modules/jenkins/files/slave_scripts/ directory [3] in the same OpenStack Infra Config project:

The purpose of the script above is simple: Check out the source code of the proposed Gerrit changeset and ensure that the source tree is clean of any cruft from a previous run of a Jenkins job that may have run in the same Jenkins workspace. The concept of a workspace is important. When Jenkins runs a job, it must execute that job from within a workspace. The workspace is really just an isolated shell environment and filesystem directory that has a set of shell variables export’d inside it that indicate a variety of important identifiers, such as the Jenkins job ID, the name of the source code project that has triggered a job, the SHA1 git commit ID of the particular proposed changeset that is being tested, etc [4].

The next builder in the job template is the “python27” builder, which has two variables injected into itself:

- python27:
    github-org: '{github-org}'
    project: '{name}'

The github-org variable is set to the value of the existing {github-org} variable. The project variable is populated with the value of the {name} variable. Here’s how the python27 builder is defined (in macros.yaml):

In short, for the Python 2.7 builder, the above runs the command tox -epy27 and then runs a prettifying script and gzips up the results of the unit test run. And that’s really the meat of the Jenkins job. We will discuss the publishing of the job artifacts a little later in this article, but if you’ve gotten this far, you have delved deep into the mines of the OpenStack CI system. Congratulations!

Devstack-Gate and Running Tempest Against a Real Environment

OK, so unit tests running in a simple Jenkins slave workspace are one thing. But what about Jenkins jobs that run integration tests against a full set of OpenStack endpoints, interacting with real database and message queue services? For these types of Jenkins jobs, things are more complicated. Yes, I know. You probably think things have been complicated up until this point, and you’re right! But the simple unit test jobs above are just the tip of the proverbial iceberg when it comes to the OpenStack CI platform.

For these complex Jenkins jobs, an additional set of tools is added to the mix:

Devstack-Gate — Scripts that create an OpenStack environment with Devstack, run tests against that environment, and archive logs and results

Assignment of a Node to Run a Job

Different Jenkins jobs require different workspaces, or environments, in which to run. For basic unit or style-checking test jobs, like the gate-{name}-python27 job template we dug into above, not much more is needed than a tox-managed virtualenv running in a source checkout of the project with a proposed change. However, for Jenkins jobs that run a series of integration tests against a full OpenStack installation, a workspace with significantly more resources and isolation is necessary. For these latter types of jobs, the upstream CI platform uses a pool of virtual machine instances. This pool of virtual machine instances is managed by a tool called nodepool. The virtual machines run in both HP Cloud and Rackspace Cloud, who graciously donate these instances for the upstream CI system to use. You can see the configuration of the Nodepool-managed set of instances here.

Instances that are created by Nodepool run Jenkins slave software, so that they can communicate with the upstream Jenkins CI master servers. A script called prepare_node.sh runs on each Nodepool instance. This script just git clones the OpenStack Infra config project to the node, installs Puppet, and runs a Puppet manifest that sets up the node based on the type of node it is. There are bare nodes, nodes that are meant to run Devstack to install OpenStack, and nodes specific to the Triple-O project. The node type that we will focus on here is the node that is meant to run Devstack. The script that runs to prepare one of these nodes is prepare_devstack_node.sh, which in turn calls prepare_devstack.sh. This script caches all of the repositories needed by Devstack, along with Devstack itself, in a workspace cache on the node. This workspace cache is used to enable fast reset of the workspace that is used during the running of a Jenkins job that uses Devstack to construct an OpenStack environment.

Devstack-Gate

The Devstack-Gate project is a set of scripts that are executed by certain Jenkins jobs that need to run integration or upgrade tests against a realistic OpenStack environment. Going back to the Cinder project configuration in the Zuul layout.yaml file:

Note the highlighted line. That Jenkins job template is one such job that needs an isolated workspace that has a full OpenStack environment running on it. Note that “dsvm” stands for “Devstack virtual machine”.

Not all that complicated. It exports some environment variables and copies the devstack-vm-gate-wrap.sh script out of the devstack-gate repo that was clone’d in the devstack-checkout macro to the work directory and then runs that script.

Construction of OpenStack Environment with Devstack

The devstack-vm-gate.sh script is responsible for constructing a full OpenStack environment and running integration tests against that environment. To construct this OpenStack environment, it uses the excellent Devstack project. Devstack is an elaborate series of Bash scripts and functions that clones each OpenStack project source code into /opt/stack/new/$project [5], runs python setup.py install in each project checkout, and starts each relevant OpenStack service (e.g. nova-compute, nova-scheduler, etc) in a separate Linux screen session.

Devstack’s creation script (stack.sh) is called from the script after creating the localrc file that stack.sh uses when constructing the Devstack environment.

You will note that the $DEVSTACK_GATE_TEMPEST_FULL Bash environment variable was set to “1” in the gate-tempest-dsvm-full Jenkins job builder scriptlet.

sudo -H -u tempest tox -efull triggers the execution of Tempest’s integration test suite. Tempest is the collection of canonical OpenStack integration tests that are used to validate that OpenStack APIs work according to spec and that patches to one OpenStack service do not inadvertently cause failures in another service.

If you are curious what actual commands are run, you can check out the tox.ini file in Tempest:

[testenv:full]
# The regex below is used to select which tests to run and exclude the slow tag:
# See the testrepository bug: https://bugs.launchpad.net/testrepository/+bug/1208610
commands =
  bash tools/pretty_tox.sh '(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario|thirdparty|cli)) {posargs}'
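
To get a feel for what that regex selects, here is a minimal Python sketch. It assumes the filter is applied as a search against each test ID and that a test's tags appear in square brackets at the end of the ID. The test IDs below are hypothetical and made up for illustration; they are not taken from Tempest.

```python
import re

# Selection regex copied from the [testenv:full] section of tox.ini.
FULL_FILTER = re.compile(
    r'(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario|thirdparty|cli))'
)

# Hypothetical test IDs; the bracketed suffix holds the test's tags.
test_ids = [
    "tempest.api.compute.test_servers.ServersTest.test_list[gate,smoke]",
    "tempest.scenario.test_snapshot.SnapshotTest.test_pattern[gate,slow]",
    "tempest.stress.test_load.LoadTest.test_many[gate]",
    "tempest.cli.test_nova.NovaCliTest.test_help",
]


def selected(test_id):
    """Return True if the filter would pick this test ID."""
    return FULL_FILTER.search(test_id) is not None


# Keep only the IDs that live under api/scenario/thirdparty/cli
# and that do not carry the "slow" tag.
chosen = [t for t in test_ids if selected(t)]
for t in chosen:
    print(t)
```

The negative lookahead rejects any ID whose bracketed tag list contains the whole word "slow", and the anchored group keeps only tests under the api, scenario, thirdparty and cli packages. In this sketch, the first and last IDs are printed.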

Archival of Test Artifacts

The final piece of the puzzle is archiving all of the artifacts from the Jenkins job execution. These artifacts include log files from each individual OpenStack service running in Devstack’s screen sessions, the results of the Tempest test suite runs, as well as echo’d output from the devstack-vm-gate* scripts themselves.

Conclusion

I hope this article has helped you understand a bit more about how the OpenStack continuous integration platform works. We’ve stepped through the flow between the various components of the platform, including which events trigger what actions in each component. You should now have a good idea of how the various parts of the upstream CI infrastructure are configured and where to look in the source code for more information.

The next article in this series discusses how to construct your own external testing platform that is linked with the upstream OpenStack CI platform. Hopefully, this article provides you with most of the background information you need to understand the steps and tools involved in constructing that external testing platform.

[1] The link describes and illustrates the non-human gatekeeper model with Bazaar, but the same concept is applicable to Git. See the OpenStack GitWorkflow pages for an illustration of the OpenStack specific model.

[2] Zuul really is a pretty awesome bit of code kit. Jim Blair, the author, does an excellent job of explaining the merge proposal dependency graph and how Zuul can “trim” dead-end branches of the dependency graph in the Zuul documentation.

[3] Looking for where a lot of the “magic” in the upstream gate happens? Take an afternoon to investigate the scripts in this directory.

[4] Gerrit Jenkins plugin and Zuul export a variety of workspace environment variables into the Jenkins jobs that they trigger. If you are curious what these variables are, check out the Zuul documentation on parameters.

[5] The reason the projects are installed into /opt/stack/new/$project is because the current HEAD of the target git branch for the project is installed into /opt/stack/old/$project. This is to allow an upgrade test tool called Grenade to test upgrade paths.

Martin Stoyanov recently asked an excellent question about the proper way to push revisions to a changeset in Gerrit. This is a very common question that folks new to Gerrit or Git have, and I think it deserves its own post.

When you do the first push of a local working branch to Gerrit, the act of pushing your code creates a Gerrit changeset. The changeset can be reviewed, and in the process of doing that review, it’s common for reviewers to request that the submitter make some changes to the code. Sometimes these changes are stylistic or cosmetic. Other times, the requested modifications can be extensive.

How you handle making the requested modifications and submitting those changes back to Gerrit depends on a few things:

Are the changes requested mostly stylistic or cosmetic?

Are the changes requested going to provide additional functionality that is dependent on the existing changeset?

Are the changes requested going to provide additional functionality that is independent of the existing changeset?

Depending on the answers to the above questions, you should either amend the existing changeset commit, push a new commit to Gerrit from the same local branch, or push a new commit from a new local branch. Here are some quick guidelines to help you decide:

The Changes Requested Are Cosmetic or Style-related

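When a reviewer is providing some stylistic advice or offering suggestions for cosmetic changes or cleanups, you should amend the original commit. Do so by making the fixes in the same local branch, running git commit --amend to fold them into the original commit, and then running git review.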

While this looks fairly simple (and it is…), many folks make a fatal mistake when they modify the commit message and add sections that describe the “sausage-making” involved in the cleanups. DO NOT do this. It’s not necessary. Avoid adding any lines to the commit message that look like this:

“Cleaned up whitespace”

“DRY’d up some stuff based on review comments”

“Fixed typos found during reviews”

If all you did was correct typos and whitespace, simply leave the commit message as it was originally. After the call to git review, you will see a new patchset appear in the original code review. This is expected. The changeset is still viewed by Gerrit and reviewers as a single changeset, and reviewers may even select “Patchset 1” from the “Old Version History” dropdown instead of “Base” in order to see only the changes made in this last amended commit.

The Changes Requested Are Extensive and Depend on Original Commit

If a reviewer has asked for modifications to your original code, and the requested modifications are fairly extensive and depend on the code in your original commit, you have three choices:

Amend the original commit to include all new changes

Amend the original commit for some things, push those changes to the original commit, make additional changes in the same local branch, git commit those additional changes and git review

Lobby for your original commit to be accepted as-is; once your change is accepted and merged into master, create a new branch from master and push the additional changes in a new changeset

Whenever you do not amend a commit and issue a call to git review, you will create a dependent changeset. Gerrit will assign a new Change-Id to the patchset, but understands that the commit logically follows your original changeset’s code. If you go to the code review screen of your newly-created changeset, you will see your original changeset referenced in the “Dependencies” section. Below, you can see a screenshot of a changeset that is part of a “dependency chain”. Another patchset is dependent on this patchset and this patchset is dependent on another patchset.

Changeset showing a dependent changeset and a changeset dependent on this one (chained dependent patchsets)

It’s best to avoid long chains of dependent patchsets. The reason is that if a reviewer requests changes for one of the changesets at the “bottom” of the dependency chain, the entire chain of changesets (even changesets that are approved like the one shown above) is going to be held up from going through the gate tests.
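
To make the cost of a long chain concrete, here is a toy Python model. It is only a sketch of the rule described above, not how Gerrit or Zuul actually implement it, and the Change class and can_enter_gate function are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Change:
    name: str
    approved: bool
    parent: Optional["Change"] = None  # the changeset this one depends on


def can_enter_gate(change):
    """A change may be gated only if it and every ancestor are approved."""
    node = change
    while node is not None:
        if not node.approved:
            return False
        node = node.parent
    return True


# A reviewer has requested changes on the changeset at the bottom of the chain.
bottom = Change("A", approved=False)
middle = Change("B", approved=True, parent=bottom)
top = Change("C", approved=True, parent=middle)

blocked = [c.name for c in (bottom, middle, top) if not can_enter_gate(c)]
print(blocked)
```

Even though B and C are approved in this model, every changeset in the chain is blocked until A is approved. The longer the chain, the more work a single review comment near the bottom can hold up.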

The Changes Requested Are Extensive but are Independent of the Original Commit

If a reviewer has requested extensive changes, but points out that the changes they want made are actually independent of the changes in your original commit, reviewers will generally ask the submitter to wait until the original changeset is merged and then create a new branch for the additional work. Normally, depending on the extent of the requested changes, reviewers will insist that the submitter create a new bug or blueprint on Launchpad to keep track of the additional work they feel is needed.
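
The three cases above can be summarised as a small decision function. This is just a rough sketch of the guidelines in this post. The push_strategy function is a hypothetical helper, and real reviews often involve more negotiation than two yes/no answers.

```python
def push_strategy(cosmetic, depends_on_original):
    """Map the review questions to the suggested way of pushing revisions."""
    if cosmetic:
        # Style or cleanup requests: fold the fixes into the existing commit.
        return "amend the original commit"
    if depends_on_original:
        # Extensive work that builds on the original changeset.
        return "amend, or push a dependent commit from the same local branch"
    # Extensive work that stands on its own.
    return "wait for the merge, then push from a new local branch"


print(push_strategy(cosmetic=True, depends_on_original=True))
print(push_strategy(cosmetic=False, depends_on_original=True))
print(push_strategy(cosmetic=False, depends_on_original=False))
```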

Conclusion

It’s up to the discretion of core reviewers and the original submitter to work out which of the above solutions works best for the particular changeset. Each changeset introduces code and functionality that must be treated differently, and changesets from one submitter may be dealt with differently than others. However, to keep things simple for yourself and upstream reviewers, it’s best to follow this simple advice:

Prefer to amend the original commit. In most cases, this is the appropriate solution to push revisions to Gerrit.

Don’t include sausage-making comments in the commit message.

Prefer free-standing changesets to long chains of dependent patches.

Ask reviewers what their preferences are.

Follow those guidelines and you’ll keep yourself out of the weeds. For more detailed information, including strategies for handling updates to your code that is dependent on another branch of code that gets updated, see the excellent OpenStack GerritWorkflow documentation.

Honestly, if you are a developer of a program that has a dialog box that asks the user if they want to exit the program when they’ve clicked the close-window button, you should be sent to remedial programming and UI design class.

I’m looking at you … developers of aMule and AT&T’s Global Network Client.

For those of you new to Gerrit, remember that when you make inline comments in a patch review, you need to click the “Review” button and then the “Publish Comments” button in order for your comments to be visible to others on the review. Otherwise, your inline comments stay in Draft mode!