Introducing PuppetDB: Put Your Data to Work

PuppetDB is the next-generation open source storage service for Puppet-produced data. Today, this includes catalogs and facts, and will be extended in the near future. The initial release provides a drop-in replacement for both storeconfigs and inventory service.

We’ve designed PuppetDB to empower Puppet deployments, and built it from the ground up with performance in mind. It’s built on technologies known for their performance, and is highly parallel, making full use of available resources. It also stores all of its data asynchronously, freeing up the master to go compile more catalogs. Beyond that, we’ve devoted copious time to benchmarking and optimizing the performance.

Why PuppetDB?

The most immediate benefit of PuppetDB is improved performance for storeconfigs users, but even for others, it has a lot to offer. As a centralized store, PuppetDB knows about every node, resource, relationship, and fact across your entire infrastructure. All this information is easily queryable, so you can integrate it into your tools and workflow, or just satisfy your curiosity. It also provides a platform on which powerful new tooling will be built.

And if you’re not using storeconfigs, you should be. At its heart, storeconfigs can be thought of as “higher-order Puppet”: a way for multiple nodes to interact with each other through Puppet, which is an immensely powerful feature. In any case where one node needs to know what another node is doing, storeconfigs can help.

For instance, storeconfigs can be used to configure a monitoring service, without knowing upfront any of the nodes or services being monitored. Each node to be monitored can simply define what ought to be checked, and those checks can be collected on the node doing the monitoring. Or it can be used to share SSH authorized keys, by having each node export its key, and collect everyone else’s.
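As a concrete sketch of the monitoring pattern, here is what that looks like in Puppet code, assuming Nagios and the built-in nagios_service type (the tag name and check details are illustrative):

```puppet
# On every monitored node: export a service check for this host.
# The leading @@ marks the resource as exported (stored, not applied here).
@@nagios_service { "check_ssh_${::fqdn}":
  check_command       => 'check_ssh',
  host_name           => $::fqdn,
  service_description => "SSH on ${::fqdn}",
  use                 => 'generic-service',
  tag                 => 'ssh-check',
}

# On the monitoring node: collect every exported check in one line.
Nagios_service <<| tag == 'ssh-check' |>>
```

The monitoring node never needs a hand-maintained list of hosts; adding a node to the infrastructure automatically adds its checks on the next Puppet run.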

Built for performance

Let’s talk about performance. I told you it was a key design goal, but just how much faster is PuppetDB than the existing solution? To find out, I ran an experiment against the old, ActiveRecord storeconfigs implementation.

I compiled and saved a catalog of 650 resources, using an initially empty PostgreSQL database. Compilation took 5.6 seconds. With nothing in the database, it took 53 seconds to store the catalog. That’s brushing right up against the agent’s timeout, risking an outright failure. With the database primed, I submitted the same catalog a second time, unmodified, which took 4 seconds.

To see how PuppetDB performs, we have much more information available: the service is highly instrumented, keeping metrics on every aspect of its performance, all of which are made available over HTTP and JMX.

This is the PuppetDB dashboard, which uses the HTTP metrics API to give an overview of the current state of the system. The dashboard comes built-in, and updates live, even on your mobile device! Taking a look at this screenshot (taken from our internal PuppetDB instance), we can see the backlog of work, how long command processing is taking, how much work has been done, how large the database is, and much more. And yet this is still only a small subset of the metrics we track and make available.

In particular, we see that the queue is empty, meaning PuppetDB is keeping up with demand. Looking at the number of nodes and resources in the population, we can easily calculate that the average size of a catalog is ~670 resources. The average time to process a command is 394ms. This is around 130x faster than the worst case time of old-school storeconfigs, and 10x better than the case where catalogs are already present. We also see that PuppetDB is responding to storeconfigs queries in only 65ms.
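The arithmetic above is easy to reproduce. In this sketch, the node and resource counts are hypothetical stand-ins (the screenshot’s exact figures aren’t repeated here), but the timing comparison uses the measurements quoted in this post:

```ruby
# Hypothetical population figures (stand-ins, not the screenshot's values):
nodes     = 100
resources = 67_000
avg_catalog_size = resources / nodes          # ~670 resources per catalog

# Timings quoted in this post, in seconds:
cold_storeconfigs = 53.0    # first store into an empty database
warm_storeconfigs = 4.0     # re-storing the same catalog
puppetdb_command  = 0.394   # average PuppetDB command processing time

cold_speedup = cold_storeconfigs / puppetdb_command   # ~134x
warm_speedup = warm_storeconfigs / puppetdb_command   # ~10x
```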

Admittedly, these numbers aren’t perfectly comparable; for instance, the very first catalog stored in PuppetDB may take some extra time, while storing an unchanged catalog takes negligible time. But they give some indication of the scale of improvement we’re talking about. It’s also important to note that all of this storage is asynchronous, freeing the master to continue serving catalogs. Previously, the master would have been tied up waiting for storeconfigs.

Reliable data store

So we can see that PuppetDB stores your data more quickly, but what about the data itself? After all, that’s what you really care about. PuppetDB makes a few promises about its data: it will be complete, it will be accurate, and it will be current.

Every aspect of the catalog is stored, including edges and unexported resources, which are omitted in old storeconfigs and the popular thin_storeconfigs mode respectively. Nuances of the catalog like resource aliases are also respected, ensuring that every resource and edge is present and accurately represented.

It’s downright difficult to lose your data with PuppetDB. It takes great care not to let that happen: incoming data is accepted into a persistent queue, and each command is tried up to sixteen times (even across service restarts), ensuring that if the data is good, it will make it into the database. And if a command somehow still doesn’t succeed, it is saved away with plenty of forensic data for later investigation and reprocessing.
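The retry-then-quarantine discipline can be sketched in a few lines of Ruby. This is an illustration of the idea, not PuppetDB’s actual implementation (the method and variable names here are invented):

```ruby
# Try a command up to MAX_ATTEMPTS times; if it still fails, set it
# aside with forensic detail instead of silently dropping it.
MAX_ATTEMPTS = 16

def process_command(command, dead_letters, &handler)
  attempts = 0
  begin
    attempts += 1
    handler.call(command)
    :ok
  rescue StandardError => e
    retry if attempts < MAX_ATTEMPTS
    dead_letters << { command: command, error: e.message, attempts: attempts }
    :discarded
  end
end
```

A transient failure (a database hiccup, say) is absorbed by the retries; only a command that fails all sixteen attempts ends up in the dead-letter store for later reprocessing.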

In that vein, when configured to use PuppetDB, Puppet will refuse to serve catalogs if PuppetDB is down and the catalog can’t be persisted. This means the data PuppetDB has will always be current; an agent will never use a catalog that PuppetDB doesn’t know about.

And it’s secure. All communication between the puppet master and PuppetDB happens over SSL, authenticated with the same certificates as used for communication between puppet master and agents. Similarly, if PuppetDB and its database are separate, it’s a simple matter to secure their connection.

Plays well with others

PuppetDB is a key component of the Puppet Data Library, and exposes that data through its query API. Resources, facts, nodes, and metrics can all be queried over HTTP. For resources and nodes, there is a simple query language which can be used to form arbitrarily complex requests. The public API is the same one Puppet itself uses to make storeconfigs queries of PuppetDB (via the <<| |>> collection operator), but it provides a superset of the functionality offered by storeconfigs. The API is fully documented and versioned, for use in scripts, Faces, or custom Puppet functions.
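To give a feel for the query language, here is a hedged sketch of building a resources query from a script. The host and port are deployment-specific placeholders; the query follows the documented prefix-notation style, asking for every User resource whose ensure parameter is "present":

```ruby
require 'json'
require 'uri'

# A query is a nested array: an operator followed by its operands.
query = ['and',
         ['=', 'type', 'User'],
         ['=', %w[parameter ensure], 'present']]

# Serialize to JSON and URL-encode it into the resources endpoint.
query_json = JSON.generate(query)
url = 'http://puppetdb.example.com:8080/resources?query=' +
      URI.encode_www_form_component(query_json)
```

An HTTP GET against that URL (with an Accept: application/json header) returns the matching resources as JSON, ready for use in whatever tooling you’re building.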

PuppetDB is faster, smarter, and has more complete data than ever before. If you’re a current storeconfigs user, there’s no reason not to try it out immediately. If you don’t use storeconfigs (and especially if performance was the reason), now is the time to start. We know that storeconfigs, while a powerful and important feature, has historically been a pain point for users. One of the goals of PuppetDB is to alleviate that pain, and personally, I want a world in which everyone uses storeconfigs and loves it. PuppetDB offers great power over and insight into your infrastructure, and it’s only going to get bigger and better.