binford2k.com

Impact Analysis of Puppet Modules

Have you ever wondered who’s using your Puppet modules? Or have you hesitated
before changing a class parameter because you don’t really know how many people
will be affected downstream? Maybe you hesitated before deprecating a barely
supported and almost certainly unused subclass because… well, you didn’t really
know for sure that it was unused.

Rangefinder is the
tool for you. Just run it on the source code you’re working on and it will tell
you who might be affected.

The tool is basically a glorified database client. It works by identifying the
component that each source file defines and then querying for usage of that
component. It can recognize Puppet types, functions, classes, and defined types.

The data used to identify downstream dependents comes from a public BigQuery database
containing indexed and aggregated data from both the Forge and GitHub. Here’s the
query behind that command, right in the GCP console:

As you can see, Rangefinder uses both the source and the repo columns to
tailor how it displays results. Rows in which the source column matches the
metadata of the module you’re running the command from are displayed as
exact (WILL impact) matches, and rows that don’t match are displayed as possible
(MAY impact) matches. We’ll talk more shortly about what that means.
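That exact-versus-near distinction is simple enough to sketch in a few lines of Ruby. This is hypothetical, simplified logic, not Rangefinder’s actual code, and the module and repo names are made up:

```ruby
# Hypothetical sketch: a result row counts as an "exact" match when its
# source column matches the repository recorded in the metadata of the
# module being analyzed; anything else is only a "near" (possible) match.
def classify(row, module_source)
  row[:source] == module_source ? :exact : :near
end

our_source = 'https://github.com/puppetlabs/puppetlabs-concat'
rows = [
  { module: 'example-profiles', source: 'https://github.com/puppetlabs/puppetlabs-concat' },
  { module: 'example-mystery',  source: 'https://github.com/example/something-else' },
]

rows.each do |row|
  puts "#{row[:module]} => #{classify(row, our_source)}"
end
```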

Gathering data

But first, let’s talk about how the data is collected. Each week a cron job runs
a simple data aggregation tool.
This does several things.

It mirrors Puppet-related data from the public GitHub datasets so that our
queries run against smaller, sub-terabyte tables instead of the full datasets,
which is much easier on our budget.

It gathers and flattens public data from the Puppet Forge into an easily queryable form.

It downloads each new release and runs certain kinds of static analysis against it.

This allows you to do things like retrieve forward and backward dependencies,
or to join data from the Forge and GitHub. For example, have you ever wondered
how many Forge modules define new native types (and are hosted on GitHub)?
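That question is essentially a join between the two tables. As a toy illustration of the idea, here it is in plain Ruby over in-memory rows standing in for the BigQuery tables; all of the module names, repos, and counts are made up:

```ruby
# Made-up stand-ins for the flattened Forge table and the GitHub-derived
# static-analysis table, keyed by repository.
forge_modules = [
  { name: 'puppetlabs-concat', repo: 'puppetlabs/puppetlabs-concat' },
  { name: 'example-widget',    repo: 'example/puppet-widget' },
  { name: 'example-plain',     repo: nil },  # not hosted on GitHub
]
native_type_counts = {
  'puppetlabs/puppetlabs-concat' => 0,
  'example/puppet-widget'        => 2,
}

# "Which Forge modules define native types and are hosted on GitHub?"
with_native_types = forge_modules.select do |mod|
  mod[:repo] && native_type_counts.fetch(mod[:repo], 0) > 0
end
puts with_native_types.map { |m| m[:name] }
```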

Itemization

The coolest part, to me at least, is the static analysis it does. This uses my
puppet-itemize gem, which
deconstructs Puppet manifests into all the types, classes, resources, and functions
that they declare or invoke. Because it’s not compiling, it doesn’t care about
conditional logic and effectively just returns a list of all items referenced in
the source code, regardless of the code path.
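Here’s a drastically simplified sketch of that idea in Ruby. The real puppet-itemize gem walks the parsed AST; this regex approximation just shows that declarations on both branches of a conditional get counted, and the manifest here is an invented example:

```ruby
# Toy itemizer: collect every resource-like declaration, ignoring which
# code path it sits on. (puppet-itemize does this properly via the AST.)
manifest = <<~PUPPET
  class profile::motd {
    if $facts['os']['family'] == 'Debian' {
      concat::fragment { 'motd debian': target => '/etc/motd' }
    } else {
      concat::fragment { 'motd other': target => '/etc/motd' }
    }
  }
PUPPET

items = manifest.scan(/^\s*([a-z][\w:]*)\s*\{/).flatten.tally
p items  # both fragments are counted, conditional or not
```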

If I run Puppet Itemize against the first module listed in the puppetlabs/concat
Rangefinder results, I see this:

The result of this analysis is saved into the BigQuery database along with the
name of the module, and then when Rangefinder runs, it will match on the two
instances of concat::fragment that you see in the output above.

What next?

So where do we go from here? This is actually several steps into a larger
metrics project. I’m sure you’ve connected the dots by now: so far, this tool
operates only on data that’s already public. You can already query the
Forge API, you can look at a module’s metadata.json
or source code, and you can query GitHub. That means this tool only makes
it more convenient to do what you could already do!

What if we had access to actual usage data? What kind of development decisions
would you make if you knew how many infrastructures are declaring what classes
of your modules? Or maybe what different platforms people are running your
modules on? Or which versions of your module people are running? Or maybe even
just how many people are using your module in their internal profile classes?

You won’t be surprised to know that I’m working on that also. It’s a much larger
project because there are a ton of privacy considerations that we had to address
before even thinking about asking people to enable telemetry.

Our two top design constraints while building the client were privacy and
transparency, and we’re now dogfooding it in our internal infrastructure to
watch for sensitive information leaking. Keep an eye out for another post soon
showing how that system works and how you can build your own tools to query the
data it gathers.

Installing and using

If you’ve made it this far, maybe you’d like to try it out. You can simply gem
install it and run it on the command line.

[~]$ gem install puppet-community-rangefinder
[~]$ rangefinder --help
Usage: rangefinder <paths>
Run this command with a space separated list of file paths in a module and it
will infer what each file defines and then tell you what Forge modules use it.
It will separate output by the modules that we KNOW will be impacted and those
which we can only GUESS that will be impacted. We can tell the difference based
on whether the impacted module has properly described dependencies in their
`metadata.json`. These are rendered as *exact match* and *near match*.
Note that non-namespaced items will always be near match only.
-r, --render-as FORMAT Render the output as human, summarize, json, or yaml
-v, --verbose Show verbose output
-d, --debug Show debugging messages
--shell Open a pry shell for debugging (must have Pry installed)
--version Show version number

Do let me know how this works for you, and if there are ways it could work better.
I’ll post next about the webhook version, which attaches this impact analysis
to your GitHub pull requests automatically so that you know how much impact
incoming PRs will have before merging them.