Artifactory Prune Report

December 28, 2016

My team uses Artifactory as our binary repository. If you’re reading this, you probably use it too. It has this neat ability to prune unreferenced data. It’s an admin feature to be grateful for when you need it, but needing it at all means something’s busted in a bad way.

As the documentation points out:

Unreferenced binary files may occur due to running with wrong file system permissions on storage folders, or running out of storage space.
When you invoke this action, Artifactory removes unreferenced binary files and empty folders present in the filestore or cache folders.
To avoid such errors, we recommend that you always allow Artifactory to shut down completely.

Today’s goal is to find out what the ‘prune unreferenced data’ feature would delete, without actually deleting anything.

How I Learned About It

Well, today I discovered that I made a mistake. First time, right? Over the Christmas holiday, our Artifactory server got into a bad state. Ultimately it was out of disk space, but then I corrupted the database with some bad assumptions. Our team migrated Artifactory to a new server several months ago. Unfortunately, it appears that we migrated the database to internal storage and never noticed. Enter the red herring: on Christmas Eve we did a well-tested live migration of the new Artifactory server between VMware vCenters. Everything worked just fine, but it was a new process we’d never done before.

On the day after Christmas, while out of town and still on holiday, I saw an email come through from an overseas co-worker about the state of our Artifactory instance. It said Derby was throwing errors, and I’m like, ‘Derby? That’s the internal storage engine! We aren’t even using that!’ So I set off to see why the live migration had failed. I checked the configuration files and saw that, indeed, we were not using the database I thought we should be using. Eager to resolve this and get back to my vacation, I updated the configuration to use the external database, grabbing connection strings from an old backup. I even added the configuration changes to our Salt formula so I’d never have to dig them up again. Our DevOps team is all stateside, so naturally no one was available to corroborate my course of action.

Over the next two days, symptoms appeared right and left that I had made a bad call. To make matters worse, once I had the server running against the external database, I deleted the Derby files because we were out of space. And…no backups had been run since the migration, which was eight months ago. The backup storage location was never mounted on the new server. It’s painful to write about this. But that’s the point! Maybe, just maybe, I will never let my team do this again because I remember the pain of writing it up. I’ll just point them to this post and say: backups are important, and that blog post explains why!

Managing the Damage

In summary, our filestore has eight months of uploads that our database does not know about. If the ‘prune unreferenced data’ action were run, all of those files would disappear. That’s good and bad. We could get back disk space we aren’t using, but we could lose files we can’t easily re-create. What we need is a report! (Not really. That’s a joke. We do need data, though.)

Grab the API Data

Artifactory has a good REST API. The data I’m after is the complete list of files the database believes I should have, and the complete list of files that are actually on disk. I put together a short PowerShell snippet for the former that walks the REST API and writes out the sha1 checksum of every file the database knows about.
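The gist: the File List API (GET /api/storage/&lt;repo&gt;/?list&amp;deep=1) returns every file in a repository along with its sha1 checksum. For readers who don’t do PowerShell, here’s the same idea sketched in Python; the server URL, repository names, and token are placeholders you’d swap for your own, and the endpoint and auth header are per the Artifactory REST API docs (check them against your version):

```python
import json
import urllib.request

# Placeholders: point these at your own server and repositories.
BASE_URL = "https://artifactory.example.com/artifactory"
REPOS = ["libs-release-local", "libs-snapshot-local"]
API_TOKEN = "..."  # an admin API key

def extract_sha1s(file_list_response):
    """Pull the sha1 of every file entry out of a File List API response,
    skipping folder entries."""
    return {
        entry["sha1"]
        for entry in file_list_response.get("files", [])
        if not entry.get("folder", False)
    }

def fetch_repo_sha1s(repo):
    """Ask Artifactory for every file (and its sha1) in one repository."""
    url = BASE_URL + "/api/storage/" + repo + "/?list&deep=1&listFolders=0"
    req = urllib.request.Request(url, headers={"X-JFrog-Art-Api": API_TOKEN})
    with urllib.request.urlopen(req) as resp:
        return extract_sha1s(json.load(resp))

# Usage (against a live server):
#   known = set()
#   for repo in REPOS:
#       known |= fetch_repo_sha1s(repo)
#   with open("database.out", "w") as f:
#       f.write("\n".join(sorted(known)))
```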

Grab the Filesystem Data

With the database data in hand, you can get the list of files on disk. Our Artifactory server runs on Linux, so I just ran find . -ls > filestore.out from the root of the filestore to get that list: one line per file, with each path ending in the file’s sha1 hash.

Artifactory stores each file by its sha1 hash, sorted into subdirectories named for the first two characters of the hash. There’s no metadata or anything else useful here, as far as I can tell.
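Pulling the hashes out of that listing is a one-regex job. Here’s a Python sketch; it assumes the layout just described (a two-character prefix directory, then the 40-character hash as the filename) and ignores everything else in the find output:

```python
import re

# A filestore path at the end of each `find . -ls` line looks like
# ./ab/ab49f3... : two-character prefix directory, then the full 40-char sha1.
SHA1_PATH = re.compile(r"\./[0-9a-f]{2}/([0-9a-f]{40})$")

def filestore_sha1s(lines):
    """Extract the sha1 from each line of a `find . -ls` listing,
    skipping directories and anything that doesn't match the layout."""
    found = set()
    for line in lines:
        m = SHA1_PATH.search(line.rstrip())
        if m:
            found.add(m.group(1))
    return found

# Usage:
#   with open("filestore.out") as f:
#       on_disk = filestore_sha1s(f)
```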

If you were running Artifactory on Windows, you could run dir -Recurse > filestore.out (dir being PowerShell’s alias for Get-ChildItem) to get a similar list.

Merge

Now we have to merge the two lists. I copied my filestore.out file back over to my workstation and then iterated over the filestore list, which is unstructured and larger, checking each entry against the set of hashes from the API. We had 12,000 files reported from the API, but 33,000 from the filestore.
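My script was PowerShell, but the merge itself is simple enough to sketch in Python: load the API hashes into a set, stream the filestore listing, and flag each hash as known or orphaned. The filenames database.out (one sha1 per line, from the API step) and filestore.out are assumptions from the steps above:

```python
import re

# Any line whose path ends in a 40-character hex name is a filestore file.
SHA1_AT_END = re.compile(r"([0-9a-f]{40})$")

def classify(filestore_lines, known_sha1s):
    """Split filestore entries into hashes the database knows about
    and orphans that 'prune unreferenced data' would delete."""
    known, orphaned = [], []
    for line in filestore_lines:
        m = SHA1_AT_END.search(line.rstrip())
        if not m:
            continue  # directories and other non-filestore paths
        sha1 = m.group(1)
        (known if sha1 in known_sha1s else orphaned).append(sha1)
    return known, orphaned

# Usage:
#   with open("database.out") as f:
#       db = {line.strip() for line in f if line.strip()}
#   with open("filestore.out") as f:
#       known, orphaned = classify(f, db)
#   print(len(known), "matched;", len(orphaned), "orphaned")
```

Set membership makes the lookup cheap, so the run time is dominated by just reading the 33,000-line listing.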

Of the results, exactly one file also existed in the database. Ouch. I’m not off to a very good start. Once my data is available, I will have to figure out how to classify the contents of each filestore file without a matching database entry. I’ll save that for another post.

Lessons Learned

First, always do backups. Always. Don’t ever, ever not do backups. If you have backups, don’t turn them off. Don’t disable them and promise to fix them later without a plan to actually do it. And of course, test your backups; you can’t test them if you don’t have them! Second, communication with teammates is so important: had I waited for someone to corroborate my course of action, this might have been caught. Third, automate configuration management. While it may not help restore lost data, it builds confidence in a course of action and helps prevent this kind of drift from occurring in the first place. And just as with backups, don’t procrastinate on it.