Sometimes “bad” data gets committed to Pachyderm and you need a way to delete
it. There are several ways to address this, depending on what exactly was “bad”
about the data you committed and what has happened in the system since you
committed it.

Deleting the HEAD of a branch - You should
follow this guide if you’ve just made a commit to a branch with some corrupt, incorrect,
or otherwise bad changes to your data.

Deleting non-HEAD commits - You should follow
this guide if you’ve committed data to the branch after committing the data that
needs to be deleted.

Deleting sensitive data - You should follow these
steps when you have committed sensitive data that you need to completely
purge from Pachyderm, such that no trace remains.

The simplest case is when you’ve just made a commit to a branch with some
incorrect, corrupt, or otherwise bad data. In this scenario, the HEAD of your branch
(i.e., the latest commit) is bad. Users who read from it are likely to be misled, and
pipelines subscribed to it are likely to fail or produce bad downstream output.

To fix this you should use delete-commit as follows:

$ pachctl delete-commit <repo> <branch-or-commit-id>

When you delete the bad commit, several things will happen (all atomically):

The commit metadata will be deleted.

Any branch that the commit was the HEAD of will have its HEAD set to the
commit’s parent. If the commit’s parent is nil, the branch’s HEAD will be set
to nil.

If the commit has children (commits which it is the parent of), those
children’s parent will be set to the deleted commit’s parent. Again, if the
deleted commit’s parent is nil, then the children’s parent will be
set to nil.

Any jobs which were created due to this commit will be deleted (running jobs
get killed). This includes jobs which don’t directly take the commit as
input, but are farther downstream in your DAG.

Output commits from deleted jobs will also be deleted, and all the above
effects will apply to those commits as well.

Recovering from commits of bad data is a little more complicated if you’ve
committed more data to the branch after the bad data was added. You can
still delete the commit as in the previous section. However, unless the subsequent
commits overwrote or deleted the bad data, it will still be present in the
child commits, because deleting a commit does not modify its children.

In Git terms, delete-commit is equivalent to squashing a commit out of existence.
It’s not equivalent to reverting a commit. The reason for this behavior is that the
semantics of revert can get ambiguous when the files being reverted have been
otherwise modified. Git’s revert can leave you with a merge conflict to solve,
and merge conflicts don’t make sense with Pachyderm due to the shared nature of
the system and the size of the data being stored.

In this scenario, you can also delete the child commits. However, those commits
may also contain good data that you don’t want to delete. If so, you should:

Start a new commit on the branch with pachctl start-commit.

Delete all bad files from the newly opened commit with pachctl delete-file.

Finish the commit with pachctl finish-commit.

Delete the initial bad commits and all children up to the newly finished
commit.
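
The steps above can be sketched as two small shell functions. This is a minimal sketch under stated assumptions, not a definitive procedure. The repo name images, the branch master, and the file paths are hypothetical. The sketch also assumes that the open commit can be addressed by its branch name in delete-file and finish-commit. By default the script only prints the commands (a dry run). Set DRY_RUN=0 to run them against a real cluster.

```shell
#!/bin/sh
# Sketch only: repo, branch, and file names below are hypothetical placeholders.
# With DRY_RUN=1 (the default) each command is printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# fix_branch REPO BRANCH FILE...
# Steps 1-3: open a commit on BRANCH, delete each bad FILE, finish the commit.
# Assumption: the open commit can be referenced by the branch name.
fix_branch() {
    repo="$1"
    branch="$2"
    shift 2
    run pachctl start-commit "$repo" "$branch"
    for f in "$@"; do
        run pachctl delete-file "$repo" "$branch" "$f"
    done
    run pachctl finish-commit "$repo" "$branch"
}

# delete_commits REPO COMMIT_ID...
# Step 4 (optional): delete the bad commits by ID, one at a time.
delete_commits() {
    repo="$1"
    shift
    for c in "$@"; do
        run pachctl delete-commit "$repo" "$c"
    done
}

# Example dry run with hypothetical names.
fix_branch images master /bad-1.png /bad-2.png
```

To carry out the last step, pass the IDs of the bad commits to delete_commits, for example delete_commits images <commit-id> <commit-id>.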

Depending on how you’re using Pachyderm, the final step may be optional. After
you finish the “fixed” commit, the HEADs of all your branches will converge to
correct results as downstream jobs finish. However, deleting those commits
allows you to clean up your commit history and ensures that no one will ever
access errant data when reading non-HEAD versions of the data.

If the data you committed is bad because it’s sensitive and you want to make
sure that nobody ever accesses it, you should complete an extra step in addition to those
above.

Pachyderm stores data in a content-addressed way. When you delete
a file or a commit, Pachyderm only deletes references to the underlying data; it
doesn’t delete the actual data until it performs garbage collection. To truly
purge the data, you must delete all references to it using the methods described
above, and then you must run garbage collection with pachctl garbage-collect.
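
The two-phase purge described above (remove the references, then reclaim the data) might look like the following. This is a minimal sketch. The repo name and the branch passed in are hypothetical, and it assumes the sensitive data is referenced only by the commit being deleted. If other commits still reference the data, delete or fix them first using the methods above. The PACHCTL variable is an illustrative convenience, not part of Pachyderm. It defaults to printing the commands rather than running them.

```shell
#!/bin/sh
# Sketch only: the repo name and branch used below are hypothetical.
# PACHCTL defaults to "echo pachctl", so commands are printed, not executed.
# Set PACHCTL=pachctl to run them against a real cluster.
PACHCTL="${PACHCTL:-echo pachctl}"

# purge_commit REPO BRANCH_OR_COMMIT_ID
# Deletes the commit (a reference to the data), then runs garbage collection
# so that data no longer referenced by anything is actually removed.
purge_commit() {
    # $PACHCTL is intentionally unquoted so "echo pachctl" splits into words.
    $PACHCTL delete-commit "$1" "$2" && $PACHCTL garbage-collect
}

# Example dry run with hypothetical names.
purge_commit images master
```

Garbage collection runs only if the delete succeeds, so a failed delete does not look like a completed purge.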