In many areas of natural and social science, as well as engineering, data
analysis involves a series of transformations: filtering, aggregating,
comparing to theoretical models, culminating in the visualization and
communication of results.
This process is rarely static, however, and
components of the analysis pipeline are frequently subject to replacement
and refinement, resulting in challenges for reproducing computational
results.
Describing data analysis as a directed network of transformations
has proven useful for translating between human intuition and computer
automation.
In the past I've evangelized extensively for GNU Make,
which takes advantage of this graph representation to enable incremental builds
and parallelization.

Snakemake is a next-generation tool based on this concept and designed
specifically for bioinformatics and other complex, computationally
challenging analyses.
I've started using Snakemake for my own data analysis projects, and I've
found it to be a consistent improvement, enabling more complex pipelines with
fewer of the "hacks" that are often necessary when using Make.

I've taught Make workshops in the past,
so, when I was invited to present to the Boise State University Joint
User Groups, I was excited to convert that tutorial to Snakemake.
Here, I've converted that tutorial into a blog post.
The original (and therefore this lesson as well) is inspired by the
Software Carpentry Make lesson,
to which I am also a contributor.

Some prior experience with the command line is assumed, and learners are
encouraged to follow along on their own computers.
The entire tutorial, including questions for the learner, is designed to
take 2 hours as a live-coded, Software Carpentry-style lesson.
A standalone lesson repository can be found here and is
licensed CC-BY.

Setup

If not already on your computer, install the following prerequisites.

A Bash terminal

Python 3.6 and the following packages:

snakemake

matplotlib

numpy

The lesson assumes the following programs are also installed, but
they're not absolutely necessary for the flow of the lesson,
and alternatives are widely available:

head

nano

dot

tree

This example directory should be downloaded to your desktop and navigated
into at the command line, e.g.:

```bash
git clone https://github.com/bsmith89/zipf-example
cd zipf-example
```

Motivation

Zipf's Law [10 minutes]

The most frequently occurring word occurs approximately twice as
often as the second most frequent word. This is
Zipf's Law.

Let's imagine that we're interested in testing Zipf's law in some of our
favorite books.
We've compiled our raw data: the books we want to analyze,
and have prepared several Python scripts that together make up our
analysis pipeline.

The tree command produces a handy tree-diagram of the directory.
(You may not have this program installed on your computer.)
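
If you have tree installed, run it from the project root. The listing below
is a sketch of what you should see, assuming the repository contains just the
books and scripts used in this lesson:

```bash
tree
```

```
.
├── books
│   ├── abyss.txt
│   ├── isles.txt
│   └── sierra.txt
└── scripts
    ├── plotcount.py
    └── wordcount.py
```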

Here you see that we're starting with a well-designed project directory.
The raw data (books) are stored in their own directory, and scripts have
informative names.

Let's take a look at our raw data:

```bash
head books/isles.txt
```

Our first step is to count the frequency of each word in a book.

```bash
scripts/wordcount.py books/isles.txt isles.tsv
```

Let's take a quick peek at the result:

```bash
head -5 isles.tsv
```

This shows us the top 5 lines in the output file:

```
the 3822 6.7371760973
of 2460 4.33632998414
and 1723 3.03719372466
to 1479 2.60708619778
a 1308 2.30565838181
```

Each row shows the word itself, the number of occurrences of that
word, and the number of occurrences as a percentage of the total
number of words in the text file.

We can do the same thing for a different book:

```bash
scripts/wordcount.py books/abyss.txt abyss.tsv
head -5 abyss.tsv
```

Finally, let's visualize the results.

```bash
scripts/plotcount.py isles.tsv ascii
```

The ascii argument has been added so that we get a text-based
bar-plot printed to the screen.

The script is also able to render a graphical bar-plot using matplotlib
and save the figure to a named file.

```bash
scripts/plotcount.py isles.tsv isles.png
```

Together these scripts implement a common workflow:

Read a data file.

Perform an analysis on this data file.

Write the analysis results to a new file.

Plot a graph of the analysis results.

Save the graph as an image, so we can publish it.

Writing a script to do our analysis [5 minutes]

Running this pipeline for one book is relatively simple at the command line.
But once the number of files and the number of steps in the pipeline
expand, this can turn into a lot of work.
Plus, no one wants to sit and wait for a command to finish, even just for 30
seconds.

The most common solution to the tedium of data processing is to write
a master script that runs the whole pipeline from start to finish.
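
A minimal analysis.sh, assembled from the commands shown above (the archive
step and exact tar flags are my assumptions), might look like:

```bash
#!/usr/bin/env bash
# Run the full Zipf pipeline from start to finish.
scripts/wordcount.py books/isles.txt isles.tsv
scripts/wordcount.py books/abyss.txt abyss.tsv
scripts/plotcount.py isles.tsv isles.png
scripts/plotcount.py abyss.tsv abyss.png
# Bundle the figures for publication (assumed archive step).
tar -czf zipf_results.tgz isles.png abyss.png
```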

It allows us to type a single command, bash analysis.sh, to
reproduce the full analysis.

It prevents us from repeating typos or mistakes.
You might not get it right the first time, but once you fix something
it'll (probably) stay that way.

What are the problems with this approach? [10 minutes]

A master script is a good start, but it has a few shortcomings.

Let's imagine that we adjusted the width of the bars in our plot
by editing scripts/plotcount.py;
in the function definition for
plot_word_counts, width = 1.0 is now width = 0.8.

Now we want to recreate our figures.
We could run bash analysis.sh again.
That would work, but it could also be a big pain if counting words takes
more than a few seconds.
The word counting routine hasn't changed; we shouldn't need to recreate
those files.

Alternatively, we could manually rerun the plotting for each word-count file
and recreate the archive.

But this process, and subsequently undoing it,
can be a hassle and source of errors in complicated pipelines.

What we really want is an executable description of our pipeline that
allows software to do the tricky part for us:
figuring out what steps need to be rerun.
It would also be nice if this tool encouraged a modular analysis
and let us reuse, rather than rewrite, parts of our pipeline.
As an added benefit, we'd like it all to play nice with the other
mainstays of reproducible research: version control, Unix-style tools,
and a variety of scripting languages.

Snakemake background [5 minutes]

Snakemake comes from a lineage of computer programs—most notably
Make—originally designed to
automate the compilation and installation of software.
Programs like Make automate the building of target files through a series of
discrete steps.
Despite the original purpose, this design makes it a great fit for
bioinformatics pipelines, which usually work by transforming data from one form
to another
(e.g. raw data → word counts → ??? → profit).

Snakemake is inspired by this approach, but designed specifically for
computationally intensive and/or complex data analysis pipelines.
The name is a reference to the programming language Python, which forms
the basis for the Snakemake syntax.
You don't need to be an expert at Python to use Snakemake, but a little
Python knowledge can sometimes be very useful.
There are pros and cons to using Snakemake versus any other analysis pipeline
tools, and it is worth considering other options, including:

GNU Make

doit

Galaxy

Tutorial

Writing and Running Snakefiles [10 minutes]

Let's get started writing a description of our analysis for Snakemake.
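
Open a file named Snakefile in your editor and add a rule like the following
(a sketch that matches the walkthrough below):

```python
rule wordcount_isles:
    input: "books/isles.txt"
    output: "isles.tsv"
    shell: "scripts/wordcount.py books/isles.txt isles.tsv"
```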

We have now written the simplest non-trivial snakefile.
The shell: line is pretty reminiscent of one of the lines from our master
script.
I bet you can already see what this snakefile means.

Let's walk through what we've written.
The first line uses the keyword rule followed by the name of our rule:
wordcount_isles.
We end that line with a colon.
All of the following lines in our rule are indented with four spaces.
The second line says that it takes an input file, using the input
keyword which is again followed by a colon.
We then give it the path to this prerequisite (books/isles.txt), wrapped in
quotes.
The third line does the same thing with the output file (isles.tsv).
And the last line is the exact shell command that we used in our shell script
earlier to create the target output file.
Like scripting, Snakemake allows us to wrap a series of shell commands, but
is more expressive and flexible than a script.

Our snakefile describes a (very short) pipeline:

We are generating a file called isles.tsv

Creating this file requires books/isles.txt

The command to create this file runs the script wordcount.py

We'll think about our pipeline as a network of files that are dependent
on one another.
Right now our Snakefile describes a pretty simple dependency graph.

books/isles.txt → isles.tsv

where the "→" is pointing from requirements to targets.

Running Snakemake

Now that we have a (currently incomplete) description of our pipeline,
let's use Snakemake to execute it.
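
Ask for the target by name:

```bash
snakemake isles.tsv
```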

By default, Snakemake prints a summary of the recipes that it
executes.

Let's see if we got what we expected.

```bash
head -5 isles.tsv
```

The first 5 lines of that file should look exactly like before.

Re-running Snakemake

Let's try running Snakemake the same way again.

```bash
snakemake isles.tsv
```

This time, instead of executing the same recipe,
Snakemake prints Nothing to be done.

What's happening here?

When you ask Snakemake to make isles.tsv it first looks at
the modification time of that target.
Next it looks at the modification time for the target's prerequisites.
If the target is newer than the prerequisites, Snakemake decides that
the target is up-to-date and does not need to be remade.

Much has been said about the shortcomings of using modification times as the
cue for remaking files.
This can be a Snakemake gotcha, so keep it in mind.

If you want to induce the original behavior, you only have to
change the modification time of books/isles.txt so that it is newer
than isles.tsv.

```bash
touch books/isles.txt
snakemake isles.tsv
```

The original behavior is restored.

Sometimes you only want Snakemake to tell you what it thinks about the
current state of your files.
snakemake --dryrun isles.tsv will print Snakemake's execution plan,
without actually carrying it out.
The flag can also be abbreviated as -n.

If you don't pass a target as an argument to snakemake (i.e. you run just
snakemake), it will build the target of the first rule in the
snakefile.
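
Next, let's add a rule to bundle our results into zipf_results.tgz.
The recipe below is a sketch (the staging directory and tar invocation are my
assumptions), but its shape is what matters here: multiple inputs and a
multi-line shell block:

```python
rule zipf_results:
    input: "isles.png", "abyss.png"
    output: "zipf_results.tgz"
    shell:
        """
        # Stage the figures, then bundle them into a compressed archive.
        mkdir -p zipf_results
        cp isles.png abyss.png zipf_results/
        tar -czf zipf_results.tgz zipf_results
        rm -r zipf_results
        """
```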

Here the recipe for zipf_results.tgz takes multiple input files,
each of which must be quoted and separated by commas, and involves
running a series of shell commands.
When building the archive, Snakemake runs each line in turn, stopping if
any of them returns an error.

Question

Without doing it, what happens if you run snakemake isles.png?

Challenge

What does the dependency graph look like for your Snakefile?

Try it

What happens if you run snakemake zipf_results.tgz right now?

Practice

Write a recipe for abyss.png.

Once you've written a recipe for abyss.png you should be able to
run snakemake zipf_results.tgz.

Let's delete all of our files and try it out.

```bash
rm abyss.* isles.*
snakemake zipf_results.tgz
```

Watch your terminal: you should see Snakemake run every step of the
pipeline, though the order of the recipes may differ.

Since you asked for zipf_results.tgz, Snakemake looked first for that file.
Not finding it, Snakemake looked for its prerequisites.
Since none of those existed, it remade the ones it could,
abyss.tsv and isles.tsv.
Once those were finished, it was able to make abyss.png and
isles.png before finally building zipf_results.tgz.

Try it

What happens if you touch abyss.tsv and
then snakemake zipf_results.tgz?

Running Snakemake in parallel

And check this out!

```bash
snakemake clean
snakemake --cores 2
```

Did you see it?
The --cores 2 flag (just "-j2" works too) tells Snakemake to run recipes in
up to two parallel jobs.
Our dependency graph clearly shows that
abyss.tsv and isles.tsv are mutually independent and can
both be built at the same time.
Likewise for abyss.png and isles.png.
If you've got a bunch of independent branches in your analysis, this can
greatly speed up your build process.

Phony targets

Sometimes we want to build a bunch of different files simultaneously.

```python
rule all:
    input: "isles.png", "abyss.png"
```

Even though this rule doesn't have a recipe, it does have prerequisites.
Now, when you run snakemake all, Snakemake will do what it needs to do to
bring both of those targets up to date.

It is traditional for "all" to be the first rule in a snakefile,
since the first rule is what is built by default
when no other target is passed as an argument.

Another traditional target is "clean".
Add the following to your snakefile.

```python
rule clean:
    shell: "rm --force *.tsv *.png zipf_results.tgz"
```

Running snakemake clean will now remove all of the cruft.

Diagramming the DAG [5 minutes]

(If you'd prefer not to bake this Snakefile from scratch, you can
get one we've been hiding in the oven the whole time:
cp .extra/Snakefile.1 Snakefile)
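
If you've been following along, your snakefile should now look roughly like
this sketch (the rule names, comments, and archive recipe are my
reconstruction; compare with .extra/Snakefile.1):

```python
# Test Zipf's law on our favorite books:
# count words, plot the counts, and archive the figures.

rule all:
    input: "isles.png", "abyss.png"

rule clean:
    shell: "rm --force *.tsv *.png zipf_results.tgz"

# Count the frequency of every word in a book.
rule wordcount_isles:
    input: "books/isles.txt"
    output: "isles.tsv"
    shell: "scripts/wordcount.py books/isles.txt isles.tsv"

rule wordcount_abyss:
    input: "books/abyss.txt"
    output: "abyss.tsv"
    shell: "scripts/wordcount.py books/abyss.txt abyss.tsv"

# Plot the word-count distribution for each book.
rule plot_isles:
    input: "isles.tsv"
    output: "isles.png"
    shell: "scripts/plotcount.py isles.tsv isles.png"

rule plot_abyss:
    input: "abyss.tsv"
    output: "abyss.png"
    shell: "scripts/plotcount.py abyss.tsv abyss.png"

# Bundle the figures for publication.
rule zipf_results:
    input: "isles.png", "abyss.png"
    output: "zipf_results.tgz"
    shell:
        """
        mkdir -p zipf_results
        cp isles.png abyss.png zipf_results/
        tar -czf zipf_results.tgz zipf_results
        rm -r zipf_results
        """
```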

Looks good, don't you think?
Notice the added comments, starting with the "#" character just like in
Python, R, shell, etc.

Using these recipes, a simple call to snakemake builds all the same files
that we were originally making either manually or using the master script, but
with a few bonus features.

Now, if we change one of the inputs, we don't have to rebuild everything.
Instead, Snakemake knows to only rebuild the files that, either directly or
indirectly, depend on the file that changed.
This is called an incremental build.
It's no longer our job to track those dependencies.
One fewer cognitive burden getting in the way of research progress!

In addition, a snakefile explicitly documents the inputs to and outputs
from every step in the analysis.
These are like informal "USAGE:" documentation for our scripts.

It is worth pointing out that our pipeline (and every pipeline) must be
acyclic: no file can be an input to itself or to any of its inputs, ad
infinitum.
Officially we talk about the relationships between files as a Directed Acyclic
Graph (DAG).
While earlier we took the time to diagram our DAG by hand, Snakemake
has tools for plotting this network automatically.

```bash
snakemake --dag zipf_results.tgz | dot -Tpng > dag.png
```

Open that file and check it out.

Diagrams like this one can be a very useful way to debug problems with an
analysis pipeline.

Don't repeat yourself

In many programming languages, the bulk of the language features are there
to allow the programmer to describe long-winded computational routines as
short, expressive, beautiful code.
Features in Python or R like user-defined variables and functions are
useful in part because they mean we don't have to write out (or think about)
all of the details over and over again.
This good habit of writing things out only once is known as the D.R.Y.
("Don't Repeat Yourself") principle.

In Snakemake, a number of features are designed to minimize repetitive code.
Our current snakefile does not conform to this principle,
but Snakemake gives us the tools to fix that.

Automatic variables [10 minutes]

Question

What are some of the repetitive components of our snakefile?

One overly repetitive part of our Snakefile: inputs and outputs appear in
both the header and the recipe of each rule.
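
Here's the isles word-count rule rewritten with placeholders (reconstructed
from the description that follows):

```python
rule wordcount_isles:
    input: "books/isles.txt"
    output: "isles.tsv"
    shell: "scripts/wordcount.py {input} {output}"
```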

Here we've replaced the input "books/isles.txt" in the recipe
with "{input}" and the output "isles.tsv" with "{output}".
Both "{input}" and "{output}" are placeholders that refer to all of the
prerequisites and targets of a rule, respectively.
In Snakemake, placeholders are wrapped in opening and closing curly braces,
and are replaced with the value of that variable at runtime.
If you are familiar with modern Python format strings, that's where the syntax
comes from.

That's a little less cluttered,
and still perfectly understandable once you know what the variables mean.
The best part is that if I want to change the input files, I only need to
edit my snakefile in one place.

Try it

```bash
snakemake clean
snakemake isles.tsv
```

You should get the same output as last time.
Internally, Snakemake replaced "{output}" with "isles.tsv"
and "{input}" with "books/isles.txt"
before running the recipe.

Practice

Go ahead and rewrite all of the rules in Snakefile to minimize
repetition and take advantage of the "{input}" and "{output}"
placeholders.

Wildcard Filenames [10 minutes]

Another deviation from D.R.Y.:
We have nearly identical recipes for abyss.tsv and isles.tsv.

It turns out we can replace both of those rules with a single rule,
by telling Snakemake about the relationships between filenames with
wildcards.
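
A single rule with a wildcard (the rule's name is my choice) can do the work
of both:

```python
rule wordcount:
    input: "books/{name}.txt"
    output: "{name}.tsv"
    shell: "scripts/wordcount.py {input} {output}"
```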

Here we've replaced the book name with the wildcard "{name}".
The "{name}" pattern matches whatever falls between "books/" and ".txt"
in the input filename, and it must expand to the same string in the output
filename.
You don't have to call your wildcard "name", and you should pick something
descriptive.

This rule can be interpreted as:

In order to build a file named [something].tsv (the target)
find a file named books/[that same something].txt (the prerequisite)
and run scripts/wordcount.py [the prerequisite] [the target].

Notice how helpful the automatic input/output variables were here.
This recipe will work no matter what stem is being matched!

Go ahead and make this change in your snakefile.

Try it

After you've replaced the two rules with one
rule using wildcards, try removing all of the products (snakemake clean)
and rerunning the pipeline.

Is anything different now that you're using the new, universal rule?

Practice

Replace the rules for abyss.png and isles.png
with a single rule.

Challenge

Add books/sierra.txt to your pipeline.

(i.e. snakemake all should plot the word counts and add the plots to
zipf_results.tgz)

Scripts as prerequisites [10 minutes]

We've talked a lot about the power of Snakemake for
rebuilding research outputs when input data changes.
When doing novel data analysis, however, it's very common for our scripts to
be just as dynamic as the data, if not more so.

What happens when we edit our scripts instead of changing our data?

Try it

First, run snakemake all so your analysis is up-to-date.

Let's change the default number of entries in the rank/frequency
plot from 10 to 5.

(Hint: edit the function definition for plot_word_counts in
scripts/plotcount.py to read limit=5.)

Now run snakemake all again. What happened?

As it stands, we have to run snakemake clean followed by snakemake all
to update our analysis with the new script.
We're missing out on the benefits of incremental analysis when our scripts
are changing too.

There must be a better way...and there is.
Scripts should be considered inputs too!

Let's edit the rule for {name}.png to include scripts/plotcount.py
as an input.
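
With Snakemake's named inputs, that rule might look like this (the rule name
is my choice; the placeholders match the explanation below):

```python
rule plotcount:
    input:
        script="scripts/plotcount.py",
        data="{name}.tsv"
    output: "{name}.png"
    shell: "{input.script} {input.data} {output}"
```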

This recipe works because "{input.script}" is replaced with
"scripts/plotcount.py"
and "{input.data}" with the appropriate expansion of "{name}.tsv".
When building abyss.png, for instance,
"{input.script} {input.data} {output}" becomes
"scripts/plotcount.py abyss.tsv abyss.png", which is exactly what we want.

Try it

What happens when you run the pipeline after modifying your script again?

(Changes to your script can be simulated with touch scripts/plotcount.py.)

Practice

Update your other rules to include the relevant scripts as inputs.

(Final snakefile: cp .extra/Snakefile.3 Snakefile)
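
For reference, here is a plausible sketch of that final snakefile (the rule
names, the sierra.txt additions from the earlier challenge, and the archive
recipe are my reconstruction):

```python
rule all:
    input: "zipf_results.tgz"

rule clean:
    shell: "rm --force *.tsv *.png zipf_results.tgz"

# Count words in any book; the script is an input,
# so editing it triggers a rebuild of the counts.
rule wordcount:
    input:
        script="scripts/wordcount.py",
        data="books/{name}.txt"
    output: "{name}.tsv"
    shell: "{input.script} {input.data} {output}"

# Plot the word-count distribution for any book.
rule plotcount:
    input:
        script="scripts/plotcount.py",
        data="{name}.tsv"
    output: "{name}.png"
    shell: "{input.script} {input.data} {output}"

# Bundle the figures for publication.
rule zipf_results:
    input: "isles.png", "abyss.png", "sierra.png"
    output: "zipf_results.tgz"
    shell:
        """
        mkdir -p zipf_results
        cp {input} zipf_results/
        tar -czf {output} zipf_results
        rm -r zipf_results
        """
```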

Conclusion [1 minute]

I hope that I've convinced you of the value of Snakemake for data analysis.
What I have shown you today barely scratches the surface of the software's
functionality;
I encourage you to check out the website.
In my experience, though, the topics we've gone over today already provide
90% of the benefits:
we can forget about script names
and intermediate steps and focus instead on the output files that we want.
This 'declarative' approach to pipelines has transformed the way I do data
analysis.
I think it can do the same for you.