Snakemake is a workflow engine, developed for creating scalable bioinformatics and genomics workflows [KOESTER2012].
It borrows ideas from Make, an old system for compiling and linking programs, and extends them to help with bioinformatics pipelines.
We will be using Snakemake to run our complete analysis for us, from start to finish.
Snakemake will make sure that our jobs are run in the correct order and will recognise if jobs have already been run and thus do not have to be run again.
It will also recognise if input files changed and thus jobs have to be re-run.
Correctly configured, Snakemake will take care of error logging, benchmarking, and simultaneous execution of our jobs (it is also able to distribute jobs to computer clusters).
We will see that, combined with conda, it makes for a powerful system for developing and running reproducible analysis workflows.

This directory contains all the files we need to do this tutorial.
There are four fastq-files in the fastq directory that we want to clean and map to the reference genome.
Finally, we will count the reads per gene and per sample.
The complete workflow is depicted in Fig. 4.1.

Note

The repository already contains the downloaded genome, the annotation, and the samples. This could be done via Snakemake as well, but that goes beyond the scope of this tutorial. Should you be interested in how this was done, you can have a look here and here.

We can of course do this analysis without any workflow management system and write down the commands one by one.
Given that we only have four samples, this is not particularly difficult.
However, this process does not scale well if we decide to do it for 400 samples.
So we are later going to use a workflow management system that creates the commands for each sample for us, without us writing them out explicitly.
However, here we are going to look at each step and write down the command for one of the samples (SRR941826) to understand what is involved.

The purpose of this step is to remove bases from the ends of the reads that are of bad quality.
There are many tools that can do this for us.
Here, we are going to use the program Sickle to perform this task [JOSHI2011].
The program is easy to use.

After installing Sickle, running sickle by itself will print the help:

# install sickle
$ conda install sickle-trim
$ sickle

Sickle is run with either the single-end (se) or paired-end (pe) subcommand. Our reads are single-end, so we use se.
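For sample SRR941826, the trimming call could look like the following sketch; the quality type (-t) and gzip output (-g) flags follow sickle's documented options, while the output path is an illustrative choice. The other three samples are processed analogously.

$ sickle se -g -t sanger \
    -f fastq/SRR941826.fastq.gz \
    -o analyses/results/SRR941826.trimmed.fastq.gz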

Great, we have done our analysis of all four samples.
Now, we can export our conda environment and save the information in a file:

$ conda activate tutorial
$ conda env export > tutorial.yaml

This file contains the tools and their versions that we used in this analysis.
We could give this file to someone else and they could, given they work on the same operating system, recreate the same conda environment and redo the analysis.

So, in general, we could use the programs on the command line as shown above, one sample after the other.
However, we do not want to do this, as this way we would not save the information about which commands we used.
We could of course put all commands in a bash script and in this manner keep a record of all commands that have been run.

However, we would still not be able to keep this approach general enough to run it again and again on different numbers of samples.
We can do better!
This is where Snakemake comes into play.

We will now engineer this workflow in the workflow management system Snakemake.
It will be general, with no filenames hard-coded, so that we can run the same workflow on an arbitrary number of fastq files of the same type, here single-end reads.

For our trimming step, we have an input file and an output file, as well as a way to get from the input to the output via a shell command (see the rule sketch below).
Rules can either use shell commands, plain Python code or external scripts to create output files from input files.
Curly brackets define wildcards that get substituted.

Once you have written lots of rules, Snakemake determines the rule dependencies by matching file names.

The command in the shell keyword section will be used to trim the data.
However, before it is executed, Snakemake will replace the wildcards in the command with the proper values defined in the other sections of the rule.
Of note, we also introduce a keyword, params, with which one can add more flexibility to the values that get substituted into the shell command.
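Put together, such a rule might look like the following sketch (the sickle flags mirror the command-line call above; the params value is an illustrative choice):

rule trimse:
    input:
        "fastq/{sample}.fastq.gz"
    output:
        "analyses/results/{sample}.trimmed.fastq.gz"
    params:
        qualtype = "sanger"
    shell:
        "sickle se -g -t {params.qualtype} -f {input} -o {output}"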

Fine, but what is happening with the strange {sample} wildcard?
The wildcard will be replaced by Snakemake to try and match our requested final targets.
However, we have not defined any targets yet.

Snakemake needs to know what to run this rule for.
We need to define the result or target files that we want to create.

Note

Snakemake works by matching file-names, i.e. finding rules that can generate the requested files from other files.

Let's define some targets.
We are creating a pseudorule (all) that only defines inputs, which are our expected final targets.
In this way, Snakemake finds the rule all and tries to figure out which rules can be run to create the desired output target files.
It will find that our rule trimse can accomplish this when run four times with four different input files by substituting the {sample} wildcard.
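A sketch of such a pseudorule, with the target file for SRR941826 written out explicitly:

rule all:
    input:
        "analyses/results/SRR941826.trimmed.fastq.gz",
        # ... plus the corresponding files for the other three samples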

We defined four target files.
Let us look at what happens for one of these target files when you run Snakemake.

1. Snakemake finds that you request to create the file analyses/results/SRR941826.trimmed.fastq.gz.

2. Snakemake will scan all rules to identify which rule can create this file (Snakemake will try to substitute any wildcards like {sample} and try to match file names).

3. In our case, it will find that substituting {sample} in the output section of rule trimse (analyses/results/{sample}.trimmed.fastq.gz) with SRR941826 matches the requested target file name analyses/results/SRR941826.trimmed.fastq.gz.

4. Snakemake will check whether the input file of rule trimse (fastq/{sample}.fastq.gz), with the {sample} part likewise substituted with SRR941826 (fastq/SRR941826.fastq.gz), exists.

5. If this file can be found, the rule will be scheduled for execution. If the input cannot be found, Snakemake will complain that the required input for creating the requested file is missing.

Ok, this Snakefile can already be run with the snakemake command.
We can do a dry-run to see what Snakemake would do, without actually running anything.
We use the snakemake flag -n for a dry-run and -p to print the shell commands that would be executed:
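$ snakemake -np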

Nice, Snakemake correctly substituted the file names in input and output to create the correct commands for trimming.

Note

We used explicit file names for the expected target files in rule all, based on the input sample file names we knew existed in the fastq directory. However, we want to stay general and be able to process any files located in the fastq directory without explicitly listing them. Snakemake should identify the input files and create the expected target file names automatically, based on the input file names.

Let's rewrite it a bit to make the workflow more general, so that any file located in the fastq directory that has the right file name structure is found by Snakemake:

We use a built-in function, glob_wildcards, to scan the fastq directory for files of a certain structure and to extract the sample identifiers from the file names.

We use another built-in function, expand, in rule all to create a “target” file name for each sample identifier collected in the step before.
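Sketched out, these two pieces could look like this (the patterns must mirror the paths used in rule trimse):

SAMPLES, = glob_wildcards("fastq/{sample}.fastq.gz")

rule all:
    input:
        expand("analyses/results/{sample}.trimmed.fastq.gz", sample=SAMPLES)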

Running the same snakemake -np command again will yield the same result as in the explicit case.
However, now it would not matter if we added another 100 files to the fastq directory.
Snakemake would find them, without us doing anything else.

We add another keyword (log), specify the file where we want the logging to end up (of note: it also contains the same wildcard as the input and output files, and thus gets substituted as well), and add an error redirection to {log} in the shell command.

We add another keyword (benchmark) and specify the file where we want the benchmarking data to end up. We do not do anything else; Snakemake will take care of the benchmarking for us.
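Added to our rule, the two keywords might look like this sketch (the log and benchmark paths are illustrative; note that the shell command is now split over several lines):

rule trimse:
    input:
        "fastq/{sample}.fastq.gz"
    output:
        "analyses/results/{sample}.trimmed.fastq.gz"
    params:
        qualtype = "sanger"
    log:
        "analyses/logs/{sample}.trimse.log"
    benchmark:
        "analyses/benchmarks/{sample}.trimse.txt"
    shell:
        "sickle se -g -t {params.qualtype} "
        "-f {input} -o {output} "
        "2> {log}"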

Note

You should realise that the shell command can be written over several lines, as in the example above.

If you rerun the snakemake command, logging and benchmarking will be visible in the execution plan of Snakemake.
Because we specified the same wildcard in the benchmark file name, it gets substituted with the sample identifier too.

Since we are after reproducibility, we need to integrate package management into the workflow.
This is easily done with Snakemake using conda.
We can add another keyword to our rule that specifies the conda environment to be activated before running that particular rule.
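The addition to the rule itself is a single keyword pointing at an environment file (the path is interpreted relative to the Snakefile):

    conda:
        "envs/sickle.yaml"

For the environments to actually be created and used, snakemake has to be invoked with the --use-conda flag.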

You can find a minimal conda environment file for sickle in the envs directory: sickle.yaml.
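Its content is presumably along these lines (the channel list is an assumption; the pinned version matches the one stated below):

channels:
  - bioconda
  - conda-forge
dependencies:
  - sickle-trim=1.33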

The file specifies that the rule should be run with sickle version 1.33.
Before running anything, Snakemake will create the environment and store it in the current working directory in a subdirectory of .snakemake.
It will only be recreated if the yaml file changes.
Otherwise, if Snakemake is rerun it will use the already created environment.

Internally, Snakemake is creating a directed acyclic graph (DAG) of the rules and their dependencies.
We can generate a graphical visualisation of this graph, and thus of the workflow (Fig. 4.4), using Snakemake and Graphviz:

$ snakemake --dag | dot -Tpng > dag.png

Fig. 4.4 The workflow v5 as a directed acyclic graph.

Note

This visualisation becomes bigger and bigger with more and more samples.

On line 5, we change the final output to bam files, so that the mapping is run for all samples.

The shell command to index the genome with BWA does not create an output we can use directly. Thus, we create a pseudo output. On line 28, we use the function touch to create an empty file after the rule has run successfully. We require this file as input for the map rule on line 41, so that the indexing rule runs before any mapping happens.

Line 41 also shows that we can have more than one input to a rule.

We see a new keyword in a rule on line 48: threads. This can be used to specify the number of threads allowed for the rule when running snakemake with the --jobs NUM flag.

Lines 54 and 55 show that we can easily chain shell commands in a Snakemake rule.
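The listing itself is not reproduced here, but the parts just described might look roughly like the following sketch; the genome paths, the indexing rule name, the BWA invocation, and the samtools sort step are illustrative assumptions, and the line numbers in the comments refer to the original listing:

# final targets are now bam files (line 5)
rule all:
    input:
        expand("analyses/results/{sample}.sorted.bam", sample=SAMPLES)

rule bwaindex:
    input:
        "genome/genome.fa"
    output:
        # pseudo output: touch() creates an empty marker file on success (line 28)
        touch("genome/genome.fa.indexed")
    shell:
        "bwa index {input}"

rule map:
    input:
        # requiring the marker file forces indexing to run before mapping (line 41)
        idx = "genome/genome.fa.indexed",
        fastq = "analyses/results/{sample}.trimmed.fastq.gz"
    output:
        "analyses/results/{sample}.sorted.bam"
    threads: 4  # line 48
    shell:
        # two chained commands: map, then sort the alignments (lines 54 and 55)
        "bwa mem -t {threads} genome/genome.fa {input.fastq} "
        "| samtools sort -o {output} -"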

We are left with one requirement that we wanted to address in this tutorial:

Publishing and versioning the workflow information, so as to keep track of when workflows change and what the changes were.

After we have created our workflow and our tool specifications in the form of yaml files, we can make sure that others are able to easily obtain them.
The easiest way to facilitate this, is to put your directory under Git version control and push your repository to a remote provider like GitLab or GitHub.
The URL of this repository can be added to manuscripts or sent via email to collaborators, so that others can easily download the repository and redo the analysis with the same settings. With this, we fulfil our last requirement above.

It might be tempting to add the input data to the repository as well. However, in most cases it will be very big, and you are better off using a remote storage location for the input data, e.g. Dropbox, Google Cloud Storage, FTP, etc. Snakemake can deal with all of these remote locations and can download data on demand. An example can be seen in the file “Snakemake_v8” (see also below), where I request the input data from a Google Cloud Storage bucket. Of course, there might be fees associated with storing data in the cloud. For proper reproducibility, it should be a persistent location. For genomic data, free storage solutions like the NCBI Sequence Read Archive or Gene Expression Omnibus can be used.