This is an introductory tutorial on working reproducibly in bioinformatics/genomics research.
This tutorial makes extensive use of the command-line interface.

We will be analysing RNA-seq/transcriptomics samples from yeast.
However, the biological background and relevance of the data are not important in the context of this tutorial.
The aim here is not to understand the data, or the bioinformatics tools for that matter, but rather to learn how to structure the analysis steps so that you or someone else can redo the analysis and arrive at the same results.
The bioinformatics tools we use here are real, and the analyses performed here can be applied to other datasets as well. However, every tool can be substituted with an alternative.

Note

The focus in this tutorial is not on the bioinformatics tools, but rather on the tools that facilitate reproducible analyses.

In this tutorial we will analyse public data stemming from a transcriptomics experiment (RNA-sequencing) using next-generation sequencing (NGS).
The associated publication is entitled “Dynamics of the Saccharomyces cerevisiae Transcriptome during Bread Dough Fermentation” [ASLANKOOHI2013].
The associated data have been deposited at the Sequence Read Archive (SRA) under accession PRJNA212389.
The final aim in this tutorial is to quantify the expression of genes in each sample.

Note

The data have already been downloaded and are available in a Git repository accompanying this tutorial. To keep analysis times short during this tutorial, the original data have been down-sampled.
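
How the data were down-sampled is not described here. As a rough illustration only, the sketch below assumes random subsampling of FASTQ records with a fixed seed, so that repeating the step yields the same subset. The function names `parse_fastq` and `downsample`, the seed value, and the example reads are hypothetical and not part of the tutorial's repository.

```python
import random


def parse_fastq(lines):
    """Yield (header, sequence, separator, quality) tuples from FASTQ lines."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it).rstrip("\n")
        sep = next(it).rstrip("\n")
        qual = next(it).rstrip("\n")
        yield header, seq, sep, qual


def downsample(records, fraction, seed=42):
    """Keep each record with probability `fraction`.

    The seed is fixed (hypothetical value) so that the same input
    always produces the same subset.
    """
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() < fraction]


# Made-up example reads, for illustration only.
example = """@read1
ACGTACGT
+
IIIIIIII
@read2
TTGCAAGC
+
IIII####
@read3
GGCATTAC
+
55555555
""".splitlines()

records = list(parse_fastq(example))
subset_a = downsample(records, fraction=0.5)
subset_b = downsample(records, fraction=0.5)
print(subset_a == subset_b)  # same seed, same subset
```

Recording the seed alongside the command is what makes such a step repeatable. For paired-end data, the same reads would have to be kept in both files, which this sketch does not handle.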

An overview of a typical RNA-seq experiment can be seen in Fig. 1.1.
RNA is extracted from the samples of interest, reverse transcribed into cDNA, and sequenced as a proxy for gene expression in the sample (either a set of cells or a single cell).

We will be using a traditional set-up for analysing RNA-seq data, in which the sequenced reads are cleaned and mapped to a reference genome, and finally the reads per transcript/gene model are counted.
The workflow is summarised in Fig. 1.2.
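
To give a feeling for what "cleaning" reads can mean, the sketch below filters reads by their mean base quality. It is not the tool used in this tutorial. It assumes Phred+33 encoded quality strings and a hypothetical threshold of 20, and the names `mean_phred` and `filter_reads` and the example reads are made up for illustration.

```python
def mean_phred(quality, offset=33):
    """Return the mean Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(records, min_mean_quality=20.0):
    """Keep records whose mean quality is at least `min_mean_quality`.

    Each record is a (header, sequence, separator, quality) tuple.
    The threshold is a hypothetical default, not a recommendation.
    """
    return [
        rec for rec in records
        if rec[3] and mean_phred(rec[3]) >= min_mean_quality
    ]


# Made-up example reads: 'I' encodes Phred 40, '#' encodes Phred 2.
reads = [
    ("@r1", "ACGT", "+", "IIII"),  # mean quality 40
    ("@r2", "ACGT", "+", "####"),  # mean quality 2
    ("@r3", "ACGT", "+", "II##"),  # mean quality 21
]

kept = filter_reads(reads)
print([rec[0] for rec in kept])  # ['@r1', '@r3']
```

Dedicated quality-control tools usually do more than this, for example trimming low-quality ends or removing adapter sequences, so this only illustrates the filtering idea.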

Fig. 1.2 The tutorial will analyse data using this workflow.

Quality control - We will be filtering reads based on read quality.

Read mapping - We will be using a tool for mapping reads to the yeast reference genome.