Linux command line exercises for NGS data processing

The purpose of this tutorial is to introduce students to the tools frequently used in NGS analysis, as well as to give them experience in writing one-liners.
Copy the required files to your current directory, change directory (cd) to the linuxTutorial folder, and do all the processing inside:
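The copy command depends on where the course files live on your system; /export/home/linuxTutorial below is purely a hypothetical path:

[uzi@quince-srv2 ~]$ cp -r /export/home/linuxTutorial .
[uzi@quince-srv2 ~]$ cd linuxTutorial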

I have deliberately chosen Awk for the exercises as it is a language in itself and is used more often to manipulate NGS data than other command line tools such as grep, sed, and perl. Furthermore, having a good command of awk will make it easier to follow advanced tutorials such as the Illumina Amplicons Processing Workflow.

In Linux, we use a shell: a program that takes your commands from the keyboard and passes them to the operating system. Most Linux systems use the Bourne Again SHell (bash), but a typical Linux system also offers several other shells such as ksh, tcsh, and zsh. To see which shell you are using, type

[uzi@quince-srv2 ~/linuxTutorial]$ echo $SHELL
/bin/bash

To see where you are in the file system:

[uzi@quince-srv2 ~/linuxTutorial]$ pwd
/home/uzi/linuxTutorial

List the files in the current directory:

[uzi@quince-srv2 ~/linuxTutorial]$ ls
data

Now try different commands from the sheet given below:

Linux Commands Cheat Sheet

File System

ls — list items in current directory

ls -l — list items in current directory and show in long format to see permissions, size, and modification date

ls -a — list all items in current directory, including hidden files

ls -F — list all items in current directory and show directories with a slash and executables with a star

ls dir — list all items in directory dir

cd dir — change directory to dir

cd .. — go up one directory

cd / — go to the root directory

cd ~ — go to your home directory

cd - — go to the last directory you were just in

pwd — show present working directory

mkdir dir — make directory dir

rm file — remove file

rm -r dir — remove directory dir recursively

cp file1 file2 — copy file1 to file2

cp -r dir1 dir2 — copy directory dir1 to dir2 recursively

mv file1 file2 — move (rename) file1 to file2

ln -s file link — create symbolic link to file

touch file — create or update file

cat file — output the contents of file

less file — view file with page navigation

head file — output the first 10 lines of file

tail file — output the last 10 lines of file

tail -f file — output the contents of file as it grows, starting with the last 10 lines

vim file — edit file

alias name='command' — create an alias for a command (bash syntax)

System

shutdown — shut down machine

reboot — restart machine

date — show the current date and time

whoami — who you are logged in as

finger user — display information about user

man command — show the manual for command

df — show disk usage

du — show directory space usage

free — show memory and swap usage

whereis app — show possible locations of app

which app — show which app will be run by default

Process Management

ps — display your currently active processes

top — display all running processes

kill pid — kill process id pid

kill -9 pid — force kill process id pid

Permissions

ls -l — list items in current directory and show permissions

chmod ugo file — change permissions of file to ugo, where u is the user's (owner's) permissions, g is the group's permissions, and o is everyone else's permissions. Each of u, g, and o is a digit between 0 and 7, formed by adding read (4), write (2), and execute (1):

7 — full permissions

6 — read and write only

5 — read and execute only

4 — read only

3 — write and execute only

2 — write only

1 — execute only

0 — no permissions

chmod 600 file — you can read and write - good for files

chmod 700 file — you can read, write, and execute - good for scripts

chmod 644 file — you can read and write, and everyone else can only read - good for web pages

chmod 755 file — you can read, write, and execute, and everyone else can read and execute - good for programs that you want to share

Exercise 1: Extracting reads from a FASTA file based on supplied IDs

Awk is a programming language that allows easy manipulation of structured data and is mostly used for pattern scanning and processing. It searches one or more files to see if they contain lines that match the specified patterns and then performs the associated actions.
The basic syntax is:

awk '/pattern1/ {Actions}
/pattern2/ {Actions}' file

Awk works as follows:

Awk reads the input files one line at a time.

For each line, it tries the given patterns in order; if a pattern matches, it performs the corresponding action.

If no pattern matches, no action is performed.

In the above syntax, either the search pattern or the action is optional, but not both.

If the search pattern is not given, Awk performs the given actions on every line of the input.

If the action is not given, the default action is to print every line that matches the given pattern.

Empty braces without any action do nothing; they suppress even the default printing.

Awk has a number of built-in variables. It splits each record (i.e., line) on whitespace by default and stores the fields in the $n variables: if the line has 5 words, they are stored in $1, $2, $3, $4, and $5, while $0 represents the whole line. NF is a built-in variable holding the total number of fields in a record.
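For example, a quick way to see these variables in action (any whitespace-separated input will do):

[uzi@quince-srv2 ~/linuxTutorial]$ echo "a b c d e" | awk '{print NF, $1, $NF}'
5 a e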

We can also use the conditional (ternary) operator in a print statement, of the form CONDITION ? PRINT_IF_TRUE_TEXT : PRINT_IF_FALSE_TEXT. For example, in the code below, we identify sequences with lengths > 14:
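A minimal sketch (test.fasta is an assumed name for the tutorial's FASTA file; the parentheses around the ternary keep awk's parser happy):

awk '!/^>/ {print ((length($0) > 14) ? "long: " : "short: ") $0}' test.fasta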

We can also put 1 after the last {} block to print everything (1 is shorthand for {print $0}, which becomes {print}, since print outputs $0 by default when given no argument). Within the preceding block we can change $0; for example, to assign the first field to $0 on the third line (NR==3), we can use:
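A sketch of this idiom (test.fasta assumed again): every line is printed, but line 3 is first replaced by its own first field:

awk 'NR==3 {$0=$1} 1' test.fasta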

You can have as many blocks as you want, and they will be executed on each line in the order they appear. For example, if we want to print $1 three times (here we use printf instead of print, as printf doesn't append an end-of-line character):
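For instance (test.fasta assumed; the final print supplies the newline):

awk '{printf "%s ", $1} {printf "%s ", $1} {print $1}' test.fasta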

You can also use getline to load the contents of another file in addition to the one you are reading. For example, in the statement given below, the while loop loads each line from test.tsv into k until no more lines are left to read:
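A sketch (with only a BEGIN block, awk reads no other input, so test.tsv is the sole source here):

awk 'BEGIN {while ((getline k < "test.tsv") > 0) print k}'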

After looking at the file, it is immediately clear that the sequences may span multiple lines (for example, for blah_C1). If we want to match an ID, we can first linearize the file, using the conditional operator discussed above, so that each sequence's information sits on one tab-delimited line; further logic can then operate on each line.
Our logic is that for lines containing header information (/^>/) we do something different, and for all other lines we use printf to strip the newline character:
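A sketch of the linearization step (test.fasta assumed; the ternary avoids printing a leading blank line):

awk '/^>/ {printf("%s%s\t", (NR==1 ? "" : "\n"), $0)} !/^>/ {printf("%s", $0)} END {printf("\n")}' test.fasta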

We can then pipe this stream to another awk statement using "\t" as a delimiter (which you can specify using -F) and use gsub to remove > from the start of each line, since our IDs file doesn't contain that character:
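Putting the two stages together (IDs.txt is a placeholder name for the supplied IDs file, one ID per line):

awk '/^>/ {printf("%s%s\t", (NR==1 ? "" : "\n"), $0)} !/^>/ {printf("%s", $0)} END {printf("\n")}' test.fasta \
  | awk -F"\t" 'BEGIN {while ((getline k < "IDs.txt") > 0) ids[k]=1}
      {header=$1; gsub(/^>/, "", header); if (header in ids) printf("%s\n%s\n", $1, $2)}'

Each matching record is printed back out as two FASTA lines: the original header and its now single-line sequence.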

Exercise 2: Alignment Statistics for Metagenomics/Population Genomics

For this exercise we will use a C. difficile ribotype 078 reference database that comprises 61 contigs. Even though the samples come from a single genome, the workflow given below stays much the same for metagenomic samples where you have complete genomes instead of contigs in the reference database (hence the nomenclature: genomes/contigs).
Before we analyze our samples, we can do some quality control checks on our raw sequences using FastQC.
Running the following command will generate an M120_S2_L001_R1_001_fastqc folder with an HTML page fastqc_report.html inside.
You can load it up in your browser to assess your data through graphs and summary tables.
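A sketch of the invocation (recent FastQC versions may need the --extract flag so the unpacked folder, rather than a zip file, is left behind):

[uzi@quince-srv2 ~/linuxTutorial/data]$ fastqc M120_S2_L001_R1_001.fastq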

For example, the report generated for the above M120_S2_L001_R1_001.fastq file summarizes the data through graphs and summary tables.
Alternatively, you can try my Shell utilities for QC as well as Shell wrappers for EMBOSS utilities.
Next we index our reference database file. Indexing speeds up alignment, allowing the aligner to quickly find short, near-exact matches to use as seeds for subsequent full-alignments.

[uzi@quince-srv2 ~/linuxTutorial/data]$ bwa index Cdiff078.fa

Use BWA-MEM to align paired-end sequences. Briefly, the algorithm works by seeding alignments with maximal exact matches (MEMs) and then extending seeds with the affine-gap Smith-Waterman algorithm (SW).
The BWA documentation suggests that for 70bp or longer Illumina, 454, Ion Torrent, and Sanger reads, assembly contigs, and BAC sequences, BWA-MEM is usually the preferred algorithm. For short sequences, BWA-backtrack may be better. BWA-SW may have better sensitivity when alignment gaps are frequent.
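A sketch of the alignment step (the R2 filename is an assumption, mirroring the R1 file named above):

[uzi@quince-srv2 ~/linuxTutorial/data]$ bwa mem Cdiff078.fa M120_S2_L001_R1_001.fastq M120_S2_L001_R2_001.fastq > aln-pe.sam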

We have generated a SAM file (aln-pe.sam) which consists of two types of lines: headers and alignments. Headers begin with @, and provide metadata regarding the entire alignment file.
Alignments begin with any character except @, and describe a single alignment of a sequence read against the reference genome.
Note that each read in a FASTQ file may align to multiple regions within a reference genome, and an individual read can therefore result in multiple alignments.
In the SAM format, each of these alignments is reported on a separate line. Also, each alignment has 11 mandatory fields, followed by a variable number of optional fields. Each of the fields is described in the table below:

Col  Field  Description

1    QNAME  Query template/pair NAME
2    FLAG   bitwise FLAG
3    RNAME  Reference sequence NAME
4    POS    1-based leftmost POSition/coordinate of clipped sequence
5    MAPQ   MAPping Quality (Phred-scaled)
6    CIGAR  extended CIGAR string
7    MRNM   Mate Reference sequence NaMe (= if same as RNAME)
8    MPOS   1-based Mate POSition
9    TLEN   inferred Template LENgth (insert size)
10   SEQ    query SEQuence on the same strand as the reference
11   QUAL   query QUALity (ASCII-33 gives the Phred base quality)
12+  OPT    variable OPTional fields in the format TAG:VTYPE:VALUE

where FLAG is defined as:

Flag    Chr  Description

0x0001  p    the read is paired in sequencing
0x0002  P    the read is mapped in a proper pair
0x0004  u    the query sequence itself is unmapped
0x0008  U    the mate is unmapped
0x0010  r    strand of the query (1 for reverse)
0x0020  R    strand of the mate
0x0040  1    the read is the first read in a pair
0x0080  2    the read is the second read in a pair
0x0100  s    the alignment is not primary
0x0200  f    the read fails platform/vendor quality checks
0x0400  d    the read is either a PCR or an optical duplicate

Since the flags are given in decimal representation in the SAM file, you can use this link to check which flag is set.
We are going to use SAMTools which provides various tools for manipulating alignments in the SAM/BAM format.
The SAM (Sequence Alignment/Map) format (BAM is just its binary form) is currently the de facto standard for storing large nucleotide sequence alignments. If you are dealing with high-throughput metagenomic whole-genome shotgun sequencing data, you will have to deal with SAM/BAM files. See what SAMtools has to offer:
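Running samtools without any arguments lists the available subcommands:

[uzi@quince-srv2 ~/linuxTutorial]$ samtools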

We can then use the SAMstat program to get statistics on our aln-pe.sam file:

[uzi@quince-srv2 ~/linuxTutorial]$ samstat aln-pe.sam

Running the above command will generate an aln-pe.sam.samstat.html file which you can open in your browser (be patient, it takes a bit of time to load). Plots such as "Reads Length Distributions" and "Base Quality Distributions" may be of interest to you.
Now we convert the SAM file to the binary BAM format:
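A sketch of the conversion (the -F 4 filter, which drops unmapped reads, is an assumption based on the aln-pe.mapped.* filenames used below; the sort syntax shown is for samtools >= 1.0, older versions use `samtools sort aln-pe.mapped.bam aln-pe.mapped.sorted`):

samtools view -bS -F 4 aln-pe.sam > aln-pe.mapped.bam
samtools sort -o aln-pe.mapped.sorted.bam aln-pe.mapped.bam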

We will now use bedtools, a very useful suite of programs for working with SAM/BAM, BED, VCF, and GFF files, all of which you will encounter many times doing NGS analysis. The -ibam switch takes the sorted, indexed BAM file we generated earlier, -d reports the depth at each genome position with 1-based coordinates, and -g provides the genome lengths file we generated earlier.
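A sketch of this step (assuming lengths.genome is built from the reference index; recent bedtools versions take the genome from the BAM header, so -g may be optional with -ibam):

samtools faidx Cdiff078.fa
cut -f1,2 Cdiff078.fa.fai > lengths.genome
bedtools genomecov -ibam aln-pe.mapped.sorted.bam -d -g lengths.genome > aln-pe.mapped.bam.perbase.count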
The coverage flags are explained pictorially in the genomecov man page:
Reference: http://bedtools.readthedocs.org/en/latest/_images/genomecov-glyph.png

We will now use the above file together with lengths.genome to calculate the proportion of each genome/contig covered, using the following one-liner. It reads lengths.genome line by line, assigns the genome identifier to myArray[0] and its length to myArray[1]. It then searches for the identifier in aln-pe.mapped.bam.perbase.count, extracts the base count, and uses bc to calculate the proportion.
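A sketch of that loop (treating positions with a non-zero depth in column 3 as covered):

while read -r line; do
  myArray=($line)    # [0] = genome/contig ID, [1] = its length
  count=$(awk -v g="${myArray[0]}" '$1==g && $3>0 {n++} END {print n+0}' aln-pe.mapped.bam.perbase.count)
  echo -e "${myArray[0]}\t$(echo "scale=5; $count / ${myArray[1]}" | bc)"
done < lengths.genome

The summary metrics discussed next carry Picard-style names (PF_MISMATCH_RATE and friends), so they were most likely produced by something like Picard's CollectAlignmentSummaryMetrics; a hedged sketch (the picard.jar path is a placeholder):

java -jar picard.jar CollectAlignmentSummaryMetrics \
    R=Cdiff078.fa I=aln-pe.mapped.sorted.bam O=alignment_metrics.txt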

A detailed description of these summary metrics is given here. From this link, PF_MISMATCH_RATE, PF_HQ_ERROR_RATE, and PF_INDEL_RATE are of interest to us.
As can be seen, the error rates are quite low and we can proceed with the analysis. Next we would like to calculate GC bias. For this purpose, we will index the aln-pe.mapped.sorted.bam file:
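The indexing command is the standard samtools one:

[uzi@quince-srv2 ~/linuxTutorial]$ samtools index aln-pe.mapped.sorted.bam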

Do not get intimidated by the perl one-liner in the above statement; I have extracted it from my GENERATEtable.sh script. If the data stream is of the form [Contig]\t[Feature]\t[Value], then you can pipe the stream to GENERATEtable.sh to obtain a Contig X Feature table:
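This is not the GENERATEtable.sh script itself, just a small awk sketch of the same pivoting idea (stats.tsv is a placeholder for the [Contig]\t[Feature]\t[Value] stream; note awk's for-in loop gives no guaranteed column order):

awk -F"\t" '{v[$1 SUBSEP $2]=$3; contigs[$1]; feats[$2]}
  END {
    printf "Contigs"
    for (f in feats) printf "\t%s", f
    printf "\n"
    for (c in contigs) {
      printf "%s", c
      for (f in feats) printf "\t%s", v[c SUBSEP f]
      printf "\n"
    }
  }' stats.tsv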

Another useful tool is Qualimap, which offers multi-sample BAM QC.
To use it, we need to generate an input.txt file that lists the BAM files we want to compare. To save time, I am only considering 3 files:
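A sketch (the sample names and BAM paths are placeholders; qualimap multi-bamqc expects one name/path pair per line, and -r tells it to run bamqc on each raw BAM first):

printf "M120\tM120.sorted.bam\nS1\tS1.sorted.bam\nS2\tS2.sorted.bam\n" > input.txt
qualimap multi-bamqc -d input.txt -r -outdir qualimap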

This will generate a qualimap folder containing a multisampleBamQcReport.html file that you can load in your browser. Do check the "PCA" and "Insert Size Histogram" plots.
You can use samtools flagstat to get mapping statistics:
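For example (run on whichever BAM you like; the mapped, sorted file from earlier is assumed here):

[uzi@quince-srv2 ~/linuxTutorial]$ samtools flagstat aln-pe.mapped.sorted.bam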