Example Bioinformatics Pipeline

Today we are going to implement a solution for a task that features these problems.

Task: Get DNA sequences for entries in an annotation file

One might do genomic sequence analysis in order to study

motifs in UTR, promoter/upstream regions

Protein - codon usage

GC content

A common way to store gene annotations is with GFF (General Feature Format) files. There different versions of this format, including GTF (General Transfer Format), and the current version is 3.0. Despite the different versions, you may expect the following tab-delimited format:

You are now in your project space, which has 250Gb of space that isn't deleted.

Let's prepare a directory for the input data. The name data seems appropriate, but you may wish to be more specific.

$ mkdir data
$ cd data

Now we must go and get the data.

Get genomic sequence data

For C. elegans, we can access the publically-available sequence at https://genome.ucsc.edu/. The genome browser at the University of California, Santa Cruz is one of the leaders in genome annotation storage and analysis. They have many genomes, but your organism of interest may be maintained by a different project.

With wget

Make sure you copy the entire code block above, since it goes beyond the window.

The program wget is equivalent to using your internet browser, The program rsync has more capabilities (the files don't have to be on the WWW, for example).

Decompress the file

Either of the above commands will fetch the sequence in a Gzip-compressed, TAR file. This is a common format on the internet. TAR format is a handy way to package multiple files. Gzip is a standard compression method. To uncompress and extract the different chromosomes, you do:

tar -zxvf chromFa.tar.gz

This extracted seven files, one for each chromosome I-V (in roman numerals) and chromosome X (chrX) and the mitochondrial chromosome (chrM).

The program wc (word count) gives the number of lines, words, and characters in a file. Words are separated by whitespace, so the sequence lines only have one “word” per line. Hence, the first two columns of output are the same.

The third column, however, is useful.

Each file contains the characters of the header, e.g. >chrI, and the newline character (one per line). But the rest is the sequence count. This, the size of the C. elegans genome is roughly the value in the 3rd column total. It is on the order of 102 million basepairs.

Organize with directories

We now have the C. elegans genome sequence in FASTA format. Let's be a little more organized and put it in a directory.

Notice how we didn't need the ./ prefix to the program? By moving it to ~/bin, we can run it from whereever we are.

Thus, we have installed faFrag. Now we have the software to extract the specific subsequence out of a FASTA file, given chromosome positions (that are specified in our annotation file).

Let's begin coding our first script.

Matching file formats

The “canonical geneset” gtf file and the FASTA files we downloaded have a key discrepancy: the chromosome names. Anything from UCSC uses the convention of abbreviating the chromosome as “chr” with the number following it. Example: chrI for chromosome I (roman numeral one). Our gtf file doesn't follow that convention.

We could change our sequences, or our GTF file. For this exercise, we'll change the GTF file.

New command: sed

A handy command for this is sed, which stands for stream editor. We're going to use sed to make a new gene annotation file with the chromosome name changed to the “chr” convention.

USAGE

Sed takes a short, one-line instruction as its argument, and applies that instruction to either to a file argument or standard input <stdin>.

s - substitute

$ sed's/pattern/replacement/' filename

The parts of this instruction are separated by forward slashes (/).

s - substitute

pattern - an exact string, or a regular expression

replacement - an exact string that is substituted to matched specified by pattern

The replacement portion always ends with a forward slash, but can then be followed by further options (examples below).

Sed can take its input from a pipe instead a file, as in:

$ cmd |'s/pattern/replacement/'

pattern

The pattern can be an exact string. Say we want to replace MtDNA with chrM, the argument to sed would be:

$ sed's/MtDNA/chrM/' inputfile.gtf > outputfile.gtf

The pattern can include special characters in order to refine the selection. Common ones are:

$ sed's/^MtDNA/chrM/' inputfile.gtf > outputfile.gtf

The caret ^ means that the following text MtDNA must be at the beginning of the line for it to match. That's perfect for us, since it that's the text we need to change.

$ sed's/MtDNA$/chrM/' inputfile.gtf > outputfile.gtf

The dollar sign $ in this context means that the preceding text MtDNA must be at the end of the line for it to match. That's not the case for us, but may be useful in a different setting.

Note: none of these patterns accidentally match each other. If we needed to change III to chr3, however, we couldn't just use ^I, because it would match all of I, II, III, and IV. We would have to specify ^IV, ^III, and ^II individually before we could run ^I.

1 minute (default is 4 hours). Requesting less will make it easier to schedule our job in between others if it is busy.

We submit the job this way:

$ sbatch fix_chr_names.sbatch
Submitted batch job 1199849

This is not the same as running it with bash. It has submitted the job to the job scheduler. And is now placed in the queue. You can check on its status with:

$ squeue -u$USER

Also, look for the new file: chr_wormbase.gtf in your directory. I list my directories oldest-newest for this purpose. Newer files will be at the bottom:

$ ls-ltrh

Is your file there yet?

Once it is, we can move on to build the extraction code.

Candidate Genes

Let's choose some genes to look at. Oddly, there aren't common names or abbreviations in the file we downloaded, so we have to use the web interface to get the right gene identifiers.

At https://www.wormbase.org, we can use the search box for this function. If you have a gene that you study and think it might have an ortholog in C. elegans, try typing it in here. In general, developmental genes are highly conserved. Try searching for HOX.

My first hit was ceh-13. What we need is the WormBase ID.

To make sure you're getting results from C. elegans, click the link under species on the left.

#!/usr/bin/env bash#SBATCH --nodes=1#SBATCH --ntasks=1#SBATCH --time=1:00:00#SBATCH --qos=normal#SBATCH --output=pipeline-latest.out# pipeline.sbatch - Perform grep on a GFF file, then run faFrag to extract DNA for # each line returned by grep.annotations=chr_wormbase.gff
if[!-e"$annotations"]thenecho"Need to supply an annotation file">&2exit1fi# Specify Gene list file gene_lst=$1echo"using gene_lst=$gene_lst"## Perform grep -f on Gene list file to get annotation lines from the GFFgrep-f$gene_lst$annotations|grep five_prime_utr > tmpfile.gff
echo"temporary file contains $(wc -l tmpfile.gff) lines"### Loop over grep'd file## for each GFF linewhileread line
do## isolate column to specify chromosome ##chrom=$(echo"$line"|cut-f1)## isolate column to specify chrom_start ##chrom_start=$(echo"$line"|cut-f4)## isolate column to specify chrom_end ##chrom_end=$(echo"$line"|cut-f5)## --- fill in this section --- ##attributes=$(echo"$line"|cut-f9)## get the TRANSCRIPT ID out... There may be multiple transcripts for a given genegene_id=$(echo$attributes|cut-f4-d' ')## trim the quotes and semicolongene_id_trimmed=$(echo$gene_id|tr-d[\"\;])# have to use backslashes for some charactersin_fasta="c_elegans11/$chrom.fa"out_fasta="$gene_id_trimmed.fa"########### run the command! #############echo"Running faFrag on $gene_id_trimmed"
faFrag $in_fasta$chrom_start$chrom_end$out_fasta||echo">>>>>>>>>>>There was a problem with: '$line'<<<<<<<<<<<<<<<"done< tmpfile.gff
# can now delete tmpfile.gffrm tmpfile.gff

Running the command

We can now run the command with sbatch, passing in the argument of our gene list. It will be $1 inside the script.

$ sbatch gene_ids.lst
Submitted job XXXXXXX

You can watch the progress with:

$ squeue -u$USER
$ ls-ltrh
$ sacct -j XXXXXXX

The last one looks specifically at the specific job, but you have to paste-in the long ID returned by sbatch.

Did all of them work? IF not, can you tell why? What might we do to get the failed ones to work?