What are BAM files?

You might not be familiar with all these formats, but the ones we are interested in for now are SAM and it’s compressed form BAM. We will refer here to BAM files, because these are the kind of files which are kept most often because they are smaller (there is a SAMtools utility for converting SAM to BAM and BAM to SAM).

SAM and BAM files contain information about the alignment of NGS reads to a reference genome. These files are produced by alignment software, which take as input:

the FASTQ files from the sequencing machine (either 1 file for a single-end sequencing sample, or 2 files for a paired-end sequencing sample).

an genomic index, which is typically produced by special software packaged with the alignment software. The genomic index is created from the reference genome. Sometimes the genomic index files for popular reference genomes can be downloaded.

Note: alignment software is typically application specific. In particular, alignment programs for RNA-seq are different than those for genomic DNA sequencing, because in the case of the former, it is expected that reads might fall on exon-exon boundaries. The read will not contain intron sequence, because it is typically the mature, spliced mRNA which is converted to cDNA and sequenced, and the introns are already spliced out of this molecule. This is not a concern for genomic DNA sequencing.

How to import NGS data using Rsamtools

We will use example BAM files from the pasillaBamSubset package to examine the Rsamtools functions:

Specifying: what and which

A number of functions in Rsamtools take an argument param, which expects a ScanBamParam specification. There are full details available by looking up ?scanBamParam, but two important options are:

what - what kind of information to extract?

which - which ranges of alignments to extract?

BAM files are often paired with an index file (if not they can be indexed from R with indexBam), and so we can quickly pull out information about reads from a particular genomic range. Here we count the number of records (reads) on chromosome 4:

We can pull out all the information with scanBam. Here, we specify a new BamFile, and use the yieldSize argument. This limits the number of reads which will be extracted to 5 at a time. Each time we call scanBam we will get 5 more reads, until there are no reads left. If we do not specify yieldSize we get all the reads at once. yieldSize is useful mainly for two reasons: (1) for limiting the number of reads at a time, for example, 1 or 2 million reads at a time, to keep within the memory limits of a given machine, say in the 5 GB range (2) or, for debugging, working through small examples while writing software.

reads<-scanBam(BamFile(filename,yieldSize=5))

Examining the output of scanBam

reads is a list of lists. The outer list indexes over the ranges in the which command. Since we didn’t specify which, here it is a list of length 1. The inner list contains different pieces of information from the BAM file. Since we didn’t specify what we get everything. See ?scanBam for the possible kinds of information to specify for what.

GenomicAlignments description

The GenomicAlignments package is described with:

Provides efficient containers for storing and manipulating short genomic alignments (typically obtained by aligning short reads to a reference genome). This includes read counting, computing the coverage, junction detection, and working with the nucleotide content of the alignments.

This package defines the classes and functions which are used to represent genomic alignments in Bioconductor. Two of the most important functions in GenomicAlignments are:

readGAlignments - this and other similarly named functions read data from BAM files

summarizeOverlaps - this function simplifies counting reads in genomic ranges across one or more files

The summarizeOverlaps function is covered in more depth in the read counting page. summarizeOverlaps is a function which wraps up other functions in GenomicAlignments function for counting reads.

Here we will examine the output of the readGAlignments function, continuing with the BAM file from the pasilla dataset.

Some of our familiar GenomicRanges functions work on GAlignments: we can use findOverlaps, countOverlaps and %over% directly on the GAlignments object. Note that location of ga and gr in the calls below:

gr<-GRanges("chr4",IRanges(700000,800000))(fo<-findOverlaps(ga,gr))# which reads over this range