CSAM: Compressed SAM Format

Alistair Moffat
Department of Computing and Information Systems
The University of Melbourne,
Victoria 3010, Australia.

Andrew Turpin
Department of Computing and Information Systems
The University of Melbourne,
Victoria 3010, Australia.

Status

Bioinformatics, 32(24):3709-3716, 2016.

Abstract

Motivation:
Next-generation sequencing machines produce vast
amounts of genomic data. For the data to be useful,
it is essential that it can be stored and manipulated efficiently.
This work responds to the combined challenge of compressing genomic
data while providing fast access to regions of interest, without
requiring whole files to be decompressed.

Results: We describe CSAM (Compressed SAM format), a
compression approach offering lossless and lossy compression for
SAM files.
The structures and techniques proposed are suitable for representing
SAM files, as well as supporting fast access to the compressed
information.
They generate more compact lossless representations than BAM,
which is currently the preferred lossless compressed
SAM-equivalent format, and they are self-contained: they do
not depend on any external resources to compress or decompress SAM
files.