CRAM: The Genomics Compression Standard

CRAM: A COMMUNITY EFFORT

As genomic sequencing increases around the globe, it becomes vital to store the resulting data efficiently and sustainably. GA4GH’s CRAM file format for genomic data compression tackles this challenge and helps facilitate global collaboration.

Scroll through the videos below to learn how it has benefited existing users and how you can get involved, too.

“CRAM is a fundamental part of the GA4GH suite of standards. It’s how we think about storing DNA sequence and it works as a package with other standards to allow scientists, healthcare professionals, and commercial researchers to access the information they want when they want it.”

Ewan Birney

Chair of GA4GH, Director of EMBL-EBI

WHAT IS CRAM?

CRAM is a file format that uses various algorithms to compress the data it stores. Some of these algorithms are general-purpose; others exploit the fact that most human genomes are very similar to the reference human genome, so a sequence can be stored largely as its differences from that reference.
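To illustrate the reference-based idea, here is a minimal Python sketch. It is a toy model, not the CRAM encoding itself: the function names `encode_read` and `decode_read` and the record layout are assumptions made for this example, and it handles only single-base mismatches, whereas the real format also has to deal with insertions, deletions, quality scores and more.

```python
def encode_read(reference, pos, read):
    """Store a read as (position, length, mismatches) relative to the reference.

    Hypothetical helper for illustration only; not part of any CRAM library.
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    # Keep only the bases that differ from the reference.
    diffs = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if base != ref_base
    ]
    return pos, len(read), diffs


def decode_read(reference, record):
    """Rebuild the original read from the reference plus the stored differences."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGCCGATAGGCTA"
read = "ACGTTAGACGAT"  # matches the reference at position 4 except for one base

record = encode_read(reference, 4, read)
restored = decode_read(reference, record)
```

In this example a 12-base read shrinks to a position, a length and a single mismatch, and `decode_read` recovers it exactly as long as the same reference is available. That dependence on the reference is why tracking which reference genome was used matters.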

CRAM files store data in columns that are aligned to the reference sequence, allowing users to extract information efficiently from particular regions of particular chromosomes, which is one of the major use cases of DNA data.
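The following Python sketch shows, under simplifying assumptions, why a column layout keyed by chromosome makes region queries cheap. The class `ToyColumnStore` and its field names are hypothetical and exist only for this illustration; an actual CRAM file organizes its data differently and is normally read through existing tools rather than code like this.

```python
from collections import defaultdict


class ToyColumnStore:
    """Toy column store: one list per field, grouped by chromosome.

    Hypothetical illustration only; this is not how a CRAM file is laid out on disk.
    """

    def __init__(self):
        # Each chromosome gets its own set of parallel columns.
        self._columns = defaultdict(lambda: {"pos": [], "name": [], "mapq": []})

    def add(self, chrom, pos, name, mapq):
        """Append one aligned read to the columns of its chromosome."""
        columns = self._columns[chrom]
        columns["pos"].append(pos)
        columns["name"].append(name)
        columns["mapq"].append(mapq)

    def fetch(self, chrom, start, end, field):
        """Return one field for reads starting in [start, end) on one chromosome."""
        columns = self._columns.get(chrom)
        if columns is None:
            return []
        # Only the position column and the requested column are touched.
        return [
            value
            for pos, value in zip(columns["pos"], columns[field])
            if start <= pos < end
        ]


store = ToyColumnStore()
store.add("chr1", 100, "read1", 60)
store.add("chr1", 250, "read2", 30)
store.add("chr2", 120, "read3", 60)

names_in_region = store.fetch("chr1", 200, 300, "name")
```

A query for one field in one region never reads the other chromosomes or the other columns, which is the property the paragraph above describes.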

“Even if an organization is not producing large volumes of data, it can be very beneficial to use the same file formats that other organizations are using. The community has a vital role in making sure that a standard is suitable for everybody and not just one individual or one group.”

“The CRAM file format is essential toward reducing the footprint of genomic data files enabling more efficient, large-scale analyses and queries and also supporting population scale sized projects such as Genomics England. Importantly, the CRAM format was developed by genomics experts within the community to solve a unique challenge in scaling experiments and applications. The adoption of CRAM shows the user benefit when tools are developed by the community for the community.”

Susan Tousi

Illumina, Inc.

Benefits of CRAM

Reduces disk space and storage costs by 30-50%

Interoperable with other industry standards and best practices

Accurately tracks reference genome, improving integration with the field

Makes it easy to transfer and share data with collaborators

Continuous community effort to upgrade the format

Free to the community

“It becomes absolutely critical that [genomic] data is in a format that can be easily and effectively shared amongst many investigators. In other words, it really becomes very wasteful if an investigator has to go and reprocess or reformat the data every time they get a new dataset.”

Stacey Gabriel

The Broad Institute of MIT and Harvard; NIH All of Us Research Program

Start the Conversation at your Institute

There are many ways to get involved. Start the dialogue at your organization or institute around using CRAM.

“Open and unencumbered file formats are essential for the science and growing commerce of this industry. The innovations in genomics that will serve humanity are in the interpretation of the data. GA4GH standards allow data and algorithm sharing between institutions, vital for a vibrant informatics ecosystem.”