Data Processing Pipelines

Overview

The ENCODE Data Coordinating Center Uniform Processing Pipelines are designed to create high-quality, consistent, and reproducible data. Pipelines are composed of discrete steps that can represent an algorithm, a software tool, or a file format manipulation. These steps are applied to the primary data (generated from an experimental assay) to produce visualizable data. The ENCODE Data Coordinating Center has developed data processing pipelines for major assay types generated by the project: RNA-seq, RAMPAGE, ChIP-seq, DNase-seq, ATAC-seq, and WGBS.

Pipeline versioning

A processing pipeline is a set of analysis steps that may be versioned as changes are made to the code and software components. Entire pipelines may also be versioned.

There are major and minor step revisions. Minor step revisions are backwards compatible and should produce directly comparable results; these are annotated as step versions. Major step revisions result in a new pipeline version, though not all steps will change when a pipeline is versioned. Whenever a major change is made, all downstream steps must be versioned as well, because the inputs to downstream steps depend on the output of the new upstream steps. To see whether a pipeline or an analysis step has a new version, click the blue step boxes in the pipeline graph.
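The downstream-versioning rule can be illustrated with a small sketch. The step names, the (major, minor) version tuples, and the graph layout below are hypothetical and chosen only for illustration; they are not the ENCODE pipelines' actual data model.

```python
from collections import deque

# Hypothetical pipeline graph: each step maps to the steps that consume its output.
GRAPH = {
    "align": ["filter"],
    "filter": ["call_peaks", "signal"],
    "call_peaks": [],
    "signal": [],
}


def bump_minor(versions, step):
    """Backwards-compatible change: only the edited step's minor version moves."""
    updated = dict(versions)
    major, minor = updated[step]
    updated[step] = (major, minor + 1)
    return updated


def bump_major(graph, versions, changed_step):
    """Major change: bump the edited step and every step downstream of it."""
    updated = dict(versions)
    queue = deque([changed_step])
    seen = {changed_step}
    while queue:
        step = queue.popleft()
        major, _minor = updated[step]
        updated[step] = (major + 1, 0)
        for child in graph.get(step, []):
            if child not in seen:  # visit each downstream step once
                seen.add(child)
                queue.append(child)
    return updated


if __name__ == "__main__":
    start = {step: (1, 0) for step in GRAPH}
    print(bump_minor(start, "filter"))
    print(bump_major(GRAPH, start, "filter"))
```

In this sketch, a major change to the hypothetical "filter" step leaves the upstream "align" step untouched but re-versions "call_peaks" and "signal", since both consume the output of "filter".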

An important goal motivating the development of uniform processing pipelines is to maximize the degree to which data can be compared within and across assays. All data should be processed by directly comparable methods, and all result files of a given type (e.g. alignment BAM files) should be compatible. If older versions of results were released but new analysis steps were later adopted, an experiment may have two versions of the same file once the data is reprocessed.

RNA-seq measures RNA abundance, and RNA-seq data can be interpreted in terms of transcriptional activity and RNA stability. RNA-seq experiments contribute to our understanding of how RNA-based mechanisms impact gene regulation and thus disease and phenotypic variation. Since RNA populations are diverse, different assays are optimized to measure different RNA species, and the data from these assays are processed in specific ways.

RAMPAGE (RNA Annotation and Mapping of Promoters for the Analysis of Gene Expression) is a very accurate sequencing approach to identify transcription start sites (TSSs) at base-pair resolution, to quantify their expression, and to characterize their transcripts. RAMPAGE uses direct cDNA evidence to link specific genes and their TSSs.

Transcription factor ChIP-seq (TF ChIP-seq) specifically looks at proteins, such as sequence-specific transcription factors, which are thought to associate with specific DNA sequences to influence the rate of transcription. Histone ChIP-seq is sensitive to the histone content of chromatin, specifically to the incorporation of particular post-translational histone modifications in chromatin. The pipelines take input FASTQ files from replicated experiments and controls, as well as reference FASTA files for the initial read mapping. Both pipelines share the same mapping steps, but differ in the way the signal and peaks are called and in the subsequent statistical treatment of replicates.

DNA accessibility assays such as DNase-seq, ATAC-seq, FAIRE-seq, and MNase-seq are common assays that support the goals of the ENCODE project. DNase-seq maps DNase I hypersensitive sites and is considered to be an accurate method of identifying regulatory elements. ATAC-seq (Assay for Transposase Accessible Chromatin with high-throughput sequencing) is viewed as an alternative to DNase-seq and MNase-seq; it probes DNA accessibility with hyperactive Tn5 transposase, which inserts sequencing adapters into accessible regions of chromatin.

Whole-genome bisulfite sequencing (WGBS) is used to discover methylation patterns at single-base resolution. Bisulfite treatment converts unmethylated cytosines into uracils but leaves methylated cytosines unchanged. After mapping bisulfite sequencing reads against a C-->T transformed genome, this pipeline can extract the CpG, CHG and CHH methylation patterns genome-wide.
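As a rough illustration of two ideas in this paragraph, the sketch below performs an in silico C-to-T conversion of a reference sequence and classifies each cytosine by its sequence context (CpG, CHG, or CHH, where H is A, C, or T). It is a simplified sketch, not the ENCODE pipeline's code: the function names are hypothetical, only the forward strand is handled, and no read alignment or methylation calling is performed.

```python
def convert_c_to_t(seq):
    """Return the fully C-to-T converted version of a forward-strand sequence."""
    return seq.upper().replace("C", "T")


def cytosine_context(seq, i):
    """Classify the cytosine at index i as 'CpG', 'CHG', or 'CHH'.

    Returns None if position i is not a C, if there is not enough
    downstream sequence, or if the context contains an ambiguous base.
    """
    seq = seq.upper()
    if seq[i] != "C":
        return None
    downstream = seq[i + 1:i + 3]
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CpG"
    if len(downstream) < 2 or downstream[0] not in "ACT":
        return None
    if downstream[1] == "G":
        return "CHG"
    if downstream[1] in "ACT":
        return "CHH"
    return None


def list_contexts(seq):
    """Return (position, context) pairs for every classifiable cytosine."""
    pairs = []
    for i in range(len(seq)):
        context = cytosine_context(seq, i)
        if context is not None:
            pairs.append((i, context))
    return pairs


if __name__ == "__main__":
    reference = "ACGTCAGCTTC"  # toy sequence, made up for illustration
    print(convert_c_to_t(reference))
    print(list_contexts(reference))
```

The final cytosine in the toy sequence is left unclassified because it has no downstream bases, which is why a real analysis needs the flanking reference sequence around each position.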