April 8, 2014

ENCODE Transcription factor tracks

I was trying to overlap differentially methylated CpGs with transcription factor binding sites for my data. I was puzzled to find that the ENCODE consortium, which has spent a huge amount of money on performing experiments and publishing papers, did not pay much attention to documenting the tracks and their naming structure in a quick reference that is easy to find. After a lot of browsing, I found the details of the transcription factor ChIP experiments on this webpage.

In this post I summarize how to download all the tracks and how to read the track names.

Transcription factor ChIP-seq tracks are available for individual download from this link. However, for anyone who wants to analyse all the tracks, it is painful to download each one and merge them while keeping track of which rows came from which track. Fortunately, I found that the Sartor lab has created a BED file merging all the ENCODE TF tracks. This file preserves the identity of each row by labelling it with its source track. It can easily be converted into a 'GRanges' object, either with custom functions in base R or with the import function from the rtracklayer package from Bioconductor.
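For readers working outside R, the same idea can be sketched in Python: read the merged BED file and group the intervals by their source label. This is only a minimal sketch. It assumes a tab-separated file whose fourth column holds the track label, which I have not verified against the Sartor lab file, and the function name `group_bed_by_source` and the example rows are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end)


def group_bed_by_source(lines: Iterable[str]) -> Dict[str, List[Interval]]:
    """Group BED intervals by the label in the 4th column (assumed layout)."""
    by_source: Dict[str, List[Interval]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # row carries no source label; ignored in this sketch
        chrom, start, end, source = fields[0], int(fields[1]), int(fields[2]), fields[3]
        by_source[source].append((chrom, start, end))
    return dict(by_source)


# Hypothetical rows, only to show the expected shape of the input.
example = [
    "chr1\t100\t200\twgEncodeAwgTfbsSydhK562CjunIfna6hUniPk\n",
    "chr1\t150\t300\twgEncodeAwgTfbsSydhK562CjunIfna6hUniPk\n",
    "chr2\t10\t50\thypotheticalOtherTrack\n",
]
peaks = group_bed_by_source(example)
```

With a real file, `group_bed_by_source(open(path))` would give one list of intervals per track, which can then be intersected with the CpG positions of interest.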

Another issue with these tracks is their naming structure. While the consortium ensured that every track name includes all the necessary information, the structure is not documented anywhere that is easy to find. After a few hours of browsing and collecting information, I worked out the structure of the TF track names as follows:

The list of tracks and their related metadata is available for download from this webpage (click the files.txt link).

The downloaded file is not uniformly tabulated, so I had to fiddle with it to make it uniform.
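As an illustration of that clean-up step, here is a minimal Python sketch. It assumes each line of files.txt has the form `filename<TAB>key=value; key=value; ...`, with different lines carrying different sets of keys. That layout is my assumption about the file, and the file names and values in the example are invented.

```python
from typing import Dict, Iterable, List, Tuple


def parse_metadata_line(line: str) -> Dict[str, str]:
    """Split one 'file<TAB>key=value; key=value' line into a dictionary."""
    file_name, _, attributes = line.rstrip("\n").partition("\t")
    record = {"file": file_name}
    for pair in attributes.split(";"):
        key, sep, value = pair.strip().partition("=")
        if sep:  # keep only well-formed key=value pairs
            record[key] = value
    return record


def to_uniform_table(lines: Iterable[str], missing: str = "NA") -> Tuple[List[str], List[List[str]]]:
    """Build a table with the same columns for every row, padding gaps."""
    records = [parse_metadata_line(l) for l in lines if l.strip()]
    keys = {k for r in records for k in r} - {"file"}
    columns = ["file"] + sorted(keys)
    rows = [[r.get(c, missing) for c in columns] for r in records]
    return columns, rows


# Hypothetical lines; real entries carry many more keys.
example = [
    "a.narrowPeak.gz\tcell=K562; antibody=c-Jun; treatment=IFNa6h\n",
    "b.narrowPeak.gz\tcell=HepG2; antibody=CTCF\n",
]
columns, rows = to_uniform_table(example)
```

Rows that lack a key (here the second line has no treatment) are padded with `NA`, so every row ends up with the same columns.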

Further, I extracted only the details needed to understand the file/track names. You may download this file from here.

Here I will explain the name of one track as an example:
Track name = wgEncodeAwgTfbsSydhK562CjunIfna6hUniPk
Every track name includes the following elements:

Lab=Sydh

Other labs appearing at this position in the track names are Haib, UChicago, Uta and UW.

Cell line used=K562

There are over 92 cell lines used, so this part is highly variable. Details of the cell types used are available from this page.

Antibody used = c-Jun

There are over 190 antibodies used (= TFs probed). This is also a variable part of the track name. In some cases, the catalog number of the purchased antibody is included.

Treatment=Ifna6h (meaning cells were treated with IFNA for 6 hrs)

This part gives details of the cell treatment. When cells are not subjected to any treatment, this part is omitted. Overall, 29 distinct values appear at this position. When the treatment is for 36h, it is taken as standard and not mentioned in the name of the track.

Algorithm used=UniPk (Uniform peak calling). Common for every track!
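The naming structure described above can be turned into a small parser. The following Python sketch is a rough illustration, not a complete decoder. It assumes that every name starts with `wgEncodeAwgTfbs` and ends with `UniPk`, that the lab is one of the values listed in this post (matched case-insensitively), and that each remaining element starts with a capital letter followed by lowercase letters or digits. Names that embed an antibody catalog number may not split cleanly under these assumptions. The second example name is made up.

```python
import re
from typing import Dict, Optional

PREFIX = "wgEncodeAwgTfbs"
SUFFIX = "UniPk"
# Labs mentioned in this post; the real list may be longer.
LABS = ["Sydh", "Haib", "UChicago", "Uta", "UW"]


def parse_track_name(name: str) -> Dict[str, Optional[str]]:
    """Split a TF track name into lab, cell line, antibody and treatment."""
    if not (name.startswith(PREFIX) and name.endswith(SUFFIX)):
        raise ValueError(f"unexpected track name: {name}")
    core = name[len(PREFIX):-len(SUFFIX)]

    # Find the lab by case-insensitive prefix match, longest name first.
    lab = next(
        (l for l in sorted(LABS, key=len, reverse=True)
         if core.lower().startswith(l.lower())),
        None,
    )
    if lab is None:
        raise ValueError(f"unknown lab in track name: {name}")
    rest = core[len(lab):]

    # Assumed convention: each element is one capitalised token.
    tokens = re.findall(r"[A-Z][a-z0-9]*", rest)
    if len(tokens) < 2:
        raise ValueError(f"cannot find cell line and antibody in: {name}")
    cell, antibody = tokens[0], tokens[1]
    treatment = "".join(tokens[2:]) or None  # absent when cells are untreated
    return {"lab": lab, "cell": cell, "antibody": antibody,
            "treatment": treatment, "algorithm": SUFFIX}


parsed = parse_track_name("wgEncodeAwgTfbsSydhK562CjunIfna6hUniPk")
# Hypothetical name without a treatment element.
untreated = parse_track_name("wgEncodeAwgTfbsHaibK562CjunUniPk")
```

For the example track from this post, `parsed` holds the lab `Sydh`, cell line `K562`, antibody `Cjun` and treatment `Ifna6h`. For the untreated name, the treatment is `None`.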

I hope this information is useful for others doing a similar analysis.

Update (26th June 2014)

Here are a few more links that give additional information about the ENCODE transcription factor ChIP experiments.

For those interested in a comprehensive list of transcription factors across various genomes, this link may be useful. Note, however, that these are predicted transcription factors and are not human curated. http://www.bioguo.org/AnimalTFDB/index.php