Summary

Better protocols and decreasing costs have made high-throughput sequencing experiments now accessible even to small experimental laboratories. However, comparing one or few experiments generated by an individual lab to the vast amount of relevant data freely available in the public domain might be limited due to lack of bioinformatics expertise. Though several tools, including genome browsers, allow such comparison at a single gene level, they do not provide a genome-wide view. We developed Heat*seq, a web-tool that allows genome scale comparison of high throughput experiments (ChIP-seq, RNA-seq and CAGE) provided by a user, to the data in the public domain. Heat*seq currently contains over 12,000 experiments across diverse tissue and cell types in human, mouse and drosophila. Heat*seq displays interactive correlation heatmaps, with an ability to dynamically subset datasets to contextualise user experiments. High quality figures and tables are produced and can be downloaded in multiple formats.

1 Introduction

High throughput sequencing is now becoming routine for many biological assays including transcriptome analysis through RNA sequencing (RNA-seq), or transcription factor (TF) binding sites identification through chromatin immuno-precipitation followed by sequencing (ChIP-seq). Additionally, collaborative projects such as Bgee (Bastian et al.,) ENCODE (Bernstein et al., 2012), and Roadmap Epigenomics (Kundaje et al., 2015) have generated genome-wide datasets across hundreds of cell types or tissues. Despite this large data being freely available in the public domain, the lack of computational tools accessible to experimental scientists with no or elementary computational skills prohibits the use of this data to its full potential for discovery.

Though genome browsers, including summary tracks provided by many consortia, are extremely useful to study a few genes, promoters, or single nucleotide polymorphisms, they lack the genome-wide overview. Only a few public resources such as the CODEX database (Sanchez-Castillo et al., 2015a) and the BLUEPRINT GenomeStats tool (Zerbino et al., 2014) allow a genome-wide comparison with the user data. We therefore developed Heat*seq, a free, open source, web application providing fast and interactive comparison against high throughput sequencing experiments in the public domain. Users can upload a processed text file containing either gene expression value (FPKM or TPM), peak coordinates, or peak coordinates and corresponding expression value (CPM) for CAGE. The application provides clustered correlation heatmaps, summarising global similarities between all samples in the dataset and the user sample. Heat*seq provides over 12,000 publicly available genome-wide experiments in human, mouse and drosophila for fast and interactive comparison. In summary, Heat*seq is an interactive web tool that allows users to contextualise their sequencing data with respect to vast amounts of public data in a few minutes without requiring any programming skills.

2.2 Web-application development

Heat*seq is an R shiny open source interactive tool which computes correlation values between the user file and each experiment in a dataset. Detailed user instructions are on the application website.

3 Results

3.1 Application description

Heat*seq tool supports three data types: HeatRNAseq, HeatChIPseq and HeatCAGEseq. Data upload, correlation calculation and heatmap generation takes about a minute. Importantly, users can interactively sub select relevant experiments using the metadata information (e.g. cell type, TF name). The interactive heatmap also allows selecting different clustering methods as well as zooming in and out on the heatmap. The high resolution figures and tables can be downloaded in multiple formats. Thus, Heat*seq provides global overview of relationships between public experiments and the user data. Four user scenarios are discussed below.

Correlations heatmaps from Heat*seq. A. ERα ChIP-seq in MCF7 cells from Zhuang et al. is closer to ENCODE ERa ChlP-seq in T-47D than in ECC-1 cells. B. BRF1 and RNA PolIII bind tRNA genes, but nor BRF2. C. c-MYC ChIP-seq in H1-hESC from UT-A and Stanford show low correlation. The color key next to B is for A, B and C. D. Two erythroblast RNA-seq samples from BLUEPRINT are closely related to endothelial cells.

3.2 User scenarios

3.2.1 User data quality control

We compared a Neocortex, 10 days post-partum (Ray et al., 2015) RNA-seq sample with Bgee mouse RNA-seq data using HeatRNAseq. The top five correlation values (PCC > 0.9) correspond to Bgee brain samples (Table 2). Thus, Heat*seq can be used as a fast data quality check for next generation sequencing data.

Top 5 experiments in the Bgee mouse RNA-seq dataset with the highest correlation value with a Neocortex RNA-seq experiment.

3.2.2 Cell context identification

An oestrogen receptor (ER) alpha ChlP-seq in MCF7 cells (Zhuang et al., 2015) comparison to the ENCODE TFBS dataset by sub-selecting ENCODE ER ChlP-seq experiments revealed that the binding pattern of ERa in MCF7 cells was more similar to its binding pattern in T-47D cells than in ECC-1 cells (Figure 1A). MCF7 and T-47D were derived from mammary tumours while ECC-1 is an endometrial cell line.

3.2.3 New hypotheses by data integration

CpG islands (CGI) from the UCSC (Karolchik et al., 2004) comparison to HeatChIPseq found that RNA polymerase II and TAF1 (Table 3) were enriched at CGIs, as about 50% of human gene promoters contain a CGI (Illingworth and Bird, 2009). Interestingly we identified factors avoiding CGIs including MAFK, GATA3 and ZNF274.

Similarly, tRNA promoters were highly correlated with RNA polymerase III, and its co-factors BDP1, RPC155, and BRF1 using HeatChIPseq. Interestingly, comparison with BRF family data revealed that BRF1, but not BRF2 was bound at tRNA genes (Figure 1B).

Top and bottom 5 experiments in the ENCODE human TF ChIP-seq dataset with the highest and lowest correlation value with CpG island coordinates.

3.2.4 Public data assessment

Heat*seq can be used to assess data in the public domain, highlighted by two examples below amongst others:

A MYC ChIP-seq in H1-hESC cells does not cluster with other ENCODE MYC ChIP-seq experiments (Figure 1C), including H1-hESC sample from a different experimental group (Devailly et al., 2015).

Two out of seven erythroblast RNA-seq samples from the Blueprint Epigenome consortium are more correlated with endothelial cells than with the rest of the erythroblast samples (Figure 1D).

4 Conclusion

With Heat*seq, comparing RNA-seq, ChIP-seq or CAGE experiments to hundreds of publicly available datasets becomes a trivial task. Researchers can now investigate the relationships between various high-throughput sequencing experiments fast and interactively without requiring any programming skills. Such analysis can assess data quality, cell variability, and generate novel regulatory hypotheses.

Funding

AJ is a Chancellor's fellow and AJ lab is supported by institute strategic funding from Biotechnology and Biological Sciences Research Council (BBSRC, BB/J004235/1). GD is funded by the People Programme (Marie Curie Actions FP7/2007-2013) under REA grant agreement No PCOFUND-GA-2012-600181. Conflict of Interest: none declared1.

Acknowledgements

We would like to thank Barry Horne for the R shiny server set-up and administration and the Edinburgh R user group (EdinbR) for their advice and support.