This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

The Universal PBM Resource for Oligonucleotide-Binding Evaluation (UniPROBE) database is a centralized repository of information on the DNA-binding preferences of proteins as determined by universal protein-binding microarray (PBM) technology. Each entry for a protein (or protein complex) in UniPROBE provides the quantitative preferences for all possible nucleotide sequence variants (‘words’) of length k (‘k-mers’), as well as position weight matrix (PWM) and graphical sequence logo representations of the k-mer data. In this update, we describe >130% expansion of the database content, incorporation of a protein BLAST (blastp) tool for finding protein sequence matches in UniPROBE, the introduction of UniPROBE accession numbers and additional database enhancements. The UniPROBE database is available at http://uniprobe.org.

INTRODUCTION

A comprehensive understanding of gene-expression regulation requires the thorough characterization of transcription factor (TF)–DNA binding properties. TFs play central roles in transcriptional regulatory networks by binding specific DNA sequences and activating or repressing gene expression. Consequently, TF–DNA-binding specificities have a broad impact on cell physiology, development and evolution (1,2).

Advances in DNA microarray synthesis and in protein-binding microarray (PBM) technology (3,4) led to the development of universal PBMs (5), which allow high-throughput, comprehensive measurement of protein–DNA binding specificities. The resulting large data sets require curation and must be searchable. The Universal PBM Resource for Oligonucleotide-Binding Evaluation (UniPROBE) (6) database was created to satisfy these requirements. The original UniPROBE publication (6) describes the major differences between UniPROBE and the JASPAR (7), TRANSFAC (8) and PAZAR (9) databases, and also provides a detailed description of PBM technology and data types.

Since its inception 2 years ago, the UniPROBE database has continued to expand in size, utility and user base. UniPROBE previously housed data for 177 non-redundant proteins (6). That number has recently grown to over 400 non-redundant proteins or protein complexes, with additional, unpublished PBM data sets already planned for future deposition. Currently, the UniPROBE database averages 933 unique visitors per month (classified by IP address) from over 40 different countries and 3558 page views per month. UniPROBE is the standard for curating universal PBM data, and we invite other researchers generating universal PBM data to contact us about depositing their data in UniPROBE.

DATABASE ADDITIONS

UniPROBE has more than doubled in size since its introduction in January 2009 (6) (Table 1). As of this writing, in addition to the data deposited from the initial set of four publications (5,10–12), PBM data are included from six newer publications (13–18); additional published (19) and soon-to-be-published data are planned for deposition. The new additions include data on TFs from Caenorhabditis elegans, Saccharomyces cerevisiae, Mus musculus and Homo sapiens. The UniPROBE database now houses PBM data for 415 individual proteins or protein complexes, nearly all of which are TFs, corresponding to 404 non-redundant proteins or protein complexes.

Table 1. UniPROBE database contents, with indication of additions in PBM data sets since its introduction in 2009

NEW BLASTP SEARCH FEATURE

In the latest version of UniPROBE, the online search features have been augmented with a new tool that permits a user to perform a blastp (20) search of a protein sequence of interest (the ‘query protein’) against all protein sequences in the UniPROBE database (the ‘subject proteins’). This feature incorporates NCBI’s Protein–Protein BLAST tool (21), blastp v.2.2.23+, for accurate and efficient alignments. For each query protein, the tool returns a list of links to the Details pages of the subject proteins that either exactly match or are similar to the query, according to user-specified search parameter settings. The Details pages in turn allow further exploration and provide links for downloading the PBM data for the matching proteins.

Query protein sequences may be entered manually into a web-page form or uploaded as a text file. The sequence is parsed using fail-safe rules to interpret the format. Multiple sequences can be processed in batch either by entering one sequence per line or by entering FASTA-formatted sequences, which may span multiple lines but are separated by header lines. Numbers and unnecessary whitespace are stripped from the sequence prior to performing the search.
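The input rules described above can be illustrated with a minimal Python sketch. This is not the UniPROBE implementation; the function names `parse_queries` and `clean` are hypothetical, and the sketch assumes that the presence of any line beginning with '>' means the whole input is FASTA-formatted.

```python
import re


def clean(fragment):
    """Strip digits and all whitespace from a piece of sequence text."""
    return re.sub(r"[\d\s]", "", fragment)


def parse_queries(text):
    """Return a list of (name, sequence) pairs from raw form or file input.

    Assumption: if any line starts with '>', the input is treated as FASTA
    (sequences may span several lines); otherwise each non-empty line is
    taken as one complete sequence with an empty name.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    is_fasta = any(ln.startswith(">") for ln in lines)
    records = []
    for ln in lines:
        if ln.startswith(">"):
            # A header line opens a new record.
            records.append([ln[1:].strip(), ""])
        elif is_fasta:
            if not records:
                # Sequence text appearing before any header line.
                records.append(["", ""])
            records[-1][1] += clean(ln)
        else:
            records.append(["", clean(ln)])
    # Drop records that ended up with no sequence at all.
    return [(name, seq) for name, seq in records if seq]
```

Under these assumptions, residue numbering and spacing copied from a sequence record (for example `1 MKV` followed by `11 LLA`) are reduced to the bare residues before any search is run.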

For the subject proteins, the blastp search tool uses a database comprising all the clone insert sequences corresponding to all the PBM experiments with data curated in UniPROBE. For example, consider a search for the human TF GATA4, which is not currently in UniPROBE. Running the blastp tool on the human GATA4 sequence with default parameter settings (Figure 1) results in eight hits, four from yeast and four from mouse, all with the GATA DNA-binding domain (Figure 2). Among the hits, the tool correctly retrieves two hits to Gata3, which is represented in the database by two proteins: the full-length TF and just the DNA-binding domain. The blastp search parameter settings (E-value threshold, species, substitution matrix and word size) are passed directly to a local instance of NCBI’s blastp executable.
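As a rough illustration of passing search settings to a local blastp executable, the following Python sketch assembles a command line from the E-value threshold, substitution matrix and word size. The function name, file paths and parameter values are hypothetical, and the values shown are not claimed to be the UniPROBE defaults. The options `-query`, `-db`, `-evalue`, `-matrix`, `-word_size` and `-outfmt` are standard BLAST+ blastp options. How the species setting is applied is not described here, so the sketch omits it.

```python
def build_blastp_command(query_fasta, db_path, evalue, matrix, word_size):
    """Assemble an argument list for a local BLAST+ blastp run.

    All argument values are supplied by the caller; nothing here reflects
    the actual UniPROBE server configuration.
    """
    return [
        "blastp",
        "-query", query_fasta,         # FASTA file holding the query protein(s)
        "-db", db_path,                # protein database built from subject sequences
        "-evalue", str(evalue),        # E-value threshold
        "-matrix", matrix,             # substitution matrix name, e.g. BLOSUM62
        "-word_size", str(word_size),  # word size for the initial seed matches
        "-outfmt", "6",                # tabular output, one hit per line
    ]


# Hypothetical usage (requires BLAST+ to be installed and a formatted database):
# import subprocess
# cmd = build_blastp_command("gata4.fasta", "uniprobe_inserts", 0.001, "BLOSUM62", 3)
# result = subprocess.run(cmd, capture_output=True, text=True, check=True)
```

Building the command as a list rather than a single shell string avoids passing user-supplied text through a shell, which matters when the parameters originate from a web form.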

Figure 2. Results from blastp search of all protein sequences in UniPROBE using the human full-length GATA4 protein sequence as the query.

In the results, the aligned residues of each matching subject protein are highlighted in yellow. The offset of the first aligned residue of the query protein is also provided. As defined by blastp, the score is a measure of similarity, and the E-value is the number of matches expected by chance if the subject protein sequences were generated randomly.

UNIPROBE ACCESSION NUMBERS

A significant new feature is the addition of UniPROBE accession numbers. Each TF PBM data set now has its own UniPROBE accession number, regardless of whether its protein is unique in the database. An accession number consists of the prefix ‘UP’ (an abbreviation for ‘universal PBM’) followed by five digits, e.g. UP00350. Accession numbers are returned as part of the search results and are also listed on each protein’s Details page. A user can use the ‘Quick Search’ tool to find TFs by accession number. Accession numbers can be requested prior to publication of new PBM data sets, such as for unpublished PBM data sets in new article submissions.
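The accession format lends itself to simple validation in downstream scripts. The following Python sketch is illustrative only: the helper names are hypothetical, and it assumes the format is exactly the uppercase prefix 'UP' followed by five digits, as in the example above.

```python
import re

# 'UP' followed by exactly five digits, e.g. UP00350.
ACCESSION_RE = re.compile(r"UP\d{5}")


def is_accession(text):
    """Return True if text (ignoring surrounding whitespace) is a well-formed accession."""
    return ACCESSION_RE.fullmatch(text.strip()) is not None


def format_accession(number):
    """Render an integer in the range 0..99999 as a zero-padded accession string."""
    if not 0 <= number <= 99999:
        raise ValueError("accession number must fit in five digits")
    return f"UP{number:05d}"
```

Such a check confirms only that a string is well formed; whether the accession exists must still be determined by querying the database.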

OTHER NEW FEATURES

New to this version of UniPROBE is the inclusion of PBM data for protein complexes. This functionality was implemented to accommodate homodimer and heterodimer data for bHLH TFs from C. elegans (13). This feature allows the Details page to render data sets for the protein of interest and for each of the proteins with which the protein of interest dimerizes.

The UniPROBE statistics cited here were derived with the aid of several minor but useful enhancements. It is now possible to use ‘Text Search’ to find TFs by publication; TFs can be searched by species using the same tool. The search results now include the total number of TFs returned. To easily distinguish between separate, published PBM data sets for the same protein, a reference to the publication for each separate data set has been added to the bottom of all TF Details pages, along with the array design number(s).

FUTURE DIRECTIONS

Future updates planned for UniPROBE include additional user and administrative tools. Currently in development is a negative control sequence generator which, given an E-score threshold indicative of DNA-binding preference, will generate a random sequence of user-specified length that does not include any 8-mer whose score exceeds the given threshold for user-selected TFs and species in UniPROBE. Another planned feature is the display of sequence alignments resulting from the blastp searches of UniPROBE. Also under development are administrative tools to allow for self-deposition and automated pre-publication UniPROBE accession number requests. The template for the Details page will be generalized to support self-deposition of PBM data for protein complexes. Implementation of a newly designed database schema will facilitate these and other tools and will generally improve system performance. As always, we continue to encourage user registration and feedback for error reports and feature requests, some of which motivated the development of the new features described here.
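One way such a negative control generator could work is sketched below in Python. This is not the tool under development; the function names and the input format (a dictionary mapping each 8-mer to its highest E-score across the selected TFs) are assumptions. The sketch also assumes that the reverse complement of a high-scoring 8-mer should be excluded, since the DNA is double-stranded. Sequences are grown one base at a time, and a dead end triggers a restart.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def forbidden_kmers(escores, threshold):
    """Collect k-mers scoring above threshold, plus their reverse complements.

    escores: dict mapping k-mer -> E-score (assumed to be the maximum over
    the user-selected TFs; this input format is hypothetical).
    """
    bad = set()
    for kmer, score in escores.items():
        if score > threshold:
            bad.add(kmer)
            bad.add(reverse_complement(kmer))
    return bad


def negative_control(length, bad, k=8, rng=None, max_restarts=1000):
    """Build a random DNA sequence containing no k-mer from the set bad."""
    rng = rng or random.Random()
    for _ in range(max_restarts):
        seq = ""
        while len(seq) < length:
            # Keep only bases whose addition does not complete a forbidden k-mer.
            options = [b for b in "ACGT" if (seq + b)[-k:] not in bad]
            if not options:
                break  # dead end: discard this attempt and restart
            seq += rng.choice(options)
        else:
            return seq  # reached the requested length without a dead end
    raise RuntimeError("no valid sequence found; the threshold may be too strict")
```

Growing the sequence base by base rejects a forbidden 8-mer at the moment it would be completed, which is cheaper than generating whole random sequences and discarding those that contain one. When many 8-mers exceed the threshold, restarts become frequent and a backtracking search may be preferable.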

AVAILABILITY AND LICENSE

All data hosted by the PBM database are freely available for distribution at the database website. The sequences of the 60-mer DNA probes synthesized on the custom-designed universal arrays are available under the terms of the academic research use license available at http://thebrain.bwh.harvard.edu/uniprobe/academic-license.php.

FUNDING

Funding for open access charge: National Institutes of Health (grant number R01 HG003985 to M.L.B.).