Abstract

We report the production and availability of over 7000 fully sequence verified plasmid ORF clones representing over 3400 unique human genes. These ORF clones were derived using the human MGC collection as template and were produced in two formats: with and without stop codons. Thus, this collection supports the production of either native protein or proteins with fusion tags added to either or both ends. The template clones used to generate this collection were enriched in three ways. First, gene redundancy was removed. Second, clones were selected to represent the best available GenBank reference sequence. Finally, a literature-based software tool was used to evaluate the list of target genes to ensure that it broadly reflected biomedical research interests. The target gene list was compared with 4000 human diseases and over 8500 biological and chemical MeSH classes in ∼15 Million publications recorded in PubMed at the time of analysis. The outcome of this analysis revealed that relative to the genome and the MGC collection, this collection is enriched for the presence of genes with published associations with a wide range of diseases and biomedical terms without displaying a particular bias towards any single disease or concept. Thus, this collection is likely to be a powerful resource for researchers who wish to study protein function in a set of genes with documented biomedical significance.

Funding: We would like to thank Sanofi-Aventis and the Breast Cancer Research Foundation for generous support of this project. The project has partly been funded by the German Ministry of Science, NGFN2 (grants 01GR0413 and 01GR0420).

Competing interests: The authors have declared that no competing interests exist.

Introduction

The study of protein function often demands high quality plasmid clones that contain the relevant open reading frames (ORFs) in a format compatible with protein expression. Increasingly, high throughput methods have created the demand for clones that encode a class of proteins of interest or the entire proteome of a species. Functional studies rely on in vivo expression for phenotypic studies or expression and purification by various means for biochemical analysis. Utilizing recombinational cloning vectors and including only the coding sequences, with all untranslated sequences removed, ensures maximum flexibility, including protein expression in a broad experimental range with various tagging options for either end of the protein. In addition, to avoid erroneous or ambiguous results regarding the expressed proteins, it is important that the plasmids are clonal isolates that are fully sequence verified.

For many eukaryotic species, including humans, the number of protein coding sequences exceeds 15,000 genes, making the production of comprehensive sequence-verified ORF clone collections daunting and expensive. In fact, a complete set of source material for expressed genes in humans does not yet exist [1]–[3]. One strategy is for researchers to focus on meaningful subsets of genes for functional studies relevant to the biological questions they wish to address. For a human ORF collection, the criteria for selecting genes are mostly driven by researchers' interest and clone availability, often resulting in either collections of special interest [4][5] or more ‘random’ lists of genes in collections (RZPD, Invitrogen).

In recent years, a publicly funded project, the Mammalian Gene Collection (MGC), aimed to create for multiple species, but especially for man and mouse, collections of well annotated, fully sequence validated cDNA clones [6]. However, the MGC clones cannot easily be employed directly in functional proteomics experiments because they are in many different vector backbones and contain 5′ and 3′ untranslated sequences. On the other hand, because they are fully sequenced and well annotated, these clones provide an excellent starting point for creating ORF clones. At least one such ORF set has been made so far, although that set comprises pools of clones that are not sequence verified [7][8] and thus has potential ambiguity. Currently, there are also four human ORF collections available from commercial distributors that were clonally isolated and at least partially sequence validated. The recently created ORFeome Collaboration (http://www.orfeomecollaboration.org/) [9] is a project planned to bring to all researchers an ORF clone collection that provides at least one representative ORF clone for all human genes, similar in quality and scope to the MGC clones, with all clones being fully sequence validated.

A limitation of the recombinational cloning vectors used for these ORF clones is that each clone must be committed to one of two non-interchangeable formats: closed (with stop codon; can express native protein) or fusion (no stop codon; enables the addition of carboxyl-terminal fusion peptides). As each format has unique advantages not available for the other, the ideal collection would include both.

Previously we reported the production of two smaller human clone sets in the Creator™ system. One set focused on kinase genes, both well-studied and novel or hypothetical ones [5]; the other clone set covered over 1000 genes associated with breast cancer [10], identified in publications using software developed in our group [11]. Here we report the production and complete sequence validation of 7000 clones, HFLEX7000 (>3400 unique human genes in two formats), and the distribution of these genes with respect to their relationship to disease and biological terms in publications in PubMed.

Results

Gene Selection

To make the most useful ORF clone set of the MGC clones, we wished to select an enriched set of genes that is of particular interest to both medicine and biology. In addition, we wished to exclude clones that corresponded to partial gene products and to eliminate redundancy. We first excluded the subset of all MGC clones where the CDS length was less than 90% of the length of the longest corresponding NCBI RefSeq sequence [12]–[14]. We then removed redundancy within the MGC clone set, and picked the clone closest to the longest reference sequence by CDS length as a best MGC representative for each gene. This reduced the number of candidate template clones from 13493 to 7992, representing 7992 genes.
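
The two filtering steps described above can be sketched as follows. This is a minimal illustration, not the code used in the project: the function name, the tuple layout of `clones`, and the decision to skip clones lacking a RefSeq entry are all assumptions made for the example.

```python
from collections import defaultdict


def select_template_clones(clones, longest_refseq_cds, min_fraction=0.9):
    """Pick one template clone per gene.

    clones: iterable of (clone_id, gene_id, cds_length) tuples (hypothetical layout).
    longest_refseq_cds: dict mapping gene_id to the CDS length of its longest RefSeq.
    """
    candidates = defaultdict(list)
    for clone_id, gene_id, cds_length in clones:
        ref_length = longest_refseq_cds.get(gene_id)
        if ref_length is None:
            continue  # assumption: clones without a reference are not considered
        if cds_length < min_fraction * ref_length:
            continue  # likely partial gene product: CDS shorter than 90% of reference
        # Rank candidates by distance to the longest reference CDS length.
        candidates[gene_id].append((abs(ref_length - cds_length), clone_id))
    # Keep the closest clone per gene; ties fall back to clone_id order.
    return {gene: min(ranked)[1] for gene, ranked in candidates.items()}
```

For example, with a 1000 bp reference CDS, clones of 850, 990 and 1000 bp for the same gene would leave only the 1000 bp clone as the representative, and the 850 bp clone would be excluded as partial.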

Our discussions with researchers indicated that a focused set of genes in both formats (closed and fusion) would be of more value than a large set in only one format. To ensure that our final gene set (Supplementary Table S1) was enriched for genes related to human diseases without any specific bias, the candidate list was used to query MedGene [11] for genes associated with about 4000 human diseases. As described, MedGene is an automated literature-mining tool, which comprehensively summarizes and estimates the relative strengths of all human gene-disease relationships reported in Medline/PubMed. The result of this query was compared with queries using either all unique genes represented in MGC or all ∼33,000 human genes listed at the time in LocusLink (2004, now: EntrezGene [15]). As shown for a subset of diseases in Figure 1 and Table 1 (complete list: Supplementary Table S2), the resulting target list: (a) was highly enriched for the presence of genes with published associations with a wide range of human diseases; (b) had a similar relative ratio among the various diseases to that of both the genome and the MGC; and (c) displayed a broad overlap among different diseases, allowing multiple diseases to be addressed with this set of ORF clones.

The clone target list (HFLEX7000) was compared with all human genes (EntrezGene, 2004) and all genes represented by MGC (2004) with respect to published relationships of the genes to human diseases. The targeted genes reveal similar proportionality to the other gene lists but a general enrichment of genes related to diseases (Table 1; Supplementary Table S2).

In addition to disease relationships, we expanded our evaluation to include other search terms relevant for biological research by employing a new database, BioGene, which is based on a similar concept to MedGene. Instead of disease terms, BioGene has a co-citation index for all human genes with all biological and chemical Medical Subject Heading (MeSH) classes (http://www.nlm.nih.gov/mesh), such as “lipids”, “pain” and “tetrahydrofolates”, and is available at http://biogene.med.harvard.edu/BIOGENE/. As shown in Figure 2; Table 2 for 34 biological MeSH classes (complete listing for all analyzed MeSH terms in Supplementary Table S3), the candidate list is enriched for genes linked to all biological MeSH terms in the literature, but proportional to that of the entire MGC clones and to the entire genome.

The clone target list (HFLEX7000) was compared with all human genes (EntrezGene, 2004) and all genes represented by MGC (2004) with respect to published relationships of the genes to all biological MeSH terms and MeSH nodes (34). The targeted genes reveal similar proportionality to the other gene lists but a general enrichment of genes related to MeSH terms (Table 2; Supplementary Table S3).

Thus, the target set of 3557 genes had a similar overall distribution of genes to that of the MGC and the human genome, but in general had a higher representation of genes that have been linked to both diseases and biological terms in the literature.
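
The comparison between gene lists can be illustrated with a simple ratio of association fractions. The sketch below is an assumption about how such a relative enrichment might be computed; it does not reproduce the MedGene or BioGene scoring, and the function names are hypothetical.

```python
def association_fraction(gene_list, associated_genes):
    """Fraction of genes in gene_list with a published association to a term."""
    genes = set(gene_list)
    if not genes:
        raise ValueError("gene_list is empty")
    return len(genes & set(associated_genes)) / len(genes)


def relative_enrichment(target_genes, background_genes, associated_genes):
    """Ratio of the association fraction in the target list to that in a background.

    Values above 1 indicate that the target list is enriched for the term
    relative to the background (for example, MGC or the whole genome).
    """
    background = association_fraction(background_genes, associated_genes)
    if background == 0:
        raise ValueError("no associated genes in the background list")
    return association_fraction(target_genes, associated_genes) / background
```

Computing this ratio for every disease or MeSH class and finding values that are consistently above 1, but similar to one another, would correspond to the pattern reported here: general enrichment without a bias towards any single term.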

Clone Production and Sequence Validation

Production of Clone Collection.

We generated the ORF clones via a processing pipeline that relies heavily on the use of robotics and is supported by the FLEXGene LIMS to produce clones in a highly automated, efficient, and accurate manner as published previously [5], [10], [16], [17].

The process of converting MGC cDNA clones into ORF clones (Figure 3) was initiated by populating our production tracking database (FLEXGene) with the relevant MGC information, e.g. IMAGE ID, GI number, clone sequence, start/stop of CDS, CDS length, gene information, plate and position in IMAGE/MGC collection. All ORFs were normalized to start with ATG, and natural stop codons either to TAG, or, for the format without C-terminal stop, to TTG (Leu). PCR amplicons were gel purified and captured using the In-Fusion™ enzyme into a modified recombinational cloning vector, pGWNcoXho, which increased the efficiency of capturing DNA fragments larger than 1.5 kb [16]. After transformation into E. coli, constructs were clonally selected and isolated. In total, we successfully produced clonal glycerol stocks for 3,528 of 3,557 targeted genes, an overall success rate of ∼98% (Table 3).

The entire production process from the design of primers to production of glycerol stocks is shown. The process started by identifying MGC clones in the available plates and then creating array files along with matching PCR primer order files that included two primers anchored at the 3′ end, one for each format. The primers were used to amplify the ORFs from the matching MGC clones. PCR products were monitored in agarose gels, and products were purified prior to capture via In-Fusion reaction. Competent bacterial strains were transformed with the reaction, followed by the robotic isolation of 4 resulting colonies per format, which were used to prepare 15% glycerol stocks. Prior to sequencing, a single isolate plate of 96 targets was created. As indicated, step-specific results were stored in our LIMS.

DNA sequence analysis and clone acceptance.

Based on a pilot study of 96 genes in which we sequenced all available isolates (4 closed and 4 fusion), we expected that sequencing a single isolate would yield a valid clone in about 90% of cases. Thus, it was most efficient to sequence a single isolate for each attempted clone format and return to evaluate additional isolates for any that failed. Clones were accepted if they had no truncation mutations, no frameshift mutations and no more than one single amino acid difference with the reference sequence. Clones with any nucleotide changes in the att sequences were rejected, as changes in these regions could make it impossible to transfer the ORF into expression vectors (Table 3).
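
The CDS part of these acceptance criteria can be sketched as a small check. This is an illustrative simplification, not the in-house software: it assumes clone and reference CDS are already extracted, contain only A, C, G and T, and that any length change is treated as a rejection; the att-site check is not covered. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate complete codons of an upper-case A/C/G/T sequence."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def evaluate_clone(clone_cds, ref_cds, max_aa_diffs=1):
    """Return 'accept' or a rejection reason for one sequenced clone."""
    clone_cds, ref_cds = clone_cds.upper(), ref_cds.upper()
    if len(clone_cds) != len(ref_cds):
        # Assumption: insertions and deletions are rejected outright.
        return "reject: length change (possible frameshift)"
    # Ignore the final codon, which is the stop (closed) or Leu (fusion) position.
    clone_protein = translate(clone_cds)[:-1]
    ref_protein = translate(ref_cds)[:-1]
    if "*" in clone_protein:
        return "reject: truncation"
    differences = sum(a != b for a, b in zip(clone_protein, ref_protein))
    if differences > max_aa_diffs:
        return "reject: too many amino acid differences"
    return "accept"
```

The `max_aa_diffs` threshold is left as a parameter so that the tolerated number of amino acid differences can be set explicitly.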

All rejected clones were manually inspected including a BLAST-based comparison to GenBank/EMBL to assess whether the clone matched any other entry for this gene. This step helped to rescue some rejected clones which were found to be ultimately acceptable due to updated MGC sequence entries.

Consistent with the pilot study, 6963 (90%) of the sequenced clones were acceptable based on the above criteria, with the vast majority (6669, or 95.8%) matching exactly to the reference sequence (Table 3, for complete listing of clones see Supplementary Table S4). There were 25 clones that had identical discrepancies in both formats (with and without stop codon). As the two formats were independently produced from the same source clone, this suggests that there may be mistakes in as many as 0.3% of MGC reference sequences.

Discussion

Starting from the MGC resource, we created protein expression ORF clones in two different formats for over 3400 human genes, HFLEX7000, making them the largest contribution of fully sequence verified ORF clones to the ORFeome Collaboration (www.Orfeomecollaboration.org). The selection criteria for this subset were based on a combination of publication records for the individual genes and their association with biological as well as human disease MeSH terms, as defined by two programs, MedGene and BioGene. We aimed to reflect within this subset a distribution similar to that present in MGC or the genome, and not to create a functionally or disease specific subset.

To assure the quality of this cDNA clone collection, we fully sequence verified all clones. By employing the appropriate formatted clone, users can add peptide tags to either end of the expressed protein or express protein without any additional amino acids at all. This is important for application reasons, e.g., for some proteins, the C-terminal amino acids may be important functionally (PDZ domain [18]) requiring a translation stop at the natural position, whereas for other proteins the natural N-terminus is relevant (e.g., signal peptides for membrane protein trafficking [19], [20]). Some applications exploit the use of fusion tags at the C-terminus as an experimental readout (e.g., yeast two hybrid [7]), or for capturing expressed proteins and confirming full length expression [21].

We targeted over 3500 unique genes and obtained a fully sequence validated ORF clone for 97% (>3400) of the genes. The strategy of selecting only one clonal isolate per gene for sequencing successfully yielded 90% acceptable clones. This success rate dropped to 80% when second isolates of the failed clones were sequenced, raising questions about the likelihood of success of sequencing additional isolates for clones that failed after two attempts. Also, capture efficiency, as measured by the number of colonies after transformation, was not a predictor of eventual clone success; clones with either high or low colony count numbers were equally likely to be rejected at subsequent steps.
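
The trade-off behind sequencing one isolate at a time can be made explicit with a short calculation. The sketch below is only an illustration under simplifying assumptions: the 90% and 80% rates are applied per attempted clone, every first-round failure gets exactly one second attempt, and the function name is hypothetical.

```python
def sequential_strategy(p_first, p_second):
    """Expected outcome of sequencing one isolate, then one more for failures.

    Returns (cumulative_yield, isolates_sequenced_per_target).
    """
    failed_first = 1 - p_first
    cumulative_yield = p_first + failed_first * p_second
    isolates_per_target = 1 + failed_first  # second isolate only for failures
    return cumulative_yield, isolates_per_target


if __name__ == "__main__":
    total, isolates = sequential_strategy(0.9, 0.8)
    print(f"expected yield {total:.2f} using {isolates:.1f} isolates per target")
```

Under these assumptions, two rounds reach an expected yield of about 98% while sequencing roughly 1.1 isolates per target, compared with 4 isolates per target if every available isolate were sequenced up front as in the pilot study.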

One set of troublesome ORFs identified during PCR and confirmed during sequencing revealed duplication of either the 5′ (near the ATG) or 3′ (near the stop codon) sequences used to design the PCR primer elsewhere in the clone. This led to inappropriate PCR priming and ultimately an inability to clone the gene. Any project using a similar strategy to convert MGC clones into ORF clones might find the same problems, and alternatives, e.g. restriction enzyme/ligation based or fragmented PCR cloning, should be considered for any such ORFs.

In summary, the clones from MGC provide an excellent resource for ORF clone production. The 97% success rate in producing fully sequence validated clones, of which 96% match perfectly to the template clone, underlines that this strategy is feasible in a cost effective manner. Together with our other human ORF sets, notably several hundred DNA binding proteins, over 500 kinases, and 1000 breast cancer associated genes, this much broader collection of 3500 genes will be of great benefit to the research community.

Materials and Methods

ORF-specific primer design and strategic 96-well plate organization

ORF sequences were parsed out in the FLEXGene LIMS from the information provided with each MGC clone, and primer sequences were designed using a nearest neighbor algorithm as described earlier [5], [10], [16], [17]. Natural stop codons were either normalized to TAG or replaced with TTG (Leu) in the final primer designs. In addition to ORF-specific, start and stop regions, the 5′ and 3′ primers included fixed sequences that correspond to partial att sequence-specific recombination recognition sites that flank the ORF in the resultant plasmid clones.
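
The normalization of ORF ends for the two formats can be sketched as below. This is a minimal illustration of the rule stated in the text (ATG start; TAG stop for the closed format, TTG for the fusion format), not the FLEXGene primer design code; the function name and the `fmt` labels are assumptions, and the fixed att sequences added to the primers are not modeled.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def normalize_orf(cds, fmt):
    """Return the CDS with normalized start and end codons.

    fmt: 'closed' keeps a stop codon (TAG); 'fusion' replaces it with TTG (Leu).
    """
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of three")
    if cds[-3:] not in STOP_CODONS:
        raise ValueError("CDS does not end in a stop codon")
    if fmt == "closed":
        last_codon = "TAG"
    elif fmt == "fusion":
        last_codon = "TTG"
    else:
        raise ValueError("fmt must be 'closed' or 'fusion'")
    # Replace the first codon with ATG and the natural stop with the format codon.
    return "ATG" + cds[3:-3] + last_codon
```

Calling `normalize_orf` once with each format on the same CDS gives the two sequences that the format-specific 3′ primers would encode.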

Details regarding amplification, purification and capture into a linear version of pDONR221 by In-Fusion™ reaction were previously published [16]. PCR success, measured as signal in agarose gels, and capture success, determined as colonies after transformation, were stored in the FLEXGene LIMS as reported elsewhere [10], [16], [17].

Clone isolation and production of glycerol stocks

Transformations into E. coli (DH5alpha, T1 resistant) were handled in 96-well plates, and robotically plated to 48-sector LB/agar dishes with the appropriate antibiotic selection and grown overnight at 37°C. Colonies were robotically visualized and counted, and single isolates from each sector were picked for inoculation into 1 mL growth media (LB/antibiotic) using a customized Megapix robot (Genetix), and 96-well culture blocks were grown overnight at 37°C. Inoculated cultures were assayed for growth via OD600 measurement as a measure of transformation efficiency, and aliquots were stored as 15% glycerol stocks in 96-well plate format.

Sequence reactions

High-throughput sequencing was carried out on an Applied Biosystems (ABI) capillary sequencer using dye-terminator and fluorescent cycle sequencing with don3 (TCTTGTGCAATGTAACATCAG) and don5 (CGTTAACGCTAGCATGGA) primers. Raw sequence data were automatically analyzed for quality, vector and repeat content using the pregap4 tool of the Staden Software Package [22]. Reads passing this initial quality control were automatically assembled (gap4 tool of the Staden Software Package). The primer walking method was used to finish insert sequencing, with primers automatically designed by PRIDE [23]. Sequencing was finished when an overall sequence quality of phred40 for the insert sequence and the vector-insert transition was achieved.
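
For orientation, a phred score Q corresponds to an estimated base-calling error probability of 10^(−Q/10), so phred40 means roughly one expected error in 10,000 bases. The sketch below shows this conversion and one possible way to apply a threshold; treating the criterion as a per-base minimum is an assumption, since the text speaks only of overall sequence quality, and both function names are hypothetical.

```python
def phred_to_error_probability(q):
    """Convert a phred quality score to an estimated error probability."""
    return 10 ** (-q / 10)


def meets_quality(base_qualities, threshold=40):
    """True if every base reaches the threshold (assumed interpretation)."""
    return bool(base_qualities) and all(q >= threshold for q in base_qualities)
```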

Sequence Analysis Software

After the clone sequence was assembled in the Staden package, the assembled sequences were verified using software developed in-house [24]. Clones with acceptable linker as well as CDS sequences were collected for distribution. Clones were not accepted if they had discrepancies leading to protein truncations, frame shifts, discrepancies in the linker regions, or more than two amino acid differences with the reference polypeptide. Sequences of all clones started and ended at the BsrGI restriction site (TGTACA) of the vector, allowing QC of in-frame analysis as well as intactness of att recombination sites. Only clones that failed the CDS region evaluation underwent BLAST search against all available GenBank records, and were re-evaluated using matching BLAST hits in pairwise alignments, allowing us to rescue ∼5% of the clones.
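
Trimming an assembled sequence to the window bounded by the vector BsrGI sites can be sketched as follows. This is a hypothetical helper, not the in-house software: it assumes the assembly is on the forward strand and that the outermost TGTACA occurrences are the vector sites, so any internal site within the ORF stays inside the window.

```python
BSRGI_SITE = "TGTACA"


def trim_to_bsrgi_window(assembled_sequence):
    """Return the sequence from the first to the last BsrGI site, inclusive.

    Returns None if fewer than two sites are found.
    """
    sequence = assembled_sequence.upper()
    first = sequence.find(BSRGI_SITE)
    last = sequence.rfind(BSRGI_SITE)
    if first == -1 or first == last:
        return None  # window cannot be defined without two flanking sites
    return sequence[first:last + len(BSRGI_SITE)]
```

The trimmed window would then contain both linker regions and the CDS, which are the parts evaluated for acceptance.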

Biological and Chemical MeSH Class Analysis. Complete BioGene analysis of biological and chemical MeSH class associations with genes in PubMed, using either all human genes (2004), unique genes in MGC (2004), or HFLEX7000 (targets). Numerical values and percentiles of each class associated with genes are shown (#, %). Relative MeSH Class associations in either MGC or HFLEX7000 to the genome, and in HFLEX7000 to MGC examine a potential bias in MGC or HFLEX7000 towards specific MeSH terms.