Patent US5840484 - Comparative gene transcript analysis

A method and system for quantifying the relative abundance of gene transcripts in a biological sample. One embodiment of the method generates high-throughput sequence-specific analysis of multiple RNAs or their corresponding cDNAs (gene transcript imaging analysis). Another embodiment of the method produces a gene transcript imaging analysis by the use of high-throughput cDNA sequence analysis. In addition, the gene transcript imaging can be used to detect or diagnose a particular biological state, disease, or condition which is correlated to the relative abundance of gene transcripts in a given cell or population of cells. The invention provides a method for comparing the gene transcript image analysis from two or more different biological samples in order to distinguish between the two samples and identify one or more genes which are differentially expressed between the two samples.

isolating a representative population of the cDNA copies and producing therefrom a first cDNA library, wherein a selected set of random primers is used in the generation of the first cDNA library;

identifying a first set of gene transcripts from the first library and determining cDNA sequences corresponding to the gene transcripts;

processing the cDNA sequences corresponding to the first gene transcripts in a programmed computer in which a database of reference transcript sequences indicative of reference cDNA sequences is stored, to generate a first identified sequence value for each of the first gene transcripts, where each said identified sequence value is indicative of a sequence annotation and a degree of match between one of the first gene transcripts and at least one of the reference cDNA sequences; and

processing each said identified sequence value to generate first final data values indicative of a number of times each first identified sequence value is present in the first cDNA library.

2. The method of claim 1, wherein the first mixture of mRNA is obtained from a human cell.

3. The method of claim 1, wherein the mixture of mRNA is obtained from a combination of two or more samples.

isolating a representative population of cDNA copies and producing therefrom a second cDNA library, wherein a selected set of random primers is used in the generation of the second cDNA library;

identifying a second set of gene transcripts from the second library and determining cDNA sequences corresponding to the gene transcripts;

processing the cDNA sequences corresponding to the second gene transcripts in a programmed computer in which a database of reference transcript sequences indicative of reference biological sequences is stored, to generate a second identified sequence value for each of the second gene transcripts, where each said second identified sequence value is indicative of a sequence annotation and a degree of match between one of the second gene transcripts and at least one of the reference gene transcripts; and

processing each said second identified sequence value to generate second final data values indicative of a number of times each second identified sequence value is present in the second cDNA library.

6. The method of claim 5, wherein the second library comprises at least 5,000 cDNAs.

7. The method of claim 5, further comprising:

processing the first final data values and the second final data values to generate ratio values of gene transcripts, each of said ratio values indicative of differences in numbers of gene transcripts between the first mixture of mRNA and the second mixture of mRNA.

8. The method of claim 7, further comprising:

subtracting the first final data values from the second final data values to identify one or more genes that are differentially expressed.

9. The method of claim 7, wherein the first mixture of mRNA is obtained from a sample extracted from a healthy human patient and the second mixture of mRNA is obtained from a sample extracted from an unhealthy human patient.

10. The method of claim 1, wherein the first library comprises at least 5,000 cDNAs.

11. A method of producing a gene transcript image, comprising the steps of

obtaining a mixture of mRNA from a biological specimen;

making cDNA copies of the mRNA, wherein the cDNA copies of mRNA are made using a selected set of random primers;

inserting the cDNA copies into a suitable vector and transfecting suitable host strain cells with the vector and growing clones, each clone representing a unique mRNA;

isolating a representative population of at least 5,000 recombinant clones;

identifying amplified cDNAs from each clone in the population by a sequence-specific method which identifies a gene from which the unique mRNA was transcribed;

determining a number of times each gene is represented within the population of clones as an indication of relative abundance; and

listing genes and their relative abundance in order of abundance, thereby producing a gene transcript image.

The present invention is in the field of molecular biology; more particularly, the present invention describes methods of high-throughput cDNA sequencing and transcript analysis.

BACKGROUND OF THE INVENTION

For the convenience of the reader, the references referred to in the text are listed numerically in parentheses. These numbers correspond to the numerical references listed in the appended bibliography. These references are hereby expressly incorporated by reference herein.

Nucleic acids (DNA and RNA) carry within their structure the hereditary information and are therefore the prime molecules of life. Nucleic acids are found in all living organisms including bacteria, fungi, viruses, plants and animals. It is of interest to determine the relative abundance of nucleic acids in different cells, tissues and organisms over time under various conditions, treatments and regimes.

It is estimated that the 23 pairs of human chromosomes encode approximately 100,000 genes. All dividing cells in the body contain the same set of 23 pairs of chromosomes. The differences between different types of cells can be accounted for by the differential expression of the 100,000 or so genes found on the same 23 chromosomes. Many of the most fundamental questions of biology could be answered by a simple understanding of which genes are expressed and at what relative abundance in different cells.

Previously, the art has only provided for the analysis of a few known genes at a time by standard molecular biology techniques such as PCR, northern blot analysis, or other types of DNA probe analysis such as in situ hybridization. Each of these methods allows one to analyze the expression of only known genes or small numbers of genes at a time. (1-12)

Studies of the number and types of genes whose synthesis is induced or otherwise regulated during developmental processes such as cell activation, differentiation, aging, viral transformation, morphogenesis, and division have been pursued for many years, using a variety of methodologies. One of the earliest methods was to compare the proteins made in a given cell, tissue, organ system, or even organism both prior to and subsequent to the differentiation process of interest. Such comparisons were typically made using 2-dimensional gel electrophoresis, wherein each protein could be, in principle, identified and quantified as a discrete signal. In order to positively identify each signal, each discrete signal must be excised from the membrane and subjected to protein sequence analysis using Edman degradation. Unfortunately, most of the signals were present in quantities too small to obtain a reliable sequence, and many of those signals contained more than one discrete protein. An additional difficulty is that many of the proteins were blocked at the amino-terminus, further complicating the sequencing process.

Analyzing differentiation at the gene transcription level has overcome many of these disadvantages and drawbacks, since the power of recombinant DNA technology allows amplification of signals containing very small amounts of material. The most common method, called "hybridization subtraction", involves preparation of mRNA from the biological sample before (B) and after (A) the developmental process of interest, subtracting sample B from sample A by hybridization, and construction of a cDNA library from the non-hybridizing mRNA fraction of sample A. Many different groups have used this strategy successfully, and a variety of procedures have been published and improved upon using this same basic scheme (1-12).

All of these techniques have particular strengths and weaknesses; however, these methods still have limitations and undesirable aspects. First, the time and effort required to construct such libraries is quite large. Typically, a trained molecular biologist might expect construction and characterization of such a library to require 3 to 6 months, depending on the level of skill, experience, and luck. Second, the resulting subtraction libraries are typically inferior to the libraries constructed by standard methodology. A typical conventional cDNA library should have a clone complexity of at least 10^6 clones, and an average insert size of 1-3 kb. In contrast, subtracted libraries can have complexities of 10^2 or 10^3 and average insert sizes of 0.2 kb. Therefore, there can be a significant loss of clone and sequence information associated with such libraries. Third, this approach allows the researcher to capture only the genes induced in sample A relative to sample B, not vice versa, nor does it easily allow comparison to a third sample of interest (C). Fourth, this approach requires very large amounts (hundreds of micrograms) of "driver" mRNA (sample A), which significantly limits the number and type of subtractions that are possible since many tissues and cells are very difficult to obtain in large quantities.

Fifth, the resolution of the subtraction is dependent upon the physical properties of DNA:DNA or RNA:DNA hybridization. The ability of a given sequence to find a hybridization match is dependent on its unique CoT value, which is in turn a function of the number of copies (concentration) of the particular sequence, multiplied by the time of hybridization. It follows that for sequences which are abundant, hybridization events will occur very rapidly (low CoT value), while rare sequences will form duplexes only at very high CoT values. Unfortunately, the rare genes, or those present at abundances of about one transcript in 10^4 to 10^7, tend to be the sequences in which an investigator would be most interested. CoT values which allow such rare sequences to form duplexes are difficult to achieve in a convenient time frame; therefore, hybridization subtraction is simply not a useful technique with which to study relative levels of rare mRNA species. Sixth, this problem is further complicated by the fact that duplex formation is also dependent on the nucleotide base composition of a given sequence. Those sequences rich in G+C form stronger duplexes than those with high contents of A+T; therefore, the former sequences will tend to be removed selectively by hybridization subtraction. Seventh, it is possible that hybridization between non-exact matches can occur. When this happens, the expression of a homologous gene may "mask" expression of a gene of interest, artificially skewing the results for that particular gene.
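The dependence on CoT described above can be illustrated with a short Python sketch. It assumes ideal second-order reassociation kinetics, in which the fraction of a sequence remaining single-stranded after time t is 1 / (1 + k * C0 * t). The rate constant `k`, the concentration units, and the function names are hypothetical and chosen only for illustration.

```python
def fraction_single_stranded(c0, t, k=1.0):
    """Fraction of a sequence still single-stranded after time t.

    Assumes ideal second-order kinetics; c0 is the starting concentration
    of that particular sequence and k is an illustrative rate constant.
    """
    return 1.0 / (1.0 + k * c0 * t)


def half_time(c0, k=1.0):
    """Time at which half of the sequence has formed duplexes (C0 * t = 1 / k)."""
    return 1.0 / (k * c0)


if __name__ == "__main__":
    # Relative concentrations of an abundant and a rare sequence (arbitrary units).
    for label, c0 in [("abundant (1e-2)", 1e-2), ("rare (1e-6)", 1e-6)]:
        print(f"{label}: half-reaction time = {half_time(c0):.3g} time units")
```

Under these assumptions the half-reaction time scales inversely with the concentration of each sequence, which is why a rare sequence needs a far longer hybridization than an abundant one.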

The present invention has none of the drawbacks of the prior art. The present invention avoids these problems by providing a method to quantify the relative abundance of multiple gene transcripts in a given biological sample by the use of high-throughput sequence-specific analysis of individual RNAs or their corresponding cDNAs.

The present invention offers several advantages over current protein discovery methods which attempt to isolate individual proteins based upon biological effects. The method of the instant invention provides for detailed comparisons of cell profiles revealing numerous changes in the expression of individual transcripts.

The instant invention provides several advantages over previous subtraction methods, including a more complete library analysis (10^6 to 10^7 clones as compared to 10^3 clones), which allows identification of low abundance messages as well as enabling the identification of messages which either increase or decrease in abundance. These large libraries are very routine to make, in contrast to the libraries of previous methods. In addition, homologues can easily be distinguished with the method of the instant invention.

High resolution maps of gene expression can be used directly as a diagnostic profile or to identify disease-specific genes for the development of more classic diagnostic approaches.

This process is defined as gene transcript frequency analysis. The resulting quantitative analysis of the gene transcripts is defined as comparative gene transcript analysis.

SUMMARY OF THE INVENTION

The invention is a method of analyzing a library of biological sequences comprising the steps of (a) producing a library of biological sequences; (b) generating a set of data values, where each of the data values in said set is indicative of a different one of the biological sequences of the library; (c) processing the data values in a programmed computer in which a data base of reference data values indicative of reference sequences is stored, to generate an identified sequence value for each of the data values, where each said identified sequence value is indicative of a degree of match between a different one of the biological sequences of the library and at least one of the reference sequences; and (d) processing each said identified sequence value to generate final data values indicative of the number of matches between the biological sequences of the library and ones of the reference sequences.

In a further embodiment, the method includes producing a gene transcript image analysis, by (a) isolating an mRNA population from a biological sample; (b) identifying genes from which the mRNA was transcribed by a sequence-specific method; (c) determining the numbers of mRNA transcripts corresponding to each of the genes; and (d) using the mRNA transcript numbers to determine the relative abundance of mRNA transcripts within the population of mRNA transcripts, where data determining the relative abundance values of mRNA transcripts is the gene transcript image analysis.

In a further embodiment, the relative abundance of the gene transcripts is determined by comparing the gene transcript numbers of genes in a single cell type or alternatively in different cell types.

In a further embodiment, the invention includes a system for analyzing a library of biological sequences including a means for receiving a set of data values, where each of the data values is indicative of a different one of the biological sequences of the library; and a means for processing the data values in a computer system in which a data base of reference data values indicative of reference sequences is stored, wherein the computer is programmed with software for generating an identified sequence value for each of the data values, where each said identified sequence value is indicative of a degree of match between a different one of the biological sequences of the library and at least one of the reference sequences, and for processing each said identified sequence value to generate final data values indicative of the number of matches between the biological sequences of the library and ones of the reference sequences.

In a further embodiment, a first value of the degree of match is indicative of an exact match, and a second value of said degree of match is indicative of a non-exact match.

In essence, the invention is a method and system for quantifying the relative abundance of gene transcripts in a biological sample. The invention provides a method for comparing the gene transcript image analysis from two or more different biological samples in order to distinguish between the two samples and identify one or more genes which are differentially expressed between the two samples. One embodiment of the method generates high-throughput sequence-specific analysis of multiple RNAs or their corresponding cDNAs: gene transcript imaging analysis. Another embodiment of the method produces the gene transcript imaging analysis by the use of high-throughput cDNA sequence analysis. In addition, the gene transcript imaging can be used to detect or diagnose a particular biological state, disease, or condition which is correlated to the relative abundance of gene transcripts in a given cell or population of cells.
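The comparison of two gene transcript images can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the software of the invention: the function name `compare_images`, the input format (a mapping from gene name to clone count for each library), and the choice to report both a difference and a ratio of relative abundances are all hypothetical. Counts are converted to relative abundances first so that libraries of different sizes can be compared.

```python
def compare_images(first_counts, second_counts):
    """Compare two transcript images given as {gene: clone count} mappings.

    Returns rows of (gene, first_freq, second_freq, difference, ratio),
    sorted by the size of the difference. The ratio is None when the gene
    was not observed in the first library.
    """
    n1 = sum(first_counts.values())
    n2 = sum(second_counts.values())
    if n1 == 0 or n2 == 0:
        raise ValueError("both libraries must contain at least one clone")

    rows = []
    for gene in sorted(set(first_counts) | set(second_counts)):
        f1 = first_counts.get(gene, 0) / n1
        f2 = second_counts.get(gene, 0) / n2
        difference = f2 - f1                      # second minus first
        ratio = f2 / f1 if f1 > 0 else None       # undefined if absent from first
        rows.append((gene, f1, f2, difference, ratio))

    # Genes with the largest change in relative abundance come first.
    rows.sort(key=lambda row: abs(row[3]), reverse=True)
    return rows


if __name__ == "__main__":
    normal = {"geneA": 50, "geneB": 50}                 # made-up counts
    activated = {"geneA": 25, "geneB": 50, "geneC": 25}  # made-up counts
    for row in compare_images(normal, activated):
        print(row)
```

A gene that appears only in the second library has no defined ratio here; a pseudocount could be used instead, at the cost of introducing an arbitrary parameter.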

In a class of embodiments, the invention is a method for producing a set ("library") of biological sequences. Biological sequences herein defined include: DNA, RNA, cDNA, proteins, amino acids, carbohydrates and the like. The method includes generating a set of data values, each of said data values indicative of a different one of the biological sequences of the library; processing the data values in a programmed computer, in which a data base of reference data values (indicative of reference sequences) is stored, to identify each sequence (i.e., generate an identified sequence value for each of the data values, where each identified sequence value is indicative of a degree of match between a different one of the biological sequences of the library and at least one of the reference sequences); and processing the identified sequence values to generate final data values (typically, a sorted list of identified sequences and corresponding abundance values) indicative of the number of matches between the sequences of the library and ones of the reference sequences.
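As an illustration of how an identified sequence value might pair a sequence annotation with a degree of match, the following Python sketch compares each library sequence against a small in-memory reference set. The names `IdentifiedSequence`, `identity`, and `identify`, the 0.8 threshold, and the ungapped position-by-position comparison are assumptions made for this example only; a practical system would rely on a proper sequence database search rather than this toy comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentifiedSequence:
    clone_id: str
    annotation: str   # name of the best-matching reference, or "unknown"
    match: str        # "exact", "non-exact", or "none"


def identity(a, b):
    """Fraction of identical positions over the shorter of two sequences."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / n


def identify(clone_id, seq, references, threshold=0.8):
    """Return the annotation and degree of match for one library sequence."""
    best_name, best_score = "unknown", 0.0
    for name, ref in references.items():
        score = identity(seq, ref)
        if score > best_score:
            best_name, best_score = name, score
    if best_score == 1.0:
        return IdentifiedSequence(clone_id, best_name, "exact")
    if best_score >= threshold:
        return IdentifiedSequence(clone_id, best_name, "non-exact")
    return IdentifiedSequence(clone_id, "unknown", "none")


if __name__ == "__main__":
    refs = {"geneA": "ATGGCCAAGT", "geneB": "TTGACCGGTA"}  # made-up sequences
    for cid, s in [("c1", "ATGGCCAAGT"), ("c2", "ATGGCCTAGT"), ("c3", "CCCCCCCCCC")]:
        print(identify(cid, s, refs))
```

Counting how often each annotation occurs among the identified sequence values then gives the final data values from which a sorted abundance list can be built.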

Table 2 is a list of isolates from the HUVEC cDNA library, arranged according to abundance, from U.S. patent application Ser. No. 08/137,951, filed Oct. 14, 1993, which is hereby incorporated by reference. The column labeled "number" refers to the sequence number in the Sequence Listing, i.e., the HUVEC sequence identification number. Isolates that have not been sequenced are not present in the Sequence Listing but are indicated in Table 2. The column labeled "l" refers to the library from which the cDNA clone was isolated: "H" for HUVEC cells. The column labeled "d" (an abbreviation for designation) contains a letter code indicating the general class of the sequence. The letter code, as presented in Table 1, is as follows: N-no homology to previously identified nucleotide sequences, E-exact match to a previously identified nucleotide sequence, U-the sequence of the isolate has not been determined, M-mitochondrial DNA sequence, O-homologous, but not identical to a previously identified nucleotide sequence, H-homologous, but not identical to a previously identified human gene, R-repetitive DNA sequence, V-vector sequence only, S-sequence not yet determined, I-matches an Incyte clone (part of an assemblage), X-matches an EST, and A-a poly A tract. The column labeled "f" refers to the distribution of the gene product encoded by the cDNA. The letter code as presented in Table 1 is as follows: C-non-specific, P-cell/tissue specific and U-unknown. The column labeled "z" refers to the cellular localization of the gene product encoded by the cDNA. The letter code is indicated in Table 1. The column labeled "r" refers to function of the gene product encoded by the cDNA. The letter code for the "r" column is presented in Table 1. The column labeled "c" refers to the certainty of the identification of the clone. The column labeled "entry" gives the NIH GENBANK locus name, identifying a nucleotide sequence homologous to the indicated sequence number. The column labeled "descriptor" provides a plain English explanation of the identity of the sequence corresponding to the NIH GENBANK locus name in the "entry" column. The "descriptor" column also indicates when unreadable sequence was present or when templates were skipped.

Table 3 is a comparison of the top 15 most abundant gene transcripts in normal and activated macrophage cells.

Table 4 is a list of the top 15 activated cDNAs in activated macrophage cells compared to normal cells as determined by library subtraction.

FIG. 1 is a diagram representing the sequence of operations performed by "abundance sort" software in a class of preferred embodiments of the inventive method.

FIG. 2 is a block diagram of a preferred embodiment of the system of the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention provides a method to quantify the relative abundance of gene transcripts in a given biological sample by the use of high-throughput sequence-specific analysis of individual RNAs or their corresponding cDNAs (or alternatively, of data representing other biological sequences). This process is denoted herein as gene transcript imaging. The quantitative analysis of the relative abundance for a given gene transcript or set of gene transcripts is denoted herein as "gene transcript analysis" (or "gene transcript imaging analysis" or "gene transcript frequency analysis"). The present invention allows one to obtain a profile for gene transcription in any given population of cells or tissue from any type of organism. The invention can be applied to obtain a profile of a sample consisting of a single cell (or clones of a single cell), or of many cells, or of tissue more complex than a single cell.

For example, gene transcript frequency analysis can be used to differentiate tumor cells from normal cells or activated macrophages from inactivated macrophages.

In an alternative embodiment, gene transcript frequency analysis is used to differentiate between cancer cells which respond to anti-cancer agents and those which do not respond. Potential anti-cancer agents include tamoxifen, vincristine, vinblastine, podophyllotoxins, etoposide, teniposide, cisplatin, biologic response modifiers such as interferon, IL-2, GM-CSF, enzymes, hormones and the like.

In yet another embodiment, gene transcript frequency analysis is used to differentiate between liver cells isolated from patients treated and untreated with FIAU.

In yet another embodiment, gene transcript frequency analysis is used to differentiate between brain tissue from patients treated and untreated with lithium.

In a further embodiment, gene transcript frequency analysis is used to differentiate between cyclosporin and/or FK506 treated cells.

In a further embodiment, gene transcript frequency analysis is used to differentiate between virally infected human cells (including HIV-infected cells) and uninfected human cells. Gene transcript frequency analysis is also used to compare HIV resistant cells to infected or HIV sensitive cells.

In a further embodiment, gene transcript frequency analysis is used to differentiate between bronchial lavage fluids from healthy and unhealthy patients.

In a further embodiment, gene transcript frequency analysis is used to differentiate between cell, plant, microbial and animal mutants and wild-type species. Such mutants could be deletion mutants which do not produce a gene product and/or point mutants which produce a less abundant message and could include mineral nutrition, metabolism, biochemical and pharmacological mutants isolated by means known to those skilled in the art.

In a further embodiment, gene transcript frequency analysis is used for an interspecies comparative analysis which would allow for the selection of better animal models. In this embodiment, human and animal (such as a mouse) cells are treated with a specific test agent. The relative sequence abundance of each cDNA population is determined. If the animal test system is a good model, the homologous genes should change expression similarly. If side effects are detected with the drug, a detailed transcript abundance analysis will be performed. Models will be selected by basic physiological changes.

In a further embodiment, gene transcript frequency analysis is used in a clinical setting to give a specific gene transcript profile of a patient from a patient sample (for example, where the patient sample is a blood sample). In particular, gene transcript frequency analysis is used to give a high resolution gene expression profile of a diseased state or condition.

In a further embodiment, gene transcript frequency analysis is used in a motif analysis to look for specific regions of proteins of interest. Such proteins include specific cell surface or membrane receptors, transcription factors and the like.

In essence, the method utilizes high-throughput cDNA sequencing to identify specific transcripts of interest. The generated cDNA and deduced amino acid sequences are then extensively compared with GENBANK and other sequence data banks as described below. The method offers several advantages over current protein discovery methods which try to isolate individual proteins based on biological effect. Here, detailed comparisons of activated and inactivated cell profiles reveal numerous changes in the expression of individual transcripts. After it is determined whether the sequence is an exact match, a similar sequence or entirely dissimilar, the sequence is entered into a data base. Next, the numbers of copies of cDNA corresponding to particular genes are tabulated. Although this can be done by hand from a printout of all entries, a computer program is a useful way to tabulate this information. The numbers of copies are divided by the total number of sequences in the data set, to obtain a relative abundance of transcripts for each corresponding gene. The list of represented genes can then be sorted by abundance in the cDNA population. A multitude of additional types of comparisons or dimensions are possible and described below in detail.
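The tabulation step described above (count the copies per gene, divide by the total number of sequences, sort by abundance) can be sketched in a few lines of Python. The function name `transcript_image` and the input format, a list with one gene identification per sequenced clone, are assumptions for illustration and are not taken from the software described in this document.

```python
from collections import Counter


def transcript_image(gene_calls):
    """Tabulate clone identifications into a sorted abundance list.

    gene_calls holds one gene name per sequenced clone. Returns rows of
    (gene, copies, relative_abundance), most abundant first; ties are
    broken alphabetically so the output is deterministic.
    """
    counts = Counter(gene_calls)
    total = sum(counts.values())
    rows = [(gene, n, n / total) for gene, n in counts.items()]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


if __name__ == "__main__":
    # Made-up identifications for six clones.
    calls = ["geneA", "geneB", "geneA", "geneC", "geneA", "geneB"]
    for gene, copies, freq in transcript_image(calls):
        print(f"{gene}\t{copies}\t{freq:.3f}")
```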

An alternate method of producing a gene transcript image includes the steps of obtaining a mixture of test mRNA and providing a representative array of unique probes whose sequences are complementary to at least some of the test mRNAs. Next, a fixed amount of the test mRNA is added to the arrayed probes. The test mRNA is incubated with the probes for a sufficient time to allow hybrids of the test mRNA and probes to form. The mRNA-probe hybrids are detected and the quantity determined. The hybrids are identified by their location in the probe array. The quantity of each hybrid is summed to give a population number. Each hybrid quantity is divided by the population number to provide a set of relative abundance data termed a gene transcript image analysis.
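The normalization used in this array-based alternative can be sketched as follows. The function name `array_transcript_image` and the input format, a mapping from probe location to measured hybrid quantity, are hypothetical; the signal values are made up for the example.

```python
def array_transcript_image(hybrid_quantity):
    """Convert per-probe hybrid quantities into relative abundances.

    hybrid_quantity maps a probe location (or probe name) to the detected
    quantity of mRNA-probe hybrid. Each quantity is divided by the sum of
    all quantities (the population number).
    """
    population = sum(hybrid_quantity.values())
    if population == 0:
        raise ValueError("no hybrid signal detected on the array")
    return {probe: q / population for probe, q in hybrid_quantity.items()}


if __name__ == "__main__":
    signal = {"A1": 30.0, "A2": 10.0, "B1": 0.0}  # made-up measurements
    print(array_transcript_image(signal))
```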

I CONSTRUCTION OF cDNA LIBRARIES

The human lymphoma U-937 cDNA library is commercially available from Stratagene (catalogue #937207; Stratagene, 11099 N. Torrey Pines Rd., La Jolla, Calif. 92037). The Stratagene library was prepared by Stratagene essentially as described. It was prepared by purifying poly(A+)RNA (mRNA) from U-937 cells and then enzymatically synthesizing double-stranded complementary DNA (cDNA) copies of the mRNA by priming with oligo dT. Synthetic adapter oligonucleotides were ligated onto the ends of the cDNA, enabling its insertion into the lambda vector. The U-937 library was constructed using the Uni-ZAP™ vector system (Stratagene), allowing high efficiency unidirectional (sense orientation) lambda library construction and the convenience of a plasmid system with blue/white color selection to detect clones with cDNA insertions.

The human monocyte THP-1 cDNA library was custom constructed by Stratagene (Stratagene, 11099 N. Torrey Pines Rd., La Jolla, Calif. 92037). Poly(A+)RNA (mRNA) was purified from THP-1 cells (cultured 48 hr with 100 nM TPA and 4 hr with 1 μg/ml LPS). cDNA synthesis was primed separately with both oligo dT and random hexamers and the two cDNA copies were treated separately. Synthetic adapter oligonucleotides were ligated onto cDNA ends, enabling its insertion into the Uni-ZAP™ vector system (Stratagene), allowing high efficiency unidirectional (sense orientation) lambda library construction and the convenience of a plasmid system with blue/white color selection to detect clones with cDNA insertions. Finally, the two libraries were combined into a single library by mixing equal numbers of bacteriophage.

The human endothelial cell (HUVEC) cDNA library was custom constructed by Stratagene (Stratagene, 11099 N. Torrey Pines Rd., La Jolla, Calif. 92037). Poly(A+)RNA (mRNA) was purified separately from the two batches of induced HUVEC cells. cDNA synthesis was also separated into the two batches, primed separately with both oligo dT and random hexamers. Synthetic adaptor oligonucleotides were ligated onto cDNA ends, enabling its insertion into the Uni-ZAP™ vector system (Stratagene), allowing high efficiency unidirectional (sense orientation) lambda library construction and the convenience of a plasmid system with blue/white color selection to detect clones with cDNA insertions.

The human mast cell HMC-1 cDNA library was custom constructed by Stratagene (Stratagene, 11099 N. Torrey Pines Rd., La Jolla, Calif. 92037) using mRNA purified from cultured HMC-1 cells. The library was prepared by Stratagene essentially as described. The human mast cell (HMC-1) cDNA library was prepared by purifying poly(A+)RNA (mRNA) from human mast cells and then enzymatically synthesizing double-stranded complementary DNA (cDNA) copies of the mRNA. Synthetic adaptor oligonucleotides were ligated onto the ends of the cDNA, enabling its insertion into the lambda vector. The HMC-1 library was constructed using the Uni-ZAP™ vector system (Stratagene), allowing high efficiency unidirectional (sense orientation) lambda library construction and the convenience of a plasmid system with blue/white color selection to detect clones with cDNA insertions.

The THP-1, U-937, HUVEC and HMC-1 cDNA libraries can be screened with either DNA probes or antibody probes and the pBluescript® phagemid (Stratagene) can be rapidly excised in vivo. The phagemid allows the use of a plasmid system for easy insert characterization, sequencing, site-directed mutagenesis, the creation of unidirectional deletions and expression of fusion proteins. The custom-constructed library phage particles were infected into E. coli host strain XL1-Blue® (Stratagene), which has a high transformation efficiency, increasing the probability of obtaining rare, under-represented clones in the cDNA library.

Besides the Uni-ZAP™ vector system by Stratagene disclosed herein, it is now believed that other similarly unidirectional vectors also can be used. For example, it is believed that such vectors include but are not limited to DR2 (Clontech) and HXLOX (U.S. Biochemical).

For inter-library comparisons, the libraries must be prepared in similar manners. Certain parameters appear to be particularly important to control for. One such parameter is the method of isolating mRNA. It is important to remove DNA and heterogeneous nuclear RNA under the same conditions. Size fractionation of cDNA must be carefully controlled. The same vector preferably should be used for preparing libraries to be compared. At the very least, the same type of vector (e.g., unidirectional vector) should be used to assure a valid comparison. A unidirectional vector may be preferred because it is easier to analyze the output. However, with a unidirectional vector, there is dropout from the wrong direction ligations.

It is preferred to prime only with an oligo dT unidirectional primer in order to obtain only one clone per mRNA transcript when obtaining a transcript image. However, it is recognized that employing a mixture of dT and random primers can also be advantageous because such a mixture affords more freedom when gene discovery also is a goal. Experiments have indicated that no obvious bias is introduced when random primers are employed. Similar effects can be obtained with DR2 from Clontech, HXLOX (US Biochemical) and also with vectors from Invitrogen and Novagen. These vectors have two requirements. First, there must be primer sites for commercially available primers such as T3 or M13 reverse primers. Second, the vector must accept inserts up to 10 kb.

It also is important to sample randomly a significant population of clones. Data have been generated with 5,000 clones; however, if very rare genes are to be obtained and/or their relative abundance determined, as many as 100,000 clones may need to be sampled.
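The effect of sampling depth on the detection of rare transcripts can be illustrated with a short calculation. The sketch below assumes, as a simplification not stated in the text, that each sampled clone is an independent random draw and that a given transcript makes up a fixed fraction `f` of the library. The function names and the 1-in-10,000 abundance are hypothetical and chosen only for illustration.

```python
import math

def prob_detect(f, n_clones):
    """Probability of seeing at least one clone of a transcript that makes up
    fraction f of the library when n_clones are sampled independently."""
    return 1.0 - (1.0 - f) ** n_clones

def clones_needed(f, confidence):
    """Smallest number of clones giving at least `confidence` probability of
    seeing the transcript once (same independence assumption)."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - f))

# Hypothetical rare transcript: 1 clone in 10,000.
rare = 1e-4
p_5k = prob_detect(rare, 5_000)      # sampling depth used for the reported data
p_100k = prob_detect(rare, 100_000)  # depth suggested for very rare genes
n_95 = clones_needed(rare, 0.95)     # depth for a 95% chance of one observation
```

Under these assumptions, a transcript at 1 in 10,000 is more likely to be missed than found in 5,000 clones (probability of about 0.39) and is almost certain to appear at least once in 100,000 clones.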

The examples below are provided to illustrate the subject invention. These examples are provided by way of illustration and are not included for the purpose of limiting the invention.

II ISOLATION OF cDNA CLONES

The phagemid forms of individual cDNA clones were obtained by the in vivo excision process, in which the host bacterial strain was coinfected with both the lambda library phage and an f1 helper phage. Proteins derived from both the library-containing phage and the helper phage nicked the lambda DNA, initiated new DNA synthesis from defined sequences on the lambda target DNA and created a smaller, single-stranded circular phagemid DNA molecule that included all DNA sequences of the pBluescript® plasmid and the cDNA insert. The phagemid DNA was secreted from the cells and purified, then used to re-infect fresh host cells, where the double-stranded phagemid DNA was produced. Because the phagemid carries the gene for β-lactamase, the newly transformed bacteria were selected on medium containing ampicillin.

Phagemid DNA was also purified using the QIAwell-8 Plasmid Purification System from QIAGEN® (QIAGEN Inc., 9259 Eton Ave., Chatsworth, Calif. 91311). This product line provides a convenient, rapid and reliable high-throughput method for lysing the bacterial cells and isolating highly purified phagemid DNA using QIAGEN anion-exchange resin particles with EMPORE™ membrane technology from 3M in a multiwell format. The DNA was eluted from the purification resin already prepared for DNA sequencing and other analytical manipulations.

III SEQUENCING OF cDNA CLONES

The cDNA inserts from random isolates of the U-937 and THP-1 libraries were sequenced in part. Methods for DNA sequencing are well known in the art. Conventional enzymatic methods employ DNA polymerase Klenow fragment, Sequenase™ or Taq polymerase to extend DNA chains from an oligonucleotide primer annealed to the DNA template of interest. Methods have been developed for the use of both single- and double-stranded templates. The chain termination reaction products are usually electrophoresed on urea-acrylamide gels and are detected either by autoradiography (for radionuclide-labeled precursors) or by fluorescence (for fluorescent-labeled precursors). Recent improvements in mechanized reaction preparation, sequencing and analysis using the fluorescent detection method (such as the Applied Biosystems 373 DNA sequencer and Catalyst 800) have permitted expansion in the number of sequences that can be determined per day. Currently constructing 5000 clone libraries and randomly selecting about 2,000-2,500 clones with 30-50% usable,

Using the nucleotide sequences derived from the cDNA clones as query sequences (sequences of a Sequence Listing), databases containing previously identified sequences are searched for areas of homology (similarity). Examples of such databases include Genbank and EMBL. We next describe examples of two homology search algorithms that can be used, and then describe the subsequent computer-implemented steps to be performed in accordance with preferred embodiments of the invention.

In the following description of the computer-implemented steps of the invention, the word "library" denotes a set (or population) of biological sample sequences. A "library" can consist of cDNA sequences, RNA sequences, protein sequences, or the like, which characterize a biological sample. The biological sample can consist of cells of a single human cell type (or can be any of the other above-mentioned types of samples). We contemplate that the sequences in a library have been determined so as to accurately represent or characterize a biological sample (for example, they can consist of representative cDNA sequences from clones of a single human cell).

In the following description of the computer-implemented steps of the invention, the expression "data base" denotes a set of stored data which represent a collection of sequences, which in turn represent a collection of biological reference materials. For example, a data base can consist of data representing many stored cDNA sequences which are in turn representative of human cells infected with various viruses, human cells of various ages, cells from various species of mammals, and so on. For another example, a data base can consist of data representing many stored protein sequences which are representative of human cells infected with various viruses, human cells of various ages, cells from various species of mammals, and so on.

In preferred embodiments, the invention employs a computer programmed with software (to be described) for performing the following steps:

(a) processing data indicative of a library of cDNA sequences (generated as a result of high-throughput cDNA sequencing) to determine whether each sequence in the library matches a cDNA sequence of a data base of cDNA sequences (and if so, identifying the data base entry which matches the sequence);

(b) for some or all entries of the data base, tabulating the number of sequences of the library which match each such entry (although this can be done by human hand from a printout of all entries, we prefer to perform this step using computer software to be described below), thereby generating a set of "abundance numbers"; and

(c) dividing each abundance number by the total number of sequences in the library, to obtain a relative abundance number for each data base entry.
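Steps (b) and (c) above can be sketched in a few lines. This is a minimal illustration, not the Appendix A program: it assumes step (a) has already produced a list of (clone, matched entry) pairs, with `None` marking a clone that matched no data base entry. The function names, clone identifiers, and gene names are hypothetical.

```python
from collections import Counter

def abundance_numbers(identified):
    """Step (b): count the library sequences that match each data base entry.
    `identified` is a list of (clone_id, entry) pairs; entry is None when the
    clone matched nothing in the data base."""
    return Counter(entry for _, entry in identified if entry is not None)

def relative_abundance(counts, library_size):
    """Step (c): divide each abundance number by the total number of
    sequences in the library."""
    return {entry: n / library_size for entry, n in counts.items()}

# Hypothetical output of step (a) for a five-clone library.
identified = [
    ("c1", "actin"),
    ("c2", "ferritin"),
    ("c3", "actin"),
    ("c4", None),       # no match in the data base
    ("c5", "actin"),
]
counts = abundance_numbers(identified)
rel = relative_abundance(counts, len(identified))
# Entries sorted by decreasing relative abundance.
ranked = sorted(rel.items(), key=lambda kv: kv[1], reverse=True)
```

Note that the denominator is the full library size, including clones with no match, which follows the wording of step (c).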

The list of represented data base entries (or genes corresponding thereto) can then be sorted by abundance in the cDNA population. A multitude of additional types of comparisons or dimensions are possible.

For example (to be described below in greater detail), steps (a) and (b) can be repeated for two different libraries (sometimes referred to as a "target" library and a "subtractant" library). Then, for each data base entry, a "ratio" value is generated by dividing the abundance number (for that entry) for the target library, by the abundance number (for that entry) for the subtractant library. Each ratio value can then be divided by the total number of sequences in one or both libraries, to obtain a relative ratio value for each data base entry.

In variations on step (a), the library consists of nucleotide sequences derived from cDNA clones. Examples of data bases which can be searched for areas of homology (similarity) in step (a) include the commercially available data bases known as Genbank and EMBL.

One homology search algorithm which could be used to implement step (a) is the algorithm described in the paper by D. J. Lipman and W. R. Pearson, entitled "Rapid and Sensitive Protein Similarity Searches", Science, 227, 1435 (1985). In this algorithm, the homologous regions are searched in a two step manner. In the first step, the highest homologous regions are determined by calculating a matching score using a homology score table. The parameter "Ktup" is used in this step to establish the minimum window size to be shifted for comparing two sequences. Ktup also sets the number of bases that must match to extract the highest homologous region among the sequences. In this step, no insertions or deletions are applied and the homology is displayed as an initial (INIT) value.

In the second step, the homologous regions are aligned to obtain the highest matching score by inserting a gap in order to add a probable deleted portion. The matching score obtained in the first step is recalculated using the homology score table and the insertion score table to give an optimized (OPT) value in the final output.
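The role of the Ktup parameter in the first step can be illustrated with a simplified sketch. The code below is not the Lipman-Pearson implementation: it only indexes every Ktup-length word of the query, scans the subject for the same words, and counts hits per diagonal (subject position minus query position). The hit count is used here as a stand-in for the INIT value, whereas the published method scores regions with a homology score table. The function name and sequences are hypothetical.

```python
from collections import Counter, defaultdict

def best_diagonal(query, subject, ktup=4):
    """Return (offset, hits) for the diagonal with the most exact ktup-word
    matches, where offset = subject position - query position.
    No insertions or deletions are considered, as in the first search step."""
    # Index every ktup-length word of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    # Scan the subject and count word hits on each diagonal.
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            diagonals[j - i] += 1
    if not diagonals:
        return None, 0
    return diagonals.most_common(1)[0]

# The query occurs intact in the subject, starting at subject position 2.
offset, hits = best_diagonal("ACGTACGGTC", "TTACGTACGGTCAA", ktup=4)
```

A larger Ktup makes the scan faster and more selective, while a smaller Ktup finds weaker similarities at the cost of more spurious hits.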

DNA homologies between two sequences can be examined graphically using the Harr method of constructing dot matrix homology plots (Needleman, S. B. and Wunsch, C. D., J. Mol. Biol. 48:443 (1970)). This method produces a two-dimensional plot which can be useful in determining regions of homology versus regions of repetition.
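A dot matrix plot of this kind can be sketched as a small text grid. The version below is an illustrative simplification that marks a cell only when the windows starting at the two positions are identical; the function name, the window parameter, and the example sequences are assumptions made for this sketch.

```python
def dot_matrix(seq_a, seq_b, window=3):
    """Return the plot as a list of strings, one row per window start in
    seq_a and one column per window start in seq_b. A '*' marks positions
    where the two windows are identical, '.' marks everything else."""
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = "".join(
            "*" if seq_a[i:i + window] == seq_b[j:j + window] else "."
            for j in range(len(seq_b) - window + 1)
        )
        rows.append(row)
    return rows

# A region of homology appears as a single unbroken diagonal.
plot = dot_matrix("ACGT", "ACGT", window=2)
# A repeated motif appears as several parallel diagonals.
repeat_plot = dot_matrix("ACAC", "ACACAC", window=2)
```

In this sketch the identical sequences give one diagonal line, while the repeated "AC" motif gives parallel diagonals, which is the visual distinction between homology and repetition that the plot is used for.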

However, in a class of preferred embodiments, step (a) is implemented by processing the library data in the commercially available computer program known as the Inherit 670 Sequence Analysis System, available from Applied Biosystems Inc. (of Foster City, Calif.), including the software known as the Factura software (also available from Applied Biosystems Inc.). The Factura program preprocesses each library sequence to "edit out" portions thereof which are not likely to be of interest.

In the algorithm implemented by the Inherit 670 Sequence Analysis System, the Pattern Specification Language (developed by TRW Inc.) is used to determine regions of homology. There are three parameters that determine how the sequence comparisons are run: window size, window offset, and error tolerance. Using a combination of these three parameters, a data base (such as a DNA data base) can be searched for sequences containing regions of homology and the appropriate sequences are scored with an initial value. Subsequently, these homologous regions are examined using dot matrix homology plots to determine regions of homology versus regions of repetition. Smith-Waterman alignments can be used to display the results of the homology search.

The Inherit software can be executed by a Sun computer system programmed with the UNIX operating system.

In preferred embodiments, the processed data generated by the Inherit software (representing identified sequences) are input into, and further processed by, a Macintosh personal computer (available from Apple) programmed with an "abundance sort and subtraction analysis" computer program (to be described below).

The abundance sort and subtraction analysis program (also denoted as the "abundance sort" program) classifies identified sequences from the cDNA clones as to whether they are exact matches (regions of exact homology), homologous human matches (regions of high similarity, but not exact matches), homologous non-human matches (regions of high similarity present in species other than human), or non matches (no significant regions of homology to previously identified nucleotide sequences stored in the form of the data base).

With reference again to the step of identifying matches between library sequences and data base entries, in cases where the library consists of deduced protein and peptide sequences, the match identification can be performed in a manner analogous to that done with cDNA sequences. A protein sequence is used as a query sequence and compared to the previously identified sequences contained in a data base such as the Swiss-Prot data base or the NBRF Protein database to find homologous proteins. These proteins are initially scored for homology using a homology score table (Orcutt, B. C. and Dayhoff, M. O. Scoring Matrices, PIR Report MAT-0285 (February 1985)), resulting in an INIT score. The homologous regions are aligned to obtain the highest matching scores by inserting a gap which adds a probable deleted portion. The matching score is recalculated using the homology score table and the insertion score table, resulting in an optimized (OPT) score. Even in the absence of knowledge of the proper reading frame of an isolated sequence, the above-described protein homology search may be performed by searching all 3 reading frames.
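Searching all three reading frames amounts to translating the nucleotide sequence three times, starting at the first, second, and third base, and using each deduced peptide as a query. The sketch below uses the standard genetic code and represents stop codons as `*`; the function names and the example sequence are hypothetical, and unrecognized codons (for example, those containing N) are written as `X` as an assumption of this sketch.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop
    codon and 'X' marks a codon not found in the table."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def three_frame_translations(dna):
    """Deduced peptides for the three forward reading frames."""
    return [translate(dna[frame:]) for frame in range(3)]

frames = three_frame_translations("AATGGCCTAA")
```

Each of the three peptides would then be compared against the protein data base, and the frame giving the best score taken as the likely reading frame.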

Peptide and protein sequence homologies can also be ascertained using the Inherit 670 Sequence Analysis System in an analogous way to that used in DNA sequence homologies. Pattern Specification Language and parameter windows are used to search protein databases for sequences containing regions of homology which are scored with an initial value. Subsequent examination with a dot-matrix homology plot determines regions of homology versus regions of repetition.

The ABI Assembler application software, part of the INHERITS DNA analysis system (available from Applied Biosystems, Inc., Foster City, Calif.), can be employed to create and manage sequence assembly projects by assembling data from selected sequence fragments into a larger sequence. The Assembler software combines two advanced computer technologies which maximize the ability to assemble sequenced DNA fragments into Assemblages, a special grouping of data where the relationships between sequences are shown by graphic overlap, alignment and statistical views. The process is based on the Meyers-Kececioglu model of fragment assembly (INHERITS™ Assembler User's Manual, Applied Biosystems, Inc., Foster City, Calif.), and uses graph theory as the foundation of a very rigorous multiple sequence alignment engine for assembling DNA sequence fragments.

Next, with reference to FIG. 1, we describe in more detail the "abundance sort" program which implements above-mentioned "step (b)" to tabulate the number of sequences of the library which match each data base entry (the "abundance number" for each data base entry).

FIG. 1 is a flow chart of a preferred embodiment of the abundance sort program. A source code listing of this embodiment of the abundance sort program is set forth below as Appendix A. In the Appendix A implementation, the abundance sort program is written using the FoxBASE programming language commercially available from Microsoft Corporation. The subroutine names specified in FIG. 1 correspond to subroutines listed in Appendix A.

With reference again to FIG. 1, the "Identified Sequences" are data values representing each sequence of the library and a corresponding identification of the data base entry (if any) which it matches. In other words, the "Identified Sequences" are data values representing the output of above-discussed "step (a)."

FIG. 2 is a block diagram of a system for implementing the invention. The FIG. 2 system includes library generation unit 2 which generates a library and asserts an output stream of data values indicative of the sequences comprising the library. Programmed processor 4 receives the data stream output from unit 2, and processes this data in accordance with above-discussed "step (a)" to generate the Identified Sequences. Processor 4 can be a processor programmed with the commercially available computer program known as the Inherit 670 Sequence Analysis System and the commercially available computer program known as the Factura program (both available from Applied Biosystems Inc.) and with the UNIX operating system.

Still with reference to FIG. 2, the Identified Sequences are loaded into processor 6 which is programmed with the abundance sort program. Processor 6 generates the Final Data values indicated in both FIGS. 1 and 2.

With reference to FIG. 1, the abundance sort program first performs an operation known as "Tempnum" on the Identified Sequences, to discard all of the Identified Sequences except those which match data base entries of selected types. For example, the Tempnum process can select Identified Sequences which represent matches of the following types with data base entries: "exact" matches (exact matches with data base entries representing human genes); "homologous" matches (approximate, but not exact, matches with data base entries representing human genes), "other species" matches (exact and/or approximate matches with data base entries representing genes present in species other than human), or "no" matches (no significant regions of homology with data base entries representing previously identified nucleotide sequences).

The data values selected during the "Tempnum" process then undergo a further selection (weeding out) operation known as "Tempred." This operation can, for example, discard all data values representing matches with selected data base entries.
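The two selection operations described above can be sketched as simple filters. The functions below borrow the operation names from FIG. 1 for readability but are not the Appendix A subroutines; the record layout (`clone`, `entry`, `match_type`) and the example entries are hypothetical.

```python
def tempnum(records, keep_types):
    """Keep only Identified Sequences whose match type is one of the
    selected types (e.g. "exact", "homologous", "other species", "no")."""
    return [r for r in records if r["match_type"] in keep_types]

def tempred(records, excluded_entries):
    """Weed out records that match any of the selected data base entries."""
    return [r for r in records if r["entry"] not in excluded_entries]

# Hypothetical Identified Sequences.
records = [
    {"clone": "c1", "entry": "actin", "match_type": "exact"},
    {"clone": "c2", "entry": "mouse IgG", "match_type": "other species"},
    {"clone": "c3", "entry": "rRNA", "match_type": "exact"},
    {"clone": "c4", "entry": None, "match_type": "no"},
    {"clone": "c5", "entry": "ferritin-like", "match_type": "homologous"},
]
selected = tempnum(records, {"exact", "homologous"})
kept = tempred(selected, {"rRNA"})
```

Here the first filter keeps the human exact and homologous matches, and the second discards an entry the analyst has chosen to exclude.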

The data values selected during the "Tempred" process are then classified according to library, during the "Tempdesig" operation. It is contemplated that the Identified Sequences can represent sequences from a single library, or from two or more libraries.

Consider first the case that they represent sequences from a single library. In this case, all the data values determined during "Tempred" undergo sorting in the "Templib" operation, further sorting in the "Libsort" operation, and finally additional sorting in the "Temptarsort" operation. For example, these three sorting operations can sort the identified sequences in order of decreasing "abundance number" (to generate a list of decreasing abundance numbers, each abundance number corresponding to a data base entry, or several lists of decreasing abundance numbers, with the abundance numbers in each list corresponding to data base entries of a selected type) with redundancies eliminated from each sorted list. In this case, the operation identified as "Cruncher" can be bypassed, so that the "Final Data" values are the organized data values produced during the "Temptarsort" operation.
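The end result of the three sorting operations for a single library can be sketched as follows. This illustration collapses "Templib," "Libsort," and "Temptarsort" into one function and is not the Appendix A code. It assumes each record carries a matched entry and a match type (hypothetical field names) and produces one list per match type, with each entry appearing once alongside its abundance number.

```python
from collections import Counter, defaultdict

def sorted_abundance_lists(records):
    """Return {match_type: [(entry, abundance_number), ...]} with each list
    sorted by decreasing abundance number. Counting per entry eliminates
    redundancies; ties are broken alphabetically for a stable order."""
    per_type = defaultdict(Counter)
    for r in records:
        per_type[r["match_type"]][r["entry"]] += 1
    return {
        match_type: sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        for match_type, counts in per_type.items()
    }

# Hypothetical records that survived the earlier selection steps.
records = [
    {"entry": "ferritin", "match_type": "exact"},
    {"entry": "actin", "match_type": "exact"},
    {"entry": "actin", "match_type": "exact"},
    {"entry": "kinase-like", "match_type": "homologous"},
    {"entry": "actin", "match_type": "exact"},
]
lists = sorted_abundance_lists(records)
```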

We next consider the case that the data values produced during the "Tempred" operation represent sequences from two libraries (which we will denote the "target" library and the "subtractant" library). For example, the target library may consist of cDNA sequences from clones of a diseased cell, while the subtractant library may consist of cDNA sequences from clones of the diseased cell after treatment by exposure to a drug. For another example, the target library may consist of cDNA sequences from clones of a cell from a young human, while the subtractant library may consist of cDNA sequences from clones of a cell from the same human (after he or she has aged).

In this case, the "Tempdesig" operation routes all data values representing the target library for processing in accordance with "Templib" (and then "Libsort" and "Temptarsort"), and routes all data values representing the subtractant library for processing in accordance with "Tempsub" (and then "Subsort" and "Tempsubsort"). For example, the consecutive "Templib," "Libsort," and "Temptarsort" sorting operations can sort identified sequences from the target library in order of decreasing abundance number (to generate a list of decreasing abundance numbers, each abundance number corresponding to a data base entry, or several lists of decreasing abundance numbers, with the abundance numbers in each list corresponding to data base entries of a selected type) with redundancies eliminated from each sorted list. The consecutive "Tempsub," "Subsort," and "Tempsubsort" sorting operations would sort identified sequences from the subtractant library in order of decreasing abundance number (to generate a list of decreasing abundance numbers, each abundance number corresponding to a data base entry, or several lists of decreasing abundance numbers, with the abundance numbers in each list corresponding to data base entries of a selected type) with redundancies eliminated from each sorted list.

The data values output from the "Temptarsort" operation typically represent sorted lists from which a histogram could be generated in which position along one (e.g., horizontal) axis indicates abundance number (of target library sequences), and position along another (e.g., vertical) axis indicates data base entry (e.g., human or non-human gene type). Similarly, the data values output from the "Tempsubsort" operation typically represent sorted lists from which a histogram could be generated in which position along one (e.g., horizontal) axis indicates abundance number (of subtractant library sequences), and position along another (e.g., vertical) axis indicates data base entry (e.g., human or non-human gene type).

The data values (sorted lists) output from the Tempsubsort and Temptarsort sorting operations are combined during the operation identified as "Cruncher." The "Cruncher" process identifies pairs of corresponding target and subtractant abundance numbers (both representing the same data base entry), and divides one by the other to generate a "ratio" value for each pair of corresponding abundance numbers, and then sorts the ratio values in order of decreasing ratio value. The data values output from the "Cruncher" operation (the Final Data values in FIG. 1) typically determine a sorted list from which a histogram could be generated in which position along one axis indicates a ratio of abundance numbers (for corresponding sequences from target and subtractant libraries), and position along another axis indicates data base entry (e.g., gene type).

Preferably, the Cruncher operation also divides each ratio value by the total number of sequences in one or both of the target and subtractant libraries. The resulting lists of "relative" ratio values generated by the Cruncher operation would be useful for many medical, scientific, and industrial applications. Also preferably, the output of the Cruncher operation is a set of lists, each list representing a decreasing sequence of ratio values for a different selected subset of data base entries.
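The pairing, division, and sorting performed by "Cruncher" can be sketched as below. This is an illustration under stated assumptions rather than the Appendix A subroutine: abundance numbers are passed as dictionaries keyed by data base entry, only entries present in both libraries are paired (the text describes pairs of corresponding numbers and does not say how unpaired entries are handled), and the optional `divisor` stands for whichever library total is chosen for the "relative" ratio. All names and counts are hypothetical.

```python
def cruncher(target, subtractant, divisor=None):
    """Pair abundance numbers for the same data base entry, divide the target
    number by the subtractant number, and sort by decreasing ratio.
    If `divisor` is given, each ratio is also divided by it to give a
    "relative" ratio value."""
    shared = target.keys() & subtractant.keys()
    ratios = {entry: target[entry] / subtractant[entry] for entry in shared}
    if divisor:
        ratios = {entry: r / divisor for entry, r in ratios.items()}
    # Ties are broken alphabetically so the output order is reproducible.
    return sorted(ratios.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical abundance numbers for two libraries.
target = {"IL-1": 12, "actin": 30, "ferritin": 4}
subtractant = {"IL-1": 2, "actin": 30, "TNF": 5}

ratio_list = cruncher(target, subtractant)
relative_list = cruncher(target, subtractant, divisor=1000)
```

In this example the entry with the largest change between libraries sorts to the top of the list, while an entry with equal abundance in both libraries has a ratio of 1.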

In one example, the abundance sort program of the invention tabulates the numbers of mRNA transcripts corresponding to each gene identified in a data base. These numbers are divided by the total number of clones sampled. The results of the division reflect the relative abundance of the mRNA transcripts in the cell type or tissue from which they were obtained. Obtaining this final data set is referred to herein as "gene transcript image analysis."

The resulting relative abundance data indicate which gene transcripts, and thus which encoded proteins, are upregulated and downregulated. Table 5 shows a comparison of the most common mRNA transcripts between the cell types. Gene transcript image analysis can be done for different cell types and for the same cell type at different stages of development or activation, for example to compare common mRNA transcripts in one cell type (see Tables 3-5). Also, such abundance data can be obtained on a patient sample and used to diagnose conditions associated with macrophage activation.

A gene transcript imaging analysis (or multiple gene transcript imaging analyses) can be used in toxicological studies. For example, the differences in gene transcript imaging analyses before and after treatment can be assessed for patients on placebo and drug treatment. This method effectively screens for markers to follow in clinical use of the drug. Often clinicians have difficulty ascertaining the difference between pathology caused by the disease being treated and by the drug being administered. A gene transcript imaging analysis before treatment can be compared with a gene transcript imaging analysis after treatment to isolate new, unwanted pathology caused by the drug. Such a detailed analysis of patients having hepatitis can help distinguish new hepatic injury caused by drug treatment.

More detailed comparisons can be easily prepared. Actual relative abundances for different samples can be reported in separate columns, or one group of numbers can be divided by the other group to highlight the most extreme changes in relative abundance. Such computations can be performed by humans but are more efficiently performed by computer.

Additional types of comparisons can be made. Cells from different species can be compared by comparative gene transcript analysis to screen for specific differences. Such testing aids in the selection and validation of an animal model for the commercial purpose of drug screening or toxicological testing of drugs intended for human or animal use. When the comparison between animals of different species is shown in columns for each species, we refer to this as an interspecies comparison.

Other embodiments of the invention employ other data bases, such as a random peptide data base, a polymer data base, a synthetic oligomer data base, or an oligonucleotide data base of the type described in U.S. Pat. No. 5,270,170, issued Dec. 14, 1993 to Cull, et al., PCT International Application Publication No. WO 9322684, published Nov. 11, 1993, PCT International Application Publication No. WO 9306121, published Apr. 1, 1993, or PCT International Application Publication No. WO 9119818, published Dec. 26, 1991. These four references (whose text is incorporated herein by reference) include teaching which may be applied in implementing such other embodiments of the present invention.

Various modifications and variations of the described method and system of the invention will be apparent to those skilled in the art without departing from the scope and spirit of the invention. Although the invention has been described in connection with specific preferred embodiments, it should be understood that the invention as claimed should not be unduly limited to such specific embodiments.

TABLE 1

NSEC Clone Descriptors

The following are the current descriptors used to describe each clone in the NSEC database (where information is available):

Library (L): Denotes the cDNA library of clone origin
Designation (D): Describes general category of the clone (e.g. match to a prior sequence, new clone, or non-useable clone)
Certainty (C): Denotes clones where the designation is ambiguous and further work is required to establish true identity
Species (S): Indicates species from which database match was derived
Orientation (O): A (<) indicates match was found in the opposite orientation (limited to certain analyses)
Distribution (F): Describes whether the clone is found in all tissues and cells, or whether its expression is limited in its occurrence
Localization (Z): Describes where in the cell the protein is normally found
Function (R): Describes the functional class of the protein