Adam Siepel, Brona Brejova and colleagues at several other institutions will report their discoveries in an upcoming issue of the journal Genome Research. Using computers to compare portions of the human genome with those of other mammals, they found about 300 previously unidentified human genes, along with extensions of several hundred genes already known.

Our current understanding is that the human genome contains about 20,000 protein-coding genes. When such a low number came out, many people were shocked. This new method by Siepel and crew, which compares genomes and scans for genes, shows there could still be many more genes that previous methods have missed. Those earlier methods are very effective at finding genes that are widely expressed but may miss those that are expressed only in certain tissues or at early stages of embryonic development.

“…set out to find genes that have been “conserved” — that are fundamental to all life and that have stayed the same, or nearly so, over millions of years of evolution.

The researchers started with “alignments” discovered by other workers — stretches up to several thousand bases long that are mostly alike across two or more species. Using large-scale computer clusters, including an 850-node cluster at the Cornell Center for Advanced Computing, the researchers ran three different algorithms, or computing designs — one of which Siepel created — to compare these alignments between human, mouse, rat and chicken in various combinations.

After eliminating predictions that matched already known genes, the researchers tested the remainder in the laboratory, proving that many of the genes could in fact be found in samples of human tissue and could code for proteins. The researchers were sometimes able to identify the proteins by comparison with databases of known proteins. The discovered genes mainly have to do with motor activity, cell adhesion, connective tissue and central nervous system development, functions that might be expected to be common to many different creatures.

The entire project, from building and testing the mathematical models to running final laboratory tests, took about three years, Siepel said.”
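The step of "eliminating predictions that matched already known genes" comes down to an interval-overlap check. Here is a minimal sketch of that idea, not the researchers' actual pipeline. The coordinates, the function names and the half-open (chromosome, start, end) convention are all my own assumptions for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (chromosome, start, end), half-open coordinates.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals sit on the same chromosome and share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def novel_predictions(predicted: List[Interval], known: List[Interval]) -> List[Interval]:
    """Keep only predictions that overlap no known gene annotation."""
    return [p for p in predicted if not any(overlaps(p, k) for k in known)]


# Made-up toy coordinates, not real annotations.
known = [("chr4", 100, 200), ("chr4", 500, 650)]
predicted = [("chr4", 150, 180), ("chr4", 300, 360), ("chr7", 100, 200)]

novel = novel_predictions(predicted, known)
print(novel)
```

The quadratic scan is fine for a toy list. A genome-wide run would want sorted intervals or an interval tree instead.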

So just how did they do it? One of the genes used to train the algorithm was GRIA2, a gene that makes a receptor for neurotransmitters:

“The portions of the gene that code for amino acids that make up a protein change in different ways from other parts of the genome, so computer algorithms can use these distinctive patterns of evolutionary change to identify new genes that have been missed by other methods. A portion of GRIA2 is shown here in an alignment of the genomes of several species, beneath a graph of the computer analysis. Peaks in the graph identify exons (regions that are expressed), separated by introns (non-coding regions). When a cell reads the gene to make a protein the introns are edited out.”
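The quote doesn't spell out which "distinctive patterns of evolutionary change" the algorithms look for. One well-known signature of coding sequence is that differences between species pile up at the third position of each codon, since many third-position changes leave the amino acid unchanged. The sketch below counts mismatches by codon position in a pairwise alignment. It is only an illustration of that signature, not any of the three algorithms used in the paper, and the two sequences are made-up toy data, not real GRIA2 sequence.

```python
from typing import List


def mismatches_by_codon_position(seq_a: str, seq_b: str, frame: int = 0) -> List[int]:
    """Count mismatches at codon positions 1, 2 and 3 of an aligned pair.

    `frame` is the index of the first base of the first codon.
    Columns containing a gap character ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    counts = [0, 0, 0]
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if i < frame or "-" in (a, b):
            continue
        if a != b:
            counts[(i - frame) % 3] += 1
    return counts


# Toy aligned sequences (hypothetical, chosen so every change is synonymous).
species_one = "ATGGCTAAAGGTCTG"
species_two = "ATGGCCAAGGGACTC"

counts = mismatches_by_codon_position(species_one, species_two)
print(counts)  # mismatches at codon positions 1, 2, 3
```

In this toy pair every difference lands on a third position, which is the kind of periodic pattern you would not expect from an intron or other non-coding stretch.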

The paper is available in advance under the title “Targeted discovery of novel human exons by comparative genomics.” Comparative genomics is so cool. I’d like to see this study replicated with gorilla, Neandertal, chimpanzee, macaque and human genomes, but we gotta wait until several of those are completed.