By combing through gene databases in a new way, researchers found more than 2000 human and mouse protein variants never before reported.

For more than a decade, the complete human and mouse genomes have been
sequenced and catalogued in databases for anyone to search through. But the
list of proteins encoded by those genomes still isn’t complete:
researchers have just discovered more than 2000 new mammalian proteins created
by splicing known genes in new ways (1). The strings of nucleotides that
encode these novel proteins were already listed in databases, but it was
assumed that they weren’t translated into proteins until a team of
Australian scientists took a closer look.

“It all started when we found this bizarre splice form of a protein we were
studying,” said Aude Fahrer of the Australian National University, senior author
of the study. “We wondered how many other similar proteins there were.”

The protein was Ncaph2, which is involved in chromosome assembly. The
alternate version of its transcript was spliced to lack 17 base pairs of one
exon. Because 17 is not a multiple of three, the deletion removes a partial
codon, which should shift the reading frame and render the remainder of the
protein unreadable. Indeed, in gene databases, the alternate splice was listed
but annotated as “nonsense-mediated decay,” indicating that it wouldn’t be
turned into a protein.
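
The frameshift arithmetic can be illustrated with a minimal Python sketch. The toy sequence and the function names here are invented for illustration; they are not the Ncaph2 sequence or the study's code.

```python
def frame_shift(deleted_bases):
    """Return how far a deletion shifts the reading frame (0 means still in frame)."""
    return deleted_bases % 3


def split_codons(seq):
    """Split a sequence into complete codons, reading from position 0."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def delete_bases(seq, start, length):
    """Remove `length` bases beginning at index `start`."""
    return seq[:start] + seq[start + length:]


# Hypothetical toy transcript: a start codon followed by ten GCT (alanine) codons.
toy = "ATG" + "GCT" * 10
spliced = delete_bases(toy, 3, 17)  # drop 17 bases right after the start codon

print(frame_shift(17))             # 2: the deletion is not a whole number of codons
print(split_codons(toy)[1:3])      # ['GCT', 'GCT']
print(split_codons(spliced)[1:3])  # ['TGC', 'TGC']: every downstream codon changes
```

A deletion of 18 bases would give `frame_shift(18) == 0`, leaving the downstream codons intact.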

Fahrer’s team, however, found that an alternate start codon—in line with the
alternate splice reading frame—rescued the protein, allowing the cell to
create an alternate form after all. They searched the literature for similar
cases of protein isoforms with alternate start codons that could rescue a
frameshift and found just three other published examples.

Thus began the search through the NCBI and ENSEMBL databases to find more,
similar cases in the mouse and human genomes. “We looked for places where
transcripts were misaligned by something not divisible by three,” Fahrer
said. “Then we asked how many of those have a rescue start and stop codon.”
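
The two questions Fahrer describes can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the function names, the toy sequence, and the rule that a rescue reading frame must span the splice junction are all hypothetical, and the team's actual pipeline and criteria are not described in this article.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_frameshifting(length_difference):
    """True when two transcripts differ by a length not divisible by three."""
    return length_difference % 3 != 0


def find_orfs(seq, frame):
    """Return (start, end) pairs for ATG...stop reading frames in one frame.

    `end` is exclusive and includes the stop codon.
    """
    orfs = []
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            orfs.append((start, i + 3))
            start = None
    return orfs


def rescue_orfs(seq, junction):
    """Assumed criterion: keep reading frames that start before the splice
    junction and stop after it, in any of the three frames."""
    hits = []
    for frame in range(3):
        for start, end in find_orfs(seq, frame):
            if start < junction < end:
                hits.append((frame, start, end))
    return hits


# Hypothetical toy transcript with a start codon at index 1 and a stop at 16-18.
toy = "CATGAAACCCGGGTTTTAGC"

print(is_frameshifting(17))   # True
print(rescue_orfs(toy, 10))   # [(1, 1, 19)]
```

A real search would also need the additional filters the researchers applied, such as restricting candidates to well-sequenced regions of the genome.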

Once additional criteria were applied to the search to ensure that the genes
were from well-sequenced areas of the genome, for example, Fahrer and her
colleagues generated a list of 1849 human and 733 mouse transcripts that
could encode alternate protein isoforms. Eighty percent of the transcripts were
incorrectly annotated as non-protein coding in the existing databases.

“To find two thousand new proteins is pretty cool,” Fahrer said. “And chances
are that some of these will be quite important biologically.” In one of the
known cases, for example, the alternate isoform affects a pathway in the
opposite way from the primary protein.

Because bioinformatics generates only predictions, Fahrer’s group sought
proof that these proteins are translated. They added the predicted protein
sequences to a mass spectrometry database and reanalyzed some published
mass spectrometry experiments. Such an experiment is unlikely to turn up all
possible proteins, but the team detected the presence of 26 novel isoforms.
An additional 38 proteins were validated by comparing them to a recently
published list of experimentally verified translation initiation sites.

Fahrer’s team has contacted the ENSEMBL database administrators to ask that
the transcript annotations be updated for the newly discovered proteins.
“What we hope now is that other researchers take a look to see if we have
found a new isoform of their favorite protein; these are now available for
anyone to work on,” said Fahrer.