I am looking for studies on how many protein isoforms have different functions, preferably in humans. We know that a great many, if not most, human genes are alternatively spliced and that many produce different protein isoforms. Has anyone looked at how many of these isoforms have different cellular functions? If someone could point me to a published paper, that would be great.

If no such study has been done, can anyone recommend a database from which this information could be extracted? Gene Ontology is gene-based, so the information cannot be found there: terms are annotated to genes, not to their individual protein isoforms. I would also need to be able to do this in a high-throughput manner. I am not interested in specific proteins but in what percentage of all isoforms have different functions.

Ideally, I would like to be able to extract, for every human gene, the list of the different protein isoforms it encodes and whether their functions differ, or at least what those functions are.

I don't think isoform-specific information is in a database yet. Testing isoform-specific functions is tricky since not all isoforms are expressed in all cell types. Only recently have researchers been looking at splicing genome-wide and figuring out which splice forms are present in certain cells (using differential exon usage in RNA-seq to infer different isoforms). As you have encountered, functional info is typically summarized by gene. Swiss-Prot, mentioned by @MattDMo, is one of the few databases where info is hand-curated. If isoform-specific function were annotated, it would probably appear there first.
–
yingw Apr 3 '13 at 22:14

3 Answers

It seems that most functional annotation these days is inferred from sequence similarity to previously annotated genes/proteins: this is certainly true of high-throughput functional annotation. It's hard to know how many layers of inference there are between your query sequence and an actual experiment verifying the (possibly tissue- or condition-specific) function of a gene/protein. But alas, I digress...

One confounding factor is that alternatively spliced isoforms share a lot of sequence, which means that they may share the same best hits when searched against a database of annotated transcripts and end up being assigned the same GO terms.

But as others have mentioned, I think the biggest limiting factor you will encounter is the low-throughput step: in many cases the experimental work required to characterize the functions of alternative isoforms simply has not yet been done.
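The shared-sequence problem described above can be illustrated with a toy sketch. Everything in it is made up for illustration: the sequences, the entry names `refA` and `refB`, and the placeholder term labels are hypothetical, and k-mer Jaccard similarity stands in for a real similarity search, which it is not. It only shows how two isoforms of one gene can land on the same best hit and so inherit identical annotation.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity between the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb)


def best_hit(query, database, k=3):
    """Return the name of the database entry most similar to the query."""
    return max(database, key=lambda name: jaccard(query, database[name]["seq"], k))


# Hypothetical annotated reference entries (invented sequences and labels).
database = {
    "refA": {"seq": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "terms": ["placeholder_term_1"]},
    "refB": {"seq": "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS", "terms": ["placeholder_term_2"]},
}

# Two hypothetical isoforms of one gene; isoform_2 lacks an internal segment.
isoforms = {
    "isoform_1": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "isoform_2": "MKTAYIAKQRQLEERLGLIEVQ",
}

# Annotation transfer: each isoform inherits the terms of its best hit.
hits = {name: best_hit(seq, database) for name, seq in isoforms.items()}
inferred = {name: database[hit]["terms"] for name, hit in hits.items()}

for name in isoforms:
    print(name, hits[name], inferred[name])
```

Both isoforms pick up the same terms here even though one is missing a segment, so similarity-based transfer on its own cannot say whether their functions differ.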

Yes, that is the problem in a nutshell. Most (all AFAIK) high-throughput experiments tag genes, not proteins, and assign function to the gene as opposed to a specific isoform.
–
terdon Apr 27 '13 at 15:10

I would take a look at UniProt to begin with. It's my go-to whenever I need to look for isoforms or functional data, so it should help you as well. You could start by comparing the information in Swiss-Prot to TrEMBL, which are manually and automatically annotated, respectively. One issue that I've come across in my research on isoforms is that there often isn't a great deal known about them, as studies may have focused on just one or a few out of many possible. One way to get around this could be to use structure/function predictions based on sequence, then try to predict variations by looking at what's added or missing in the various isoforms.
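As a rough sketch of the last suggestion (looking at what is added or missing between isoforms), Python's standard `difflib` can list the segments that differ between a canonical sequence and an alternative isoform. The two sequences below are invented, and `isoform_differences` is a hypothetical helper name; real isoform sequences would have to be obtained separately, and a proper pairwise aligner would be more appropriate for anything beyond a quick look.

```python
from difflib import SequenceMatcher


def isoform_differences(canonical, isoform):
    """Return (operation, canonical_segment, isoform_segment) for each differing region."""
    # autojunk=False disables difflib's heuristic that ignores very frequent characters.
    matcher = SequenceMatcher(None, canonical, isoform, autojunk=False)
    return [
        (tag, canonical[i1:i2], isoform[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Invented example: the alternative isoform lacks an internal 11-residue segment.
canonical = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
isoform_2 = "MKTAYIAKQRQLEERLGLIEVQ"

diffs = isoform_differences(canonical, isoform_2)
for operation, canonical_segment, isoform_segment in diffs:
    print(operation, repr(canonical_segment), repr(isoform_segment))
```

The reported segments could then be checked against predicted domains or motifs to guess what an isoform has gained or lost. Note that the exact boundaries of a reported deletion can shift by a residue or two when the flanking residues repeat, so the coordinates should be treated as approximate.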

Thanks, but I'm interested in whole-proteome data. UniProt has no parseable information on isoforms apart from how many a given protein has and their sequences. Sometimes the function field has some information, sometimes it doesn't.
–
terdon Apr 2 '13 at 18:30

I'd tend to agree there's very little data on this, especially given how many isoforms there are. I believe that even some pseudogenes may function as proteins. I'd bet that most isoforms have some unique functional aspects.

I think we'd all agree we can't prove that they do in all cases, but let me sketch out how these different functions might arise or are known.

Protein isoforms usually arise in one of two ways. The first is a gene duplication event, which leaves two very similar copies of the same gene. The second is alternative splicing.

As you can imagine, mapping all this out in just one case took years of work. Complete characterization of all human isoforms would take a long time.

Isoforms from gene duplication have many of the same properties as splice-variant isoforms: they can have completely different regulatory machinery, appearing in different biological states than the main gene, and they can vary in sequence a little or a lot. It has been suggested that these gene duplicates persist in the genome because they quickly find a new niche role. This paper shows that 30% of 195 new genes that have shown up since two strains of fly diverged are necessary for viability.

Evolution doesn't let genes sit around if they are completely identical; they can be removed as easily as they show up. Only differential function will keep an isoform in the mRNA repertoire for a prolonged time.

Thanks shigeta, but I would not call proteins coded for by different genes isoforms (and neither does the Wikipedia page you linked to, which talks about alleles of the same gene). They are paralogs or orthologs depending on their evolutionary history. I would only use isoforms to describe products of the same gene. I am also assuming that most isoforms have different functions, hence my question. I am looking for data to back that assumption up.
–
terdon Apr 27 '13 at 15:09