Assessing the Gene Ontology

Abstract

Motivation: The Gene Ontology (GO) is heavily used in systems biology, but the potential for redundancy, confounds with other data sources, and problems with stability over time have been little explored.

Results: We report that GO annotations are stable over short periods, with 3% of genes failing to be most semantically similar to themselves between monthly GO editions. However, we find that genes can alter their “functional identity” over time, with 20% of genes not matching themselves (by semantic similarity) after two years. We further find that annotation bias in GO, in which some genes are more thoroughly characterized than others, has declined in yeast but generally risen in humans. Finally, we discovered that many entries in protein interaction databases derive from the same published reports that are used for GO annotations, with 66% of assessed GO groups exhibiting this confound. We provide a case study to illustrate how this information can be used in analyses of gene sets and networks.

Publication: Submitted

Contact: paul[at]msl.ubc.ca or JGillis[at]cshl.edu for assistance with the data.

Data files

The following files for human genes are intended to assist researchers who wish to check their own data for the types of effects we report in the paper. The files are tab-delimited. Genes are referenced by NCBI IDs or official symbols, and publications by PubMed IDs.

HIPPIE PPIN – The protein interaction data used in sections 3.3 and 3.4.

frac_confound_go_103 – Each GO group’s confoundedness for our final data point for GO (edition 103). These data are plotted in Figure 3A. “NaN” appears where the calculation involved division by zero.

frac_confound_con_103 – Number of functions shared by gene pairs from the PPIN, and the number of functions confounded for our final data point for GO (edition 103). These data are plotted in Figure 3B.
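
As a rough illustration of working with these tab-delimited files, the Python sketch below parses rows and skips “NaN” values. The column layout is an assumption, not taken from the files themselves: it supposes two columns (a group identifier, then a fraction) and no header row. The sample rows and group names are invented for the example, so check the actual files before relying on it.

```python
import csv
import io
import math

# Invented sample in the ASSUMED layout: group identifier, fraction confounded.
# The real files may have different columns or a header row.
SAMPLE = "group_a\t0.50\ngroup_b\tNaN\ngroup_c\t0.00\n"


def read_fractions(handle):
    """Return {group_id: fraction}, skipping rows whose value is NaN."""
    fractions = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        value = float(row[1])  # float("NaN") parses to a floating-point nan
        if math.isnan(value):
            continue  # division-by-zero entries carry no usable value
        fractions[row[0]] = value
    return fractions


fractions = read_fractions(io.StringIO(SAMPLE))

# Groups with a non-zero fraction, and their share among groups with a value.
confounded = [group for group, frac in fractions.items() if frac > 0]
share = len(confounded) / len(fractions) if fractions else float("nan")
```

To read a downloaded file instead of the sample, pass an open file handle, for example `open(path, newline="")` where `path` is wherever the file was saved.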