OperonDB is based on a method that detects and analyzes conserved pairs of adjacent genes located on the same DNA strand in two or more bacterial genomes. For each conserved gene pair, OperonDB estimates the probability that the genes belong to the same operon by taking into account alternative possibilities that explain why the genes are adjacent in several genomes. Prediction of operon structure depends on conservation of gene order and orientation in two or more species. Genes within an operon often have related functions. Operon structure provides information about the function of genes within an operon.
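
The first step described above, finding adjacent same-strand gene pairs that recur in two or more genomes, can be sketched as follows. This is only an illustration under stated assumptions: genes are taken to be already mapped to shared ortholog identifiers, chromosomes are treated as linear, and the function names and toy genomes are hypothetical. It does not reproduce OperonDB's probability estimate.

```python
from collections import defaultdict


def adjacent_same_strand_pairs(genome):
    """Return unordered pairs of neighbouring genes that share a strand.

    `genome` is a list of (ortholog_id, strand) tuples in chromosomal order.
    """
    pairs = set()
    for (gene_a, strand_a), (gene_b, strand_b) in zip(genome, genome[1:]):
        if strand_a == strand_b and gene_a != gene_b:
            pairs.add(frozenset((gene_a, gene_b)))
    return pairs


def conserved_pairs(genomes, min_genomes=2):
    """Count, for each pair, the genomes in which it is adjacent on one strand."""
    counts = defaultdict(int)
    for genome in genomes.values():
        for pair in adjacent_same_strand_pairs(genome):
            counts[pair] += 1
    # Keep only pairs conserved in at least `min_genomes` genomes.
    return {pair: n for pair, n in counts.items() if n >= min_genomes}


# Toy data: ortholog IDs and strands are invented for illustration.
GENOMES = {
    "genome_1": [("OG1", "+"), ("OG2", "+"), ("OG3", "-"), ("OG4", "-")],
    "genome_2": [("OG4", "+"), ("OG2", "-"), ("OG1", "-"), ("OG3", "+")],
    "genome_3": [("OG1", "+"), ("OG3", "+"), ("OG2", "+")],
}

conserved = conserved_pairs(GENOMES)
```

In this toy input only the OG1/OG2 pair is adjacent on the same strand in two genomes, so it is the only candidate that would be passed on to a probability model.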

Domain fusion predicts a functional relationship between two distinct genes in one organism when those two genes are found fused into a single continuous sequence in another organism. The fused gene suggests a relationship between the component genes that is not necessarily due to sequence similarity. Fusion links frequently relate genes of the same functional category. The function of an uncharacterized gene within a fusion link can be inferred from the known function of the gene to which it is fused.
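
A minimal sketch of how fusion links might be derived from similarity hits is shown below. It assumes that each hit records where a query gene aligns on a candidate fused protein (1-based, inclusive coordinates) and that two genes are linked when they align to non-overlapping regions of the same fused protein. The hit tuples, gene names, and the `max_overlap` parameter are hypothetical and are not taken from the method described above.

```python
from itertools import combinations


def fusion_links(hits, max_overlap=0):
    """Return (gene pair, fused protein) links from alignment hits.

    `hits` is a list of (query_gene, fused_protein, start, end) tuples, where
    start/end are 1-based inclusive coordinates on the fused protein.
    """
    by_fused = {}
    for gene, fused, start, end in hits:
        by_fused.setdefault(fused, []).append((gene, start, end))

    links = set()
    for fused, regions in by_fused.items():
        for (g1, s1, e1), (g2, s2, e2) in combinations(regions, 2):
            if g1 == g2:
                continue
            # Number of shared positions; zero or negative means no overlap.
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                links.add((frozenset((g1, g2)), fused))
    return links


# Toy hits, invented for illustration.
HITS = [
    ("gene_a", "fused_x", 1, 200),
    ("gene_b", "fused_x", 210, 450),
    ("gene_c", "fused_x", 150, 300),
    ("gene_d", "fused_y", 1, 100),
]

links = fusion_links(HITS)
```

Here gene_a and gene_b cover separate regions of fused_x and are linked, while gene_c overlaps both and is not.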

Phylogenetic profiling infers the function of a gene from a gene of known function that shows an identical pattern of presence and absence across a set of phylogenetically distributed genomes. The profile of a gene is the pattern of occurrence of its orthologs across that set of genomes. Orthologs are used here as defined in the COG database. Two genes are assumed to be functionally related if the correlation between their profiles is greater than would be expected by chance.
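
One way to make "greater than would be expected by chance" concrete is a hypergeometric tail probability on the number of genomes in which both genes occur. The sketch below assumes binary presence/absence profiles and uses that test as a stand-in. The text above does not say which statistic is actually used, and the profiles and gene names are invented.

```python
from math import comb


def cooccurrence_p_value(profile_a, profile_b):
    """Probability of at least the observed co-occurrence under random placement.

    Profiles are equal-length sequences of 0/1 (ortholog absent/present),
    one entry per genome, in the same genome order.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    n_genomes = len(profile_a)
    n_a = sum(profile_a)
    n_b = sum(profile_b)
    n_both = sum(a * b for a, b in zip(profile_a, profile_b))
    # Hypergeometric upper tail: P(overlap >= n_both) given the two totals.
    total = comb(n_genomes, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_genomes - n_a, n_b - k)
        for k in range(n_both, min(n_a, n_b) + 1)
    )
    return tail / total


# Toy profiles over eight genomes, invented for illustration.
PROFILES = {
    "gene_x": (1, 1, 0, 0, 1, 0, 1, 0),
    "gene_y": (1, 1, 0, 0, 1, 0, 1, 0),  # identical to gene_x
    "gene_z": (1, 0, 1, 0, 0, 1, 0, 1),  # shares one genome with gene_x
}

p_xy = cooccurrence_p_value(PROFILES["gene_x"], PROFILES["gene_y"])
p_xz = cooccurrence_p_value(PROFILES["gene_x"], PROFILES["gene_z"])
```

The identical pair gets a small probability (1/70 with these toy profiles), while the weakly overlapping pair gets a probability close to 1. With only eight genomes even a perfect match is weak evidence, which is why profiles are built across many genomes.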

Related function between a pair of genes can be inferred from the conservation of proximity between the two genes across many genomes. The probability that neighboring genes encode proteins within the same biological pathway depends on the number of genomes in which the proximity of the genes is conserved. The conserved order of genes implies selective bias which suggests related function. This method produces links between ortholog families validated by observed proximity in genomes representing multiple phylogenetic groups.
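
The idea of counting how widely a neighborhood is conserved can be sketched as below. The sketch assumes each genome is given as an ordered list of ortholog family IDs together with a phylogenetic group label, and it keeps a link only when the two families lie within a small window in genomes from at least two groups. The `window` and `min_groups` parameters, the group labels, and the data are hypothetical.

```python
from collections import defaultdict


def proximity_links(genomes, groups, window=2, min_groups=2):
    """Return family pairs found close together in several phylogenetic groups.

    `genomes` maps genome name -> ordered list of ortholog family IDs.
    `groups` maps genome name -> phylogenetic group label.
    """
    groups_seen = defaultdict(set)  # family pair -> groups where it is proximal
    for name, order in genomes.items():
        for i, fam_a in enumerate(order):
            # Look at the next `window` genes along the chromosome.
            for fam_b in order[i + 1 : i + 1 + window]:
                if fam_a != fam_b:
                    groups_seen[frozenset((fam_a, fam_b))].add(groups[name])
    return {
        pair: len(seen)
        for pair, seen in groups_seen.items()
        if len(seen) >= min_groups
    }


# Toy data, invented for illustration.
GENE_ORDERS = {
    "g1": ["F1", "F2", "F3", "F4"],
    "g2": ["F2", "F9", "F1", "F7"],
    "g3": ["F5", "F1", "F2", "F6"],
}
PHYLO_GROUPS = {"g1": "alpha", "g2": "alpha", "g3": "beta"}

proximal = proximity_links(GENE_ORDERS, PHYLO_GROUPS)
```

Counting distinct groups rather than genomes keeps closely related genomes, which share gene order by descent, from inflating the support for a link.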

The Horn lab uses Data Mining of Enzymes (DME) to determine whether a protein is an enzyme and, if so, to determine its EC classification. DME compares the sequence of a protein with a list of Specific Peptides (SPs) to search for matches of SPs within the protein sequence. SPs are subsequences of amino acids within an enzyme and are responsible for an enzyme's specific function. They cover most of the annotated active and binding site amino acids and occur in the 3-D pockets that are proximate to the active site. SPs are extracted from enzyme sequences and are specific to levels of the Enzyme Commission functional hierarchy. Given the sequence of a protein, DME searches through the list of SPs to find matches between SPs and subsequences of the protein. DME then uses the EC assignment associated with each matching SP to provide the predicted EC for the query protein.
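
The matching step can be illustrated with a small sketch. It assumes exact substring matching and a simple majority vote over the EC labels of the matched SPs. Both are simplifications chosen here for illustration, and the peptides, their EC labels, and the query sequence are placeholders rather than real DME data.

```python
from collections import Counter

# Placeholder SP list: peptide -> EC label (invented, not real DME data).
SPECIFIC_PEPTIDES = {
    "WKDHC": "2.7.1.1",
    "MNPQY": "2.7.1.1",
    "TTRRE": "4.1.1.1",
}


def find_sp_matches(sequence, specific_peptides):
    """Return (peptide, ec, start) for every exact SP occurrence (0-based start)."""
    matches = []
    for peptide, ec in specific_peptides.items():
        start = sequence.find(peptide)
        while start != -1:
            matches.append((peptide, ec, start))
            start = sequence.find(peptide, start + 1)
    return matches


def predict_ec(sequence, specific_peptides):
    """Return the EC label supported by the most SP matches, or None."""
    matches = find_sp_matches(sequence, specific_peptides)
    if not matches:
        return None  # no SP found: no enzyme prediction in this sketch
    votes = Counter(ec for _, ec, _ in matches)
    return votes.most_common(1)[0][0]


QUERY = "AAWKDHCGGGMNPQYLLTTRREAA"
predicted = predict_ec(QUERY, SPECIFIC_PEPTIDES)
```

A sequence with no SP match returns `None`, which corresponds to the "not an enzyme" outcome in this simplified picture.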

The Vitkup lab's approach integrates sequence-based and context-based correlations to probabilistically predict global metabolic networks for completely sequenced microbial genomes. The method is based on Gibbs sampling of an entire metabolic network and provides probabilities (confidence values) for all metabolic annotations and alternative assignments. For predictions submitted to COMBREX, they focus on predictions of specific metabolic molecular functions, i.e., four digits of the Enzyme Commission (EC) numbers. Based on calculated and optimized context-based correlations as well as metabolic flux-balanced reconstructions, the genes responsible for many new metabolic molecular functions are predicted. Context genomic correlations such as chromosomal gene clustering, phylogenetic profiles and gene fusion can provide important functional clues even if sequence homology information is remote or absent.

This is a preliminary upload of functional annotations based on PHOG orthology (prior to the COMBREX workshop). We provide them to facilitate working out any issues that may arise while we are meeting face-to-face. Berkeley PHOG does not yet assign numerical confidence values, but I (Ruchira) have assigned the confidence value "0.2" to all annotations in this file to indicate our most permissive, "loose" parameter setting (equivalent to the tree threshold used for human-fruit fly orthology calls). We expect this set of annotations to change in two ways: 1) A subset of these gene annotations may be overridden subsequently with functional annotations based on a smaller tree threshold distance (i.e., with higher confidence values). In this case the method field (which contains the supporting PHOG accessions) will also be updated. 2) The majority of H. pylori proteins are present in PhyloFacts trees; however, we are continuing to build new trees to improve the annotation of a subset of H. pylori proteins through the use of advanced remote homolog detection clustering methods. We expect this to result in several additional annotations at various confidence levels.