Genome Comparison is a project of the Bioinformatics Team at the Department of Biochemistry and Molecular Biology of Fiocruz that used the computing power of World Community Grid to calculate the level of sequence similarity among all the proteins encoded in the completely sequenced genomes of hundreds of organisms, including humans and several other species of medical, commercial, industrial, or research importance. The calculated similarity indices will be used, together with standardized Gene Ontology, as a reference repository for the annotator community, providing an invaluable data source for biologists.

Only a fraction of the predicted protein content encoded in completely sequenced genomes has actually had its biological function and expression confirmed through laboratory analysis. The assignment of predicted biological functions and structural features to raw sequence data is called annotation, and is accomplished mostly by comparing the sequences with predicted proteins or protein-coding genes described in different public-domain databases around the world. However, annotation is often incomplete, uses non-standardized nomenclature, or can be incorrect when inferred from previously misannotated sequences. Thus, a controlled all-against-all comparative database would be of great use as a reference.

Biological sequences (DNA, RNA, and proteins) are mostly compared in pairs through a process called pairwise sequence alignment, which consists of placing two sequences side by side in such a way that the number of identical positions between them is maximized. The sequences can be aligned globally (over their whole length) or locally (over parts of the sequences), depending on the context and the purpose. The sequence similarity comparison program used in the Genome Comparison Project is called SSEARCH (W. R. Pearson [1991] Genomics 11:635-650), a freely available implementation of the rigorous Smith-Waterman algorithm (T. F. Smith and M. S. Waterman [1981] J. Mol. Biol. 147:195-197), which finds the mathematically best local alignment between pairs of sequences. (An algorithm is an organized procedure for performing a given type of calculation or solving a given type of problem.)
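The local alignment idea described above can be illustrated with a minimal Python sketch of the Smith-Waterman recurrence. This is not the SSEARCH implementation: the function name smith_waterman and the simple scoring values (match, mismatch, and a linear gap penalty) are assumptions chosen for readability, whereas real protein comparisons normally use an amino acid substitution matrix and more refined gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Scoring values are illustrative assumptions, not those used by SSEARCH.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    score, top, bottom = smith_waterman("AAAMSTKQ", "GGMSTKLL")
    print(score)
    print(top)
    print(bottom)
```

Because every cell may fall back to a score of zero, the alignment is free to start and end anywhere inside the two sequences, which is what makes it local rather than global. The table has one cell per pair of positions, so the cost grows with the product of the two sequence lengths.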

The resulting all-against-all comparative database will be of great use as a reference for many research projects on functional aspects, biochemical pathways, and evolutionary aspects, and will be an invaluable source for correct annotation of previously sequenced and newly obtained genome sequences.

Precise annotation, assignment of possible functions to hypothetical proteins of unknown function, and the description of evolutionary relationships between proteins will be a major step forward in our understanding of genome composition, genome evolution, and cellular function.

The contribution to the understanding of host-pathogen relationships, and to the means of developing new drugs and vaccines, will be of utmost benefit to the scientific community at large.

Research on biodiversity and new organisms will greatly benefit from reliable comparative data.

Future sequence releases will build upon the growing cross-referenced database.

The software automatically downloaded small pieces of data (predicted protein sequences) and performed sequence comparisons to accurately calculate the similarity level among them. After the information was processed by members' computers, the results were sent by World Community Grid to Fiocruz, where they are being analyzed by the Bioinformatics Team at the Department of Biochemistry and Molecular Biology. Large-scale comparative analysis with the Smith-Waterman algorithm is computationally intensive and demanded an exceptionally large amount of computing power, which is why it was a perfect project for World Community Grid.

The panel shown in the Genome Comparison agent application window represented the entities involved in the comparison process and a summary of the result obtained for one pair of them.

The small circles on the left side symbolized two different genes, belonging either to two distinct genomes or to a single genome. Inside each circle was the unique number that identified, in the source database, the predicted protein sequence encoded by the gene.

The large circle on the right side of the panel showed the corresponding protein sequences, their descriptions, and the abbreviated names of the similarity scores with their calculated values for that particular pair of sequences.

The protein sequences were represented by an ordered string of letters (as encoded in their respective genes). Each of those letters stands for a different amino acid (M for methionine, S for serine, and so on) in the protein.
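As a small illustration of this one-letter notation, the following Python sketch expands a protein string into amino acid names. The table lists the 20 standard one-letter codes. The names AMINO_ACIDS and spell_out are hypothetical, introduced only for this example, and are not part of the Genome Comparison software.

```python
# Standard one-letter codes for the 20 common amino acids.
AMINO_ACIDS = {
    "A": "alanine",       "R": "arginine",      "N": "asparagine",
    "D": "aspartic acid", "C": "cysteine",      "Q": "glutamine",
    "E": "glutamic acid", "G": "glycine",       "H": "histidine",
    "I": "isoleucine",    "L": "leucine",       "K": "lysine",
    "M": "methionine",    "F": "phenylalanine", "P": "proline",
    "S": "serine",        "T": "threonine",     "W": "tryptophan",
    "Y": "tyrosine",      "V": "valine",
}


def spell_out(sequence):
    """Return the amino acid name for each letter of a protein sequence.

    Letters outside the 20 standard codes are reported as "unknown".
    """
    return [AMINO_ACIDS.get(letter, "unknown") for letter in sequence.upper()]


if __name__ == "__main__":
    print(spell_out("MSTK"))
```

The order of the letters matters: the same amino acids in a different order make up a different protein, which is why the sequences are compared position by position.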

Most protein sequences are hypothetical or putative, which means that their existence has been computationally predicted but their expression by the respective cell or organism has not yet been experimentally confirmed.