For functional annotation, similarity-based approaches [1] do not take into account all the information available from comparative and evolutionary biology. They do not differentiate between orthologs and paralogs among homologs and, furthermore, the closest BLAST hit is often not the nearest neighbour [2]. Phylogenetic approaches that take duplication and speciation events into account are necessary to solve these problems. However, they do not incorporate any data on transcriptional behaviour. Yet orthologs can share a very similar 'molecular function' while fulfilling a different 'macroscopic function' because of a transcriptional shift.

A growing amount of gene expression profiling data for normal and pathological tissues is available in various databases (Expressed Sequence Tags [ESTs] from NR, TIGR, GeneNote, Gepis, etc.). Some recent studies examined the correlation between gene evolution (duplication and speciation) and expression divergence within and between species [3, 4], and others examined the expression profiles of orthologous genes in sequenced species [5].

We performed a phylogenetic analysis of a protein family using EST databases. This allowed us to enlarge the set of species containing homologs and consequently to improve the reconstruction of the genes' evolutionary history. We then extracted all the transcriptional data contained in the EST databases to decipher the gene expression pattern. Because gene annotation is currently labour intensive, we used a locally developed platform dedicated to phylogenetic annotation (named FIGENIX) [6]. We validated this approach on a family of genes possibly implicated in rheumatoid arthritis: the peptidyl arginine deiminase (PADI) genes.
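As an illustration of the expression-extraction step, the following minimal sketch tallies the tissue of origin of ESTs assigned to each paralog group and turns the counts into fractions. It is not the FIGENIX implementation: the record layout (a paralog group label paired with a tissue label), the function name `expression_footprint`, and the toy data are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def expression_footprint(est_records):
    """Return {group: {tissue: fraction of that group's ESTs}}.

    est_records is an iterable of (paralog_group, tissue) pairs; this
    layout is hypothetical and chosen only for illustration.
    """
    counts = defaultdict(Counter)
    for group, tissue in est_records:
        counts[group][tissue] += 1

    footprint = {}
    for group, tissue_counts in counts.items():
        total = sum(tissue_counts.values())
        # Normalise by the group's total so groups with different EST
        # coverage can be compared.
        footprint[group] = {t: n / total for t, n in tissue_counts.items()}
    return footprint


# Toy, made-up records: (paralog group, tissue of the source EST library).
records = [
    ("group_A", "brain"),
    ("group_A", "brain"),
    ("group_A", "muscle"),
    ("group_A", "skin"),
    ("group_B", "skin"),
    ("group_B", "skin"),
]

footprint = expression_footprint(records)
for group in sorted(footprint):
    print(group, footprint[group])
```

Raw EST counts depend on library size and sampling depth, so fractions of this kind are only a rough footprint; a normalized expression resource would be a sounder basis for quantitative comparison.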

We present a phylogenetic annotation based on an enlarged dataset that includes EST contigs and expression data. This allowed us to integrate more functional data into the analysis of a set of genes and to derive a transcriptional footprint for each gene. Our analysis showed that the PADI-2 paralog group has kept the ancestral molecular function, coupled with a probable ancestral expression profile. These classified data allowed us to produce an updated footprint of the transcriptional data for each paralog group of this protein family.

We believe this method opens a new way to annotate uncharacterized ESTs. Beyond classical phylogeny, it highlights transcriptional shifts between paralogs and is thus a good tool for improving annotation. It showed that a functional shift can occur through differential tissue expression rather than through a change in the biochemical function of the protein.

This method of analysis is in its early stages and needs to be extended to all kinds of expression databases, including those in which expression data are normalized, such as UniGene. In the future, it should not be overlooked when annotating new, uncharacterized ESTs, for example those highlighted by DNA microarray assays.