Bovine Genome Analysis

Sun, 03/09/2008 - 13:26

We have compiled the orthologous groups of the B.tar GLEAN genes in the context of the Ensembl annotation of the C.fam, H.sap, M.dom, M.mus, O.ana and R.nor genomes, i.e. all genes that presumably originated from a single gene in the last common ancestor of these species are assigned the same orthologous group ID (just a number for now).

We would also like to caution that large-scale automatic pipelines make mistakes, and manual phylogenetic analysis may prove more accurate. For example, an incomplete gene model can lead to a gene being omitted from the classification when there are several closely related orthologous groups, and artificially joined genes are pulled into the more conserved orthologous group.

The btar_jan08_pubogs.txt file is tab-delimited plain text,
where the columns are:
[orthol. group ID] [official gene ID] [internal gene ID]
(the internal gene ID can be used to recognize the species by its 4-letter abbreviation).
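
As an illustration of the three-column layout described above, here is a minimal Python sketch that groups rows by orthologous group ID. The sample rows and gene IDs are made up, and the sketch assumes the 4-letter abbreviation appears verbatim (in any letter case) somewhere in the internal gene ID; check the actual file before relying on that.

```python
import csv
import io
from collections import defaultdict

# Species abbreviations as used in the file names of this release.
SPECIES = ("Btar", "Cfam", "Hsap", "Mdom", "Mmus", "Oana", "Rnov")


def species_of(internal_id):
    """Return the species abbreviation found in an internal gene ID, or None.

    Assumption: the abbreviation occurs as a substring of the ID; if several
    occur, the leftmost one wins.
    """
    upper = internal_id.upper()
    hits = [(upper.find(s.upper()), s) for s in SPECIES if s.upper() in upper]
    return min(hits)[1] if hits else None


def read_groups(handle):
    """Map each orthologous group ID to its (official ID, internal ID, species) rows."""
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        og_id, official_id, internal_id = row[:3]
        groups[og_id].append((official_id, internal_id, species_of(internal_id)))
    return groups


# Made-up rows standing in for btar_jan08_pubogs.txt.
sample = "1\tGENE_A\tBtar_0001\n1\tGENE_B\tHsap_0002\n2\tGENE_C\tMmus_0003\n"
groups = read_groups(io.StringIO(sample))
```

For the real file, pass an open file handle, e.g. `read_groups(open("btar_jan08_pubogs.txt", newline=""))`.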

The file Btar_Cfam_Hsap_Mdom_Mmus_Oana_Rnov.set-OGs-dist_count.txt presents a gene copy-number analysis of the orthologous groups: the number of genes from each species in each orthologous group is counted, and then the number of orthologous groups sharing the same phylogenetic copy-number profile is counted. For example:

Btar Cfam Hsap Mdom Mmus Oana Rnov
6713 1 1 1 1 1 1 1

means that there are 6713 orthologous groups in which each of the species has exactly one orthologous gene, and

2440 1 1 1 1 1 0 1

means that there are an additional 2440 groups that are single-copy in every species except platypus, where no ortholog was found.
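
The two counting steps can be sketched as follows. This is a toy Python illustration, not the pipeline that produced the file; the group IDs and the `copy_number_profiles` helper are hypothetical.

```python
from collections import Counter

SPECIES = ("Btar", "Cfam", "Hsap", "Mdom", "Mmus", "Oana", "Rnov")


def copy_number_profiles(group_species):
    """Count how many groups share each per-species copy-number profile.

    group_species maps a group ID to a list with one species abbreviation
    per member gene.
    """
    profiles = Counter()
    for members in group_species.values():
        per_species = Counter(members)  # step 1: genes per species in this group
        profile = tuple(per_species.get(s, 0) for s in SPECIES)
        profiles[profile] += 1  # step 2: groups per profile
    return profiles


# Toy input: two fully single-copy groups and one lacking a platypus gene.
toy = {
    "og1": list(SPECIES),
    "og2": list(SPECIES),
    "og3": [s for s in SPECIES if s != "Oana"],
}
profiles = copy_number_profiles(toy)

print(*SPECIES)
for profile, n_groups in profiles.most_common():
    print(n_groups, *profile)
# Btar Cfam Hsap Mdom Mmus Oana Rnov
# 2 1 1 1 1 1 1 1
# 1 1 1 1 1 1 0 1
```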

Files starting with ‘SUBSET-..’ correspond to subsets of orthologous groups whose phylogenetic profiles are “interesting” to investigate further, e.g. the file
SUBSET-NO-MmusRnov_BtarCfam_MdomOana_NO-Hsap.txt
contains 308 groups that have at least one gene in each of the pairs Btar/Cfam and Mdom/Oana and no orthologs in Mmus/Rnov or Hsap [literally “where (Btar>0 or Cfam>0) and Hsap=0 and (Mmus=0 and Rnov=0) and (Mdom>0 or Oana>0)”]
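
The bracketed condition translates directly into a predicate over a group's per-species gene counts. The Python sketch below is only an illustration; `matches_subset` and the example profiles are hypothetical names and data.

```python
def matches_subset(counts):
    """True if a group fits the NO-MmusRnov_BtarCfam_MdomOana_NO-Hsap profile.

    counts maps each species abbreviation to its gene count in the group.
    """
    return (
        (counts["Btar"] > 0 or counts["Cfam"] > 0)
        and counts["Hsap"] == 0
        and (counts["Mmus"] == 0 and counts["Rnov"] == 0)
        and (counts["Mdom"] > 0 or counts["Oana"] > 0)
    )


# Made-up profiles: the first fits the subset, the second has a human gene.
fits = {"Btar": 1, "Cfam": 0, "Hsap": 0, "Mdom": 0, "Mmus": 0, "Oana": 2, "Rnov": 0}
with_human = dict(fits, Hsap=1)

selected = [name for name, c in (("fits", fits), ("with_human", with_human))
            if matches_subset(c)]
```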

Files named ‘cow_loose111_..’ are intended for David Lynn. They are tar’ed and gzip’ed archives of FASTA-formatted sequences (fs), multiple alignments in FASTA format (aln), and the Gblocks-filtered multiple alignments (GBaln) for each of the 12,587 orthologous groups of loosely defined single-copy orthologs, i.e. groups with one ortholog in cow and/or dog, mouse and/or rat, opossum and/or platypus, and human. '.._cdna_fs' contains the corresponding cDNA FASTA sequences pulled from Ensembl/Glean5 per orthologous group, with all the internal, protein, transcript etc. IDs in the headers, and '.._cdna_CodonALN' contains PAML-formatted, filtered cDNA alignments guided by the amino acid alignments using PAL2NAL. [Note: because of the huge number of files, you may have problems browsing them if they are all extracted into the same directory; tip for Unix users: use the ‘find’ utility.]
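
One way to sidestep the huge-directory problem is to read members straight from the archive without extracting it. The Python sketch below uses the standard `tarfile` module; the archive and member names are made up for the demonstration and do not reflect the actual layout of the ‘cow_loose111_..’ files.

```python
import io
import os
import tarfile
import tempfile


def iter_members(archive_path):
    """Yield (member name, raw bytes) for each regular file in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:  # streams through the archive one member at a time
            if member.isfile():
                yield member.name, tar.extractfile(member).read()


# Build a tiny stand-in archive so the sketch runs on its own.
tmpdir = tempfile.mkdtemp()
demo_path = os.path.join(tmpdir, "demo.tar.gz")
with tarfile.open(demo_path, "w:gz") as tar:
    for name, text in (("og1.fs", ">seq1\nMKV\n"), ("og2.fs", ">seq2\nMAL\n")):
        data = text.encode()
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

contents = dict(iter_members(demo_path))
```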

METHODS [as in PMID: 17947323]
Groups of orthologous genes were automatically identified from all-against-all protein sequence comparisons using the Smith-Waterman algorithm as implemented in ParAlign with default parameters, followed by clustering of best reciprocal hits (BRH), from the highest scoring ones down to an e-value cutoff of 10^-6 for triangulating BRH or 10^-10 for unsupported BRH, and requiring a sequence alignment overlap of at least 30 amino acids across all members of a group. Furthermore, the orthologous groups were expanded by genes that are more similar to each other within a proteome than to any gene in any of the other species, and by very similar copies that share over 97% sequence identity, which were identified initially using CD-Hit. Only the longest transcript per gene was considered.
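
To make the BRH step more concrete, here is a simplified Python sketch, not the actual pipeline. It assumes that a "triangulating" BRH is one whose two genes share a BRH partner in a third species, it takes the weaker e-value of the two directions as the pair's e-value, and it omits the score-ordered clustering, the 30-amino-acid overlap requirement and the in-paralog expansion. All function names and hit values are hypothetical.

```python
from collections import defaultdict


def best_hits(hits):
    """Best-scoring hit of each query gene in each other species.

    hits: iterable of (query, target, score, evalue); genes are (species, name).
    """
    best = {}
    for query, target, score, evalue in hits:
        if query[0] == target[0]:
            continue  # only between-species hits matter for BRH
        key = (query, target[0])
        if key not in best or score > best[key][1]:
            best[key] = (target, score, evalue)
    return best


def reciprocal_pairs(best):
    """Map each best-reciprocal-hit pair to the weaker of its two e-values."""
    pairs = {}
    for (query, _), (target, _, evalue) in best.items():
        back = best.get((target, query[0]))
        if back is not None and back[0] == query:
            pairs[frozenset((query, target))] = max(evalue, back[2])
    return pairs


def accepted_brh(pairs, tri_cutoff=1e-6, plain_cutoff=1e-10):
    """Keep pairs passing the cutoff for their class (triangulated or not)."""
    partners = defaultdict(set)
    for pair in pairs:
        a, b = pair
        partners[a].add(b)
        partners[b].add(a)
    accepted = set()
    for pair, evalue in pairs.items():
        a, b = pair
        # A shared BRH partner necessarily lies in a third species.
        triangulated = bool(partners[a] & partners[b])
        if evalue <= (tri_cutoff if triangulated else plain_cutoff):
            accepted.add(pair)
    return accepted


# Made-up genes and hits: A-B-C form a triangle, D-E is an unsupported pair.
A, B, C = ("Btar", "a"), ("Hsap", "b"), ("Mmus", "c")
D, E = ("Btar", "d"), ("Hsap", "e")
hits = [
    (A, B, 100, 1e-8), (B, A, 100, 1e-8),
    (A, C, 200, 1e-20), (C, A, 200, 1e-20),
    (B, C, 200, 1e-20), (C, B, 200, 1e-20),
    (D, E, 90, 1e-8), (E, D, 90, 1e-8),
]
accepted = accepted_brh(reciprocal_pairs(best_hits(hits)))
```

In this toy input the A/B pair (e-value 1e-8) is kept because gene C supports it, whereas the D/E pair with the same e-value is dropped because it has no third-species support and misses the stricter 10^-10 cutoff.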