Codon Usage Bias And Patterns In Genetics

Overlapping genes are adjacent genes in the genome of an organism whose coding sequences overlap partially or completely and are translated into different proteins. The phenomenon has been known since its accidental discovery in the bacteriophage ΦX174 during the sequencing of its genome (Barrell et al., 1976; Sanger et al., 1977). The development of sequencing technologies has made it possible to sequence the complete genomes of a wide variety of organisms. Analyses of genomes from different taxa, viz., viruses, prokaryotes and eukaryotes, suggest that overlapping genes are common in the genomes of all organisms (Normark et al., 1983; Wagner and Simons, 1994; Veeramachaneni et al., 2004; Makalowska et al., 2005; Kim et al., 2009). The most commonly proposed causes of overlap formation are the deletion or mutation of a stop codon, or a frameshift near the end of a gene, any of which extends translation to the next in-frame stop codon (Fukuda et al., 1999, 2003). If the new stop codon lies within the coding region of a neighboring gene, an overlap results. Overlaps can occur between genes located on the same strand or between genes located on complementary strands of DNA. Bacterial and archaeal genomes have a very high gene density, and the coding regions of many consecutive gene pairs overlap. The number of overlapping gene pairs in an organism correlates directly with its total number of genes (Fukuda et al., 2003; Johnson and Chisholm, 2004), and hence overlapping genes occur at a roughly uniform rate across many organisms. In bacterial species, same-strand overlaps are the more abundant (Fukuda et al., 1999; Johnson and Chisholm, 2004), since on average 70% of the genes in bacterial genomes are located on the same strand (Fukuda et al., 2003).
Because a mutation in the overlapping region affects both genes involved, changes are less likely to persist there (Krakauer, 2000; Miyata and Yasunaga, 1978). Research indicates that genomic overlaps are conserved in prokaryotes and can therefore be used as rare genomic markers for inferring phylogeny (Luo et al., 2006). Overlapping genes have regulatory properties (Normark et al., 1983; Cooper et al., 1998), and the proteins they encode are biased in sequence composition toward disorder-promoting amino acids and are predicted to contain significantly more structural disorder than non-overlapping proteins (Rancurel et al., 2009). The present study aimed to find the occurrence of overlaps in the microbial species whose complete genomes were available from the NCBI ftp site, to determine whether codon usage in the overlapping part shows any bias relative to that of the non-overlapping part of the genomes, to identify patterns, if any, in that codon usage bias at different taxonomic levels, and to investigate the biological significance of those patterns.

2 Literature review

2.1 Concept of gene

The concept of the gene has evolved continuously since the term was coined. In the classical view of the 1930s, the gene was conceptualized as an indivisible unit of genetic transmission, recombination, mutation, and function. The neoclassical concept of the gene emerged in the 1940s, after the discovery of genetic recombination and the establishment of DNA as the physical basis of inheritance. During this period the gene was termed the cistron, with mutons and recons as its constituent parts, and each cistron was held to specify one polypeptide. This concept prevailed until the 1970s. From the 1970s onward, the discoveries of repeated genes, split genes and alternative splicing, assembled genes, overlapping genes, transposable genes, complex promoters, multiple polyadenylation sites, polyprotein genes, editing of the primary transcript, and nested genes gave rise to a rather abstract, open, and generalized concept of the gene, which can take on various definitions to suit the context. Other findings that affected the concept of the gene include gene fusion, the obscurity of transcriptional unit boundaries, the presence of encrypted genes in the organelle genomes of microbial eukaryotes and in prokaryotes, the inheritance of the functional status of a gene in addition to its structure, and epigenetic inheritance. Observations in recent years on the structure, function, regulation, and inheritance of the gene have clearly changed the conventional concept, and several efforts have been made to reformulate the definition of a gene.
A new comprehensive definition, the 'systemic' or 'relational' concept of the gene (Portin, 2009), emphasizes that all parts of the organism in which the genes reside must be taken into account, moving from the traditional reductionistic approach to a more systemic one. This coincides with the view of Beurton (2003), who stated that "Genes themselves are products of evolutionary forces at work on the population level. In such a perspective, the issue of reductionism is emptied of all content".

2.2 Overlapping genes

2.2.1 Discovery and prevalence

The existence of overlapping genes, in which the coding sequences of adjacent genes overlap partially or entirely and are translated into different polypeptides, became known through the efforts to determine the complete genome sequences of organisms. Overlapping genes have been known since the mid-1970s, when they were first identified in the single-stranded DNA phage ΦX174 during the sequencing of its genome (Barrell et al., 1976; Sanger et al., 1977). Their discovery contributed to the crisis of the gene concept, which had been widely accepted as a linear model (Portin, 1993, 2009). The first overlapping gene pair discovered comprised genes D and E of bacteriophage ΦX174, which overlap each other and are translated in two different reading frames from a common DNA sequence. Similar overlapping genes were later found in the genomes of DNA viruses, RNA viruses, prokaryotes and eukaryotes (Normark et al., 1983; Samuel, 1989).

Overlapping genes were initially studied extensively in viral genomes, where constraints on genome size make overlap a common way of maximising information content (Dillon, 1987; Pavesi, 2000; McGirr and Buehring, 2006; Pavesi, 2006; Bofkin and Goldman, 2007). Overlapping genes have also been found in the genomes of bacteriophages, mitochondria and prokaryotes. Gene density in bacterial and archaeal genomes is very high, with more than 90% of the genomic DNA coding for proteins, and the coding regions of many gene pairs in these genomes overlap. In Escherichia coli, the promoter of the ampC β-lactamase gene lies within the last gene of the fumarate reductase (frd) operon and acts as the transcription terminator of that preceding operon (Grundström et al., 1982). The first overlapping gene in a eukaryote was discovered in Drosophila melanogaster (Henikoff et al., 1986), where a pupal cuticle protein (Pcp) gene was found within the first intron of the Gart locus, which encodes three enzymes involved in purine biosynthesis. Further instances in eukaryotes were found in the dopa decarboxylase (Ddc) region of Drosophila (Spencer et al., 1986) and in mouse (Williams et al., 1986), both published in the same issue of Nature in 1986. In humans, the first instance reported was the overlap between the last exon of P450c21, the gene encoding human adrenal steroid 21-hydroxylase, and a transcript from the opposite strand (Morel et al., 1989). Other instances of overlapping genes in humans have since been reported (Bristow et al., 1993; Kennerson et al., 1997; Cooper et al., 1998; Zhou et al., 2003). Overlapping gene pairs have also been reported in plants in several studies (Terryn and Rouzé, 2000).
Novel pairs of overlapping genes were found serendipitously during studies of specific gene loci (Vanhee-Brossollet and Vaquero, 1998). However, only those overlapping gene pairs that give rise to natural antisense transcripts were investigated further, because genomic overlap was considered uncommon and little was known about its biological function (Dolnick, 1997; Vanhee-Brossollet and Vaquero, 1998; Kumar and Carmichael, 1998). Bioinformatics tools have been used to discover and analyze overlapping genes since the draft sequences of human, mouse and fruit fly became available in public databases (Shendure and Church, 2002; Fahey et al., 2002; Yelin et al., 2003; Kiyosawa et al., 2003). The inferred number of overlapping genes may nevertheless change, because such analyses are hindered by poor annotation, sequencing errors and the limitations of gene-finding algorithms (Burge and Karlin, 1998).

2.2.2 Origin of genomic overlaps

A variety of explanations have been proposed for the origin of overlaps in genomes. Studies in viruses (McGirr and Buehring, 2006; Pavesi, 2000, 2006; Bofkin and Goldman, 2007) indicate that overlapping genes evolved as a way of packing more genetic information into the genome. Later research also supported the view that overlaps between genes arise from a mutational bias toward deletion (Clark et al., 2001). An alternative theory holds that overlapping genes play a significant role in the regulation of gene expression through translational coupling and in the regulation of protein-protein interaction (Normark et al., 1983; Chen et al., 1990; Inokuchi et al., 2000). A comparative study by Fukuda et al. of the genomic overlaps in two Mycoplasma species, Mycoplasma genitalium and Mycoplasma pneumoniae, showed that the overlaps in these species were created by elongation of the 3' end following the loss of a stop codon in one of the genes involved. A stop codon can be lost through a deletion, a point mutation or a frameshift near the end of the coding region, which elongates the 3' end of the coding region. In other studies, overlapping genes were suspected to have arisen through overprinting (Keese and Gibbs, 1992; Sander and Schulz, 1979), which generates a different coding sequence from an existing nucleotide sequence by translating it de novo in a different reading frame or from noncoding open reading frames; phylogenetic studies show that the original gene function is retained.
Translation of multiple reading frames can also occur through internal de novo initiation in an alternative reading frame, without any need for ribosomal frameshifting (Atkins et al., 1979; Chang et al., 1989), although translation in different reading frames can also be mediated by ribosomal frameshifting (Jacks et al., 1988; Wilson et al., 1988; Brierley et al., 1989). More recently, Cock and Whitworth presented a model that attributes the abundance of unidirectional overlaps with a relative reading frame bias in prokaryotes to N-terminal extension of the downstream gene through the adoption of a new start codon (Cock and Whitworth, 2007, 2010).
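The 3' elongation mechanism described above can be illustrated with a toy example. The sketch below is not taken from any of the cited studies: the sequence, the gene coordinates and the helper name `orf_end` are all hypothetical, and only the standard stop codons TAA, TAG and TGA are assumed. It shows how a point mutation in the stop codon of an upstream gene extends its reading frame to the next in-frame stop, which here falls inside the downstream gene.

```python
# Toy illustration of overlap creation by stop codon loss (hypothetical data).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_end(seq, start):
    """Return the index just past the first in-frame stop codon, or None."""
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i + 3
    return None


# Gene A (frame 0), a one-base spacer, then gene B (frame 1).
gene_a = "ATGAAACCCTAA"
gene_b = "ATGGCCGATAACTAG"
seq = gene_a + "G" + gene_b
b_start = len(gene_a) + 1

a_end_wild = orf_end(seq, 0)          # gene A stops at its own TAA
b_end = orf_end(seq, b_start)         # gene B is read in a different frame

# Point mutation TAA -> CAA removes the stop codon of gene A.
mutant = seq[:9] + "C" + seq[10:]
a_end_mutant = orf_end(mutant, 0)     # translation runs on into gene B

overlap = max(0, min(a_end_mutant, b_end) - b_start)
print("wild-type A ends at", a_end_wild)
print("mutant A ends at", a_end_mutant, "overlap with B:", overlap, "nt")
```

In this toy sequence the two genes are separated by one base before the mutation and overlap afterwards, because the next stop codon in the frame of gene A lies within the coding region of gene B.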

2.2.3 Classification of the overlapping genes

Different criteria have been used to classify overlapping gene pairs, including the reciprocal orientation of the genes, the extent of the overlap and the regions involved in it (Kiyosawa et al., 2003; Boi et al., 2004; Solda et al., 2008). By the reciprocal direction of transcription of the two genes, overlaps can be classified as parallel, where both genes are transcribed in the same direction, or antiparallel, where the genes are transcribed in opposite directions. Antiparallel overlaps can be further divided into convergent, where the overlap is formed by the 3' termini of both genes, and divergent, where the overlap is formed by the 5' termini of the genes. By the extent of the overlap, overlapping genes can be classified as "complete", when the sequence of one gene lies entirely within the sequence of the other, or "partial", when only a part of each gene's sequence overlaps (Kiyosawa et al., 2003; Boi et al., 2004; Solda et al., 2008). "Complete" overlaps can be further divided into "nested genes", where the entire sequence of one gene lies within an intron of the other, and "embedded genes", which share more than one intron or exon. Another system classifies overlaps by the regions involved, viz., the 5' untranslated region (UTR), the 3' untranslated region (UTR), the coding sequence (CDS) and introns (Boi et al., 2004). The classification of overlaps given by Solda et al., together with the extent of overlap in the different kinds of overlaps, is shown in Figure 2.1.
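The orientation and extent criteria above can be expressed as a small classifier. This is a minimal sketch under stated assumptions: genes are represented by hypothetical forward-strand coordinates (0-based start, exclusive end) and a strand sign, introns and UTRs are ignored, and the `Gene` type and `classify_overlap` function are names introduced here for illustration only. For a complete antiparallel overlap the convergent/divergent distinction is not applied and the pair is simply labelled "antiparallel".

```python
from typing import NamedTuple, Optional, Tuple


class Gene(NamedTuple):
    name: str
    start: int   # 0-based, inclusive, forward-strand coordinate
    end: int     # exclusive
    strand: str  # "+" or "-"


def classify_overlap(a: Gene, b: Gene) -> Optional[Tuple[str, str, int]]:
    """Return (orientation, extent, overlap_length), or None if no overlap."""
    left, right = sorted((a, b), key=lambda g: (g.start, g.end))
    length = min(left.end, right.end) - right.start
    if length <= 0:
        return None
    # One gene lies entirely within the other.
    contained = right.end <= left.end or left.start == right.start
    extent = "complete" if contained else "partial"
    if left.strand == right.strand:
        orientation = "parallel"
    elif contained:
        orientation = "antiparallel"
    elif left.strand == "+":
        orientation = "convergent"   # 3' termini of both genes overlap
    else:
        orientation = "divergent"    # 5' termini of both genes overlap
    return orientation, extent, length


a = Gene("a", 0, 100, "+")
print(classify_overlap(a, Gene("b", 90, 200, "+")))
print(classify_overlap(a, Gene("c", 90, 200, "-")))
print(classify_overlap(Gene("d", 0, 100, "-"), Gene("e", 90, 200, "+")))
print(classify_overlap(a, Gene("f", 10, 50, "+")))
```

Sorting the pair by start coordinate keeps the logic independent of the order in which the two genes are passed.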

2.2.4 Phase and strand bias by overlapping genes

The effect of a gene overlap varies with the type of overlap, the regions involved and the phase of the overlap. The majority (~84%) of known overlapping genes in eukaryotes lie on opposite strands in an antiparallel arrangement (Boi et al., 2004). Studies of overlapping genes in pairs of human-mouse orthologous genes likewise show that the majority (~90%) of overlaps are different-strand overlaps, and that evolutionary transitions between overlapping and non-overlapping genes are caused mainly by higher rates of evolution in the untranslated regions, particularly the 3' untranslated regions (Sanna et al., 2008). Evidence from eukaryotes shows that antiparallel overlaps, by means of double-stranded RNA formation, play roles in different biological processes, including transcription, RNA editing, mRNA splicing and stability, and translation (Vanhee-Brossollet and Vaquero, 1998; Kumar and Carmichael, 1998). Korbel et al. state that if the genes involved in an overlap are transcribed divergently with conserved gene orientation, they must be strongly co-regulated (Korbel et al., 2004).

In prokaryotes, by contrast, the pattern is reversed, and same-strand overlaps form the majority. Unidirectional overlaps are the most conserved in prokaryotic genomes, and because conserved operons strongly indicate functional associations, links can be predicted between all the conserved overlapping gene pairs (Dandekar et al., 1998; Overbeek et al., 1999; Johnson and Chisholm, 2004). The greater number of opposite-strand overlaps in eukaryotes may be due to their complex gene structures with introns and to the role that genomic overlaps play in the regulation of gene expression. Phase describes the shift in reading frame between adjacent genes (Rogozin et al., 2002). Depending on which codon positions face each other in an overlap, the effects of DNA mutations on the two participating genes can differ. The different ways of placing codon positions against each other are termed phases. For each type of overlap there can be three distinct phases, except for unidirectional overlaps, where an in-frame arrangement would not yield two distinct reading frames. In prokaryotes, same-strand overlaps occur mostly in the +1 and +2 reading frames, whereas different-strand overlaps are evenly distributed among the three reading frames (Johnson and Chisholm, 2004).
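For same-strand overlaps, the notion of phase can be made concrete with modular arithmetic on the start coordinates. The following is a minimal sketch, assuming both genes lie on the forward strand and are given by hypothetical 0-based start positions; the function names are introduced here for illustration, and the labelling of shifts as +1 or +2 follows one possible convention (conventions differ between authors).

```python
def same_strand_phase(upstream_start, downstream_start):
    """Frame shift (0, 1 or 2) of the downstream gene relative to the upstream one."""
    return (downstream_start - upstream_start) % 3


def facing_positions(shift):
    """Map each codon position (1-3) of the upstream gene to the codon
    position of the downstream gene that occupies the same nucleotide."""
    return {p + 1: (p - shift) % 3 + 1 for p in range(3)}


# A 4-nt overlap of the form ATGA: the upstream gene (start 0) ends with
# TGA at positions 9-11 and the downstream gene starts with ATG at position 8.
shift = same_strand_phase(0, 8)
print("shift:", shift)
print("upstream -> downstream codon positions:", facing_positions(shift))
```

With a shift of 2, the first codon position of the upstream gene faces the second position of the downstream gene, so a mutation that is synonymous in one frame can still change the amino acid encoded in the other.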

2.2.5 Databases of overlapping genes in prokaryotes

A variety of databases contain information about overlapping genes in prokaryotes. The interactive database server BPhyOG (Luo et al., 2007), freely available at http://cmb.bnu.edu.cn/BPhyOG, can be used to reconstruct the phylogenies of completely sequenced prokaryotic genomes from their shared overlapping genes. OGtree (Jiang et al., 2008), a web-based tool, can also be used to reconstruct genome trees of some prokaryotes from an overlapping gene distance that combines (1) overlapping gene content, i.e., the normalized number of shared orthologous OG pairs, and (2) overlapping gene order, i.e., the normalized OG breakpoint distance. Because OGtree defines the overlapping gene distance by breakpoints, it cannot be used to calculate the overlapping gene distance of multi-chromosomal genomes. OGtree2 (Cheng et al., 2010) extends OGtree (Jiang et al., 2008) with a new overlapping gene order distance that takes genomic rearrangements such as reversals, transpositions and translocations into account, and with the use of regulatory regions in identifying overlapping genes, so that genome trees can be reconstructed more precisely. PairWise Neighbours (Pallejà et al., 2009) is another database, which contains information about the spacers and overlapping genes in bacterial and archaeal genomes and their conservation across species. It currently houses 1,956,294 gene pairs from 678 fully sequenced prokaryote genomes and is freely available at http://genomes.urv.cat/pwneigh. The database also permits a reliability analysis of overlapping structures by taking into account the presence and location, between adjacent prokaryotic genes, of the Shine-Dalgarno (SD) sequence, the ribosomal binding site in mRNA located about 8 bases upstream of the start codon, which in turn is responsible for efficient translation.

2.2.6 Significance of genomic overlaps

Studies of overlapping genes in organisms at different taxonomic levels have led to different conclusions about why they exist. The major explanations are that they contribute to the minimisation of genome size (Sakharkar et al., 2005), to the regulation of gene expression (Johnson and Chisholm, 2004), and to a restriction on the evolution of the genes, since a change in the overlapping part changes both of the proteins translated (Keese and Gibbs, 1992; Krakauer and Plotkin, 2002). The evolutionary rate is slower for genes that acquire an overlapping pattern of gene expression (Miyata and Yasunaga, 1978). The novel genes formed, and their products, may have major evolutionary implications (Keese and Gibbs, 1992). A study of 13 different γ-Proteobacteria genomes showed that overlapping genes can be used as rare genomic markers to gain insight into the phylogeny of completely sequenced microbial genomes (Luo et al., 2006). Studies of natural antisense transcripts (NATs) in prokaryotic cells (Lacatena and Cesareni, 1981; Itoh and Tomizawa, 1980) have helped in understanding and interpreting the role of genomic overlaps in cell regulation. Natural antisense transcripts are endogenous transcripts that are complementary to sense transcripts of known function. They may be cis-encoded, transcribed from the overlapping part and perfectly complementary to the sense transcript, or trans-encoded, transcribed from a different locus (Vanhee-Brossollet and Vaquero, 1998). Published reports show that antisense transcription is also more common in eukaryotes than had been thought. The role of complementary transcripts in regulatory processes including transposition, plasmid replication and transcription has been demonstrated in bacteria (Wagner et al., 2002).
Most antisense RNAs are post-transcriptionally acting inhibitors of target genes, but a few examples of activator antisense RNAs are known (Wagner et al., 2002). Studies based on experimental data showed that the conserved regions within the untranslated and protein-coding regions of vertebrate mRNAs have very important roles in the regulation of mRNA stability. They form long perfect duplexes with antisense transcripts, which is essential for recognition by post-transcriptional regulatory systems (Lipman, 1997). Studies of antisense transcripts in the human genome found that most genomic overlaps occur in the 5' or 3' exons, which contain the untranslated regulatory regions of mRNAs, leading to the conclusion that sense-antisense overlap may be associated with gene regulation and that these regions are highly conserved (Yelin et al., 2003).

Information about antisense transcription helps in the study of RNA interference (Bosher and Labouesse, 2000), in which the presence of dsRNA leads to gene silencing, and in the selection of synthetic antisense oligonucleotides for functional studies and drug design (Delihas et al., 1997; Yelin et al., 2003). In yeast, 79 of 137 newly discovered open reading frames, non-annotated but expressed, were found to lie opposite previously annotated genes (Kumar et al., 2002). Hence, overlapping genes are present throughout eukaryotic genomes, as they are in prokaryotes and viruses, and also play a role in antisense-mediated gene regulation in eukaryotes. Analyses of the proteins encoded by manually curated overlapping genes from 43 genera of unspliced RNA viruses infecting eukaryotes show that their composition is biased towards disorder-promoting amino acids and that they have more structural disorder than non-overlapping proteins (Rancurel et al., 2009). In these viruses, some of the overlapping proteins were created de novo and are present only in certain species or genera, with a role in viral pathogenicity or spread and no role in viral replication or structure. Some of the novel proteins predicted to have ordered structures had novel folds (Rancurel et al., 2009). A comparison of amino acid composition revealed an increased frequency of amino acid residues with a high level of degeneracy (arginine, leucine, and serine) in the proteins encoded by overlapping genes, which can be viewed as a way of expanding their coding ability and gaining new specialized functions (Pavesi et al., 1997). Overlapping genes are present in almost all sequenced microbial genomes, and it has been estimated that a third of microbial genes overlap (Fukuda et al., 2003; Johnson and Chisholm, 2004).
A strong correlation has been found between the total number of genes and the total number of overlapping genes in the genome of an organism (Fukuda et al., 2003; Johnson and Chisholm, 2004). Overlapping genes are more conserved between species than non-overlapping genes, because a mutation in the overlapping part changes both of the genes involved (Krakauer, 2000; Sakharkar et al., 2005). Selective forces against such mutations are therefore stronger, and overlapping genes can be used as phylogenetic markers or characters for reconstructing evolutionary trees of bacterial genomes (Luo et al., 2006; Luo et al., 2007), with the normalized number of shared orthologous overlapping gene pairs as the distance measure.

2.3 Codon usage

2.3.1 The genetic code

The triplets of the genetic code are built from four nucleotides, viz., adenine, guanine, cytosine and uracil, and were envisioned to account for the 20 amino acids (Crick et al., 1961). The genetic code is conserved in all living things with a few reassignments (Knight et al., 2001) and is hence regarded as nearly universal. The "frozen accident" hypothesis states that the standard genetic code became fixed because all living things share a common ancestor, and that any subsequent reassignment of codons would have had deleterious effects. The codons in the standard table are arranged nonrandomly, and there are different theories of the origin and evolution of the code, which are not mutually exclusive. According to the stereochemical theory, codon assignments are determined by the physicochemical affinity between amino acids and their anticodons. The coevolution theory states that the structure of the code coevolved with the amino acid biosynthesis pathways, and a further view holds that the code evolved so as to minimize the adverse effects of point mutations and translation errors. Studies of the evolution of the genetic code and structural analyses indicate that the genetic code is robust to translational misreading. However, evidence that more robust codes are possible suggests that the genetic code originated through a combination of frozen accident and selection for error minimization.

2.3.2 Variations in the standard genetic code

Variations of the standard genetic code have been known since the discovery of an alternative genetic code used by human mitochondrial genes. In addition to the standard genetic code, 16 alternative genetic codes, which code for proteins in different organisms, are now available in the NCBI taxonomy database (Sayers et al., 2010). The different genetic codes are given in Figure 2.2. Translation table 11, the bacterial, archaeal and plant plastid code, and translation table 4, the mold, protozoan and coelenterate mitochondrial code and the mycoplasma/spiroplasma code, are illustrated in Figure 2.3 and Figure 2.4, respectively. With translation table 11, translation initiation is most efficient at AUG, but two additional codons, GUG and UUG, can also serve as start codons in archaea and bacteria (Kozak, 1983; Fotheringham et al., 1986; Golderer et al., 1995; Nolling et al., 1995; Sazuka and Ohara, 1996; Genser et al., 1998; Wang et al., 2003). In E. coli, UUG acts as the initiator codon for around 3% of the bacterium's proteins (Blattner et al., 1997), and CUG can function as an initiator for one plasmid-encoded protein (RepA) (Spiers and Bergquist, 1992). In exceptional cases bacteria use AUU, in addition to NUG, as the translation initiation codon (Polard et al., 1991; Binns and Masters, 2002). The internal codon assignments of table 11 are the same as in the standard code, although UGA codes for tryptophan at low efficiency in Bacillus subtilis and, presumably, in E. coli (Hatfield and Diamond, 1993). Translation table 4 differs from the standard genetic code in that UGA, a termination codon in the standard code, codes for tryptophan. Its alternative initiation codons include UUA, UUG and CUG for Trypanosoma; AUU and AUA for Leishmania; AUU, AUA and AUG for Tetrahymena; and AUU, AUA, AUG, AUC, GUG and possibly GUA for Paramecium.
Table 4 is used for certain bacteria, fungi, other eukaryotes and metazoa. In bacteria, the code is used by some orders of the class Mollicutes, viz., Entomoplasmatales and Mycoplasmatales. The reassignment of UGA from a stop codon to tryptophan is also found in an α-proteobacterial symbiont of cicadas, Candidatus Hodgkinia cicadicola (McCutcheon et al., 2009). The table is also used by certain fungal species and by other eukaryotic species, including Gigartinales among the red algae and the protozoans Trypanosoma brucei, Leishmania tarentolae, Paramecium tetraurelia, Tetrahymena pyriformis and Plasmodium gallinaceum (Aldritt et al., 1989). In the metazoa, the code applies to the Coelenterata, comprising the phyla Ctenophora and Cnidaria. The code is also used for kinetoplast DNA (maxicircles, minicircles), which belongs to modified mitochondria (or their parts).
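The practical consequence of the UGA reassignment can be seen by translating the same sequence with both tables. The sketch below is illustrative only: it builds the sense-codon assignments from the conventional 64-letter amino acid string in TCAG order (DNA alphabet, so UGA is written TGA), ignores alternative start codons, and uses a made-up sequence and hypothetical helper names.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
# Tables 1 and 11 share these internal assignments.
AA_STANDARD = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")


def build_table(aa_string):
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, aa_string))


TABLE_11 = build_table(AA_STANDARD)
TABLE_4 = dict(TABLE_11, TGA="W")  # UGA codes for tryptophan in table 4


def translate(cds, table):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = table[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


cds = "ATGTGGTGAAAATAA"  # ATG TGG TGA AAA TAA (toy sequence)
print("table 11:", translate(cds, TABLE_11))
print("table 4: ", translate(cds, TABLE_4))
```

Under table 11 the internal TGA terminates translation, whereas under table 4 it is read through as tryptophan, so the choice of table directly affects where a gene is predicted to end and hence whether it overlaps its neighbour.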

2.3.3 Codon usage bias

Of the 64 triplet codons in the genetic code, 61 code for amino acids and 3 are stop codons. All 20 amino acids except methionine and tryptophan are encoded by more than one codon, which makes the genetic code degenerate. The different codons that specify the same amino acid are called synonymous codons. Evidence from diverse organisms indicates that alternative synonymous codons are not used with equal frequency when amino acids are incorporated into polypeptides.
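These counts can be checked directly from the standard code. The following short sketch assumes the conventional 64-letter amino acid string for the standard code in TCAG codon order; the variable names are chosen here for illustration.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AA_STANDARD = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TO_AA = dict(zip(("".join(c) for c in product(BASES, repeat=3)),
                       AA_STANDARD))

# Number of synonymous codons per amino acid ("*" marks stop codons).
family_sizes = Counter(CODON_TO_AA.values())
stop_count = family_sizes.pop("*")
sense_count = sum(family_sizes.values())
single_codon = sorted(aa for aa, n in family_sizes.items() if n == 1)

print("sense codons:", sense_count, "stop codons:", stop_count)
print("amino acids with one codon:", single_codon)
print("sixfold degenerate:", sorted(aa for aa, n in family_sizes.items() if n == 6))
```

The only amino acids with a single codon are methionine (M) and tryptophan (W), while arginine, leucine and serine each have six synonymous codons.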

Patterns of codon usage have been analyzed since the first molecular sequence databases were collated (Grantham et al., 1981). Studies of synonymous codon usage demonstrate that the genes within a species show similar patterns of codon usage (Grantham et al., 1980; Grantham et al., 1981), as stated by the genome hypothesis (Grantham et al., 1980). Heterogeneity of codon usage within a species was first demonstrated in E. coli, where the subset of codons best recognised by the most abundant tRNAs is used more often in highly expressed genes (Gouy and Gautier, 1982; Ikemura, 1985; Sharp and Li, 1986). There is further evidence of differing patterns of codon usage among the genes within a species (Aota and Ikemura, 1986; Sharp and Li, 1986). Multivariate analyses also agree that within each species there can be a trend in codon usage among genes, from highly biased to even usage of synonymous codons. Summing the patterns of all the genes in an organism to obtain the codon usage of the organism may conceal this underlying heterogeneity (Aota et al., 1988). Hence, it is better to describe the codon usage trends among the genes of a species. Closely related species show similar patterns of codon usage.

2.3.4 Causes of codon usage bias

A variety of causes and consequences of the variability in codon usage have been identified (Sharp and Cowe, 1991). Codon usage bias arises from one or a combination of several factors, including mutational bias, translational selection among synonymous codons, and selection against particular structures in DNA. Amino acid usage also varies between proteins, which in turn correlates with the properties of the proteins (Lobry and Gautier, 1994). Different studies have attributed the variation in codon usage to translational selection (Grantham et al., 1981), replication-transcriptional selection (McInerney, 1998) and mutational bias (Levin and Whittome, 2000), among other factors. Other biological phenomena associated with codon usage bias are the gene expression level, gene length, the translation initiation signal of the gene, the amino acid composition and structure of the protein, tRNA abundance, the frequency and patterns of mutation, and GC composition (Ikemura, 1981; Gouy and Gautier, 1982; Bernardi and Bernardi, 1986; Bains, 1987; Karlin and Mrazek, 1996; Ma et al., 2002; D'Onofrio et al., 2002; Wan et al., 2003; Wan et al., 2004; Gu et al., 2004). Quantifying codon usage bias within and among the genomes of different organisms can give insights into the evolution and environmental adaptation of living organisms.

2.3.5 Calculation of codon usage bias

Codon usage bias can be evaluated with methods based on statistical distributions and with methods that compare the codon usage to that of a set of optimal codons, which serves as a reference (Bennetzen and Hall, 1982). Methods based on statistical distributions include the codon-usage preference bias measure based on χ2 and scaled χ2 analyses (McLachlan et al., 1984; Shields and Sharp, 1987). Another method, based on Shannon information theory and called synonymous codon usage order, measures synonymous codon usage bias within a genome and across genomes (Bernardi and Bernardi, 1986; McLachlan et al., 1984). Codon and amino acid usage bias are also affected by GC composition (Knight et al., 2001). The web server CodonO analyses codon usage bias and its correlation with GC composition, within and across genomes (Angellotti et al., 2007). Codon usage bias and amino acid usage bias can also be analyzed by multivariate analyses, of which correspondence analysis and cluster analysis are among the most used. Cluster analysis partitions the data according to the trends within it, whereas correspondence analysis finds trends in the data and distributes the genes of a genome, or the genomes of different species, along axes according to those trends. Another multivariate method used to investigate heterogeneous codon usage in a wide range of species is principal component analysis (Grantham et al., 1980; Kanaya et al., 1996a; Kunst et al., 1997). These multivariate methods are usually applied to relative codon usage frequencies (normalised data) rather than absolute codon usage frequencies, to avoid the biases of gene length and amino acid usage, which would otherwise obscure the variation in synonymous codon usage. Multivariate analyses are performed on rectangular matrices in which the columns represent the relative codon frequencies and the rows represent the individual species.
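The normalisation step and a χ2-style measure can be sketched in a few lines. This is a simplified illustration rather than the exact procedure of any cited method: relative frequencies are computed within each synonymous family of the standard code, and the scaled χ2 is taken here, as one common formulation, to be the χ2 deviation from equal synonymous usage divided by the number of codons counted (single-codon families excluded). Function names and the toy sequence are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AA_STANDARD = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TO_AA = dict(zip(("".join(c) for c in product(BASES, repeat=3)),
                       AA_STANDARD))

# Synonymous families: amino acid -> list of its sense codons.
FAMILIES = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    if aa != "*":
        FAMILIES[aa].append(codon)


def count_codons(cds):
    """Count sense codons; stop codons and ambiguous triplets are skipped."""
    cds = cds.upper()
    codons = (cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    return Counter(c for c in codons if CODON_TO_AA.get(c, "*") != "*")


def relative_frequencies(counts):
    """Frequency of each codon within its synonymous family."""
    freqs = {}
    for family in FAMILIES.values():
        total = sum(counts[c] for c in family)
        for c in family:
            freqs[c] = counts[c] / total if total else 0.0
    return freqs


def scaled_chi_square(counts):
    """Chi-square deviation from equal synonymous usage, per codon counted."""
    chi2, n = 0.0, 0
    for family in FAMILIES.values():
        total = sum(counts[c] for c in family)
        if len(family) == 1 or total == 0:
            continue
        expected = total / len(family)
        chi2 += sum((counts[c] - expected) ** 2 / expected for c in family)
        n += total
    return chi2 / n if n else 0.0


cds = "ATG" + "CTG" * 6 + "AAA" * 2 + "AAG" * 2 + "TAA"
counts = count_codons(cds)
rel = relative_frequencies(counts)
print("CTG:", rel["CTG"], "AAA:", rel["AAA"])
print("scaled chi-square:", scaled_chi_square(counts))
```

One row of the rectangular matrix described above would be the vector of `rel` values for a gene or genome, in a fixed codon order; stacking such rows gives the input for correspondence or principal component analysis.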

2.3.6 Codon usage database

The Codon usage database (Nakamura et al., 2000) is an extended web version of the codon usage tabulated from GenBank (Nakamura et al., 1997; Benson et al., 2010), denoted by CUTG. The database provides information about the frequencies of codon usage in different organisms, searchable through the web and compiled from the taxonomical divisions of the GenBank sequence database. The database has been updated with the nucleotide sequences obtained from the NCBI-GenBank Flat File Release 160.0 and presently houses codon usage tables for 35,779 organisms obtained from completely sequenced protein coding genes (coding sequences), avoiding codons with ambiguous bases. Codon usage for the organisms and for each of the genes in an organism is available through the web site http://www.kazusa.or.jp/codon/ either in the codon frequency compatible format or in the traditional table format, which helps in analyzing variations in codon usage among different genomes.

2.3.7 GenBank

GenBank (Benson et al., 2010) is a comprehensive and publicly available nucleotide sequence database maintained by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health. NCBI was formed in 1988 to develop information systems for molecular biology. GenBank receives nucleotide sequences primarily from the scientific community and from the daily data exchange through the International Nucleotide Sequence Database Collaboration (INSDC), a combined effort by the DNA Databank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL) and GenBank that ensures a uniform and comprehensive collection of sequence information worldwide. GenBank can be accessed through the NCBI Entrez retrieval system. The bi-monthly releases and daily updates of the GenBank database can be accessed via anonymous ftp from NCBI at ftp.ncbi.nih.gov/genbank, and the complete genomes are available from ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/. GenBank releases are distributed from NCBI's anonymous ftp server in the flat file format as well as in the Abstract Syntax Notation 1 (ASN.1) version used for internal maintenance.

2.3.8 Taxonomy database at NCBI

The Taxonomy database at NCBI attempts to incorporate phylogenetic and taxonomic knowledge from a wide range of sources, viz., the published literature, web-based databases, the sequence submitters, taxonomic experts outside NCBI etc. The consistent taxonomy provided in NCBI's sequence databases (Sayers et al., 2010) is taken from the Taxonomy database. It comprises the names and lineages of all the organisms for which at least one nucleotide or protein sequence is available in NCBI's sequence databases. As of May 2010, there were 372,435 taxa, excluding the uncultured and unspecified. The NCBI taxonomy database serves as the standard reference for the International Nucleotide Sequence Database Collaboration (INSDC) involving GenBank, EMBL and DDBJ. The database provides links to all data for each taxonomic node from superkingdom to subspecies. The taxonomy browser can be used to view the taxonomic position of, or retrieve data from any of the Entrez databases for, a particular organism or group. The database is also available as files from the ftp site (ftp://ftp.ncbi.nih.gov/pub/taxonomy/) and as a domain of Entrez (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Taxonomy), which is updated every two hours.

2.4 Multivariate analyses for detecting codon usage bias

2.4.1 Principal component analysis

Principal component analysis (PCA) is an unsupervised statistical learning method used for transforming high dimensional data to a lower dimensional space (Morrison, 1976; Jolliffe, 2002), preferably two dimensional. It can be used for exploring the data and revealing consistent patterns within it. It has applications in a wide range of research areas including image processing, genomic analysis, information retrieval etc. It describes the structure of high dimensional data by reducing its dimensionality to uncorrelated principal components, which explain the variation in the data (Morrison, 1976). The first principal component accounts for as much of the variability in the data as possible, and the second accounts for as much of the remaining variability as possible. Further components account for progressively smaller amounts of the residual variation. The plot of the first principal component versus the second shows the distribution of the points on a principal plane, where the x-axis corresponds to the projection of each point on the first principal component and the y-axis to its projection on the second principal component.
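The decomposition described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the procedure of the cited studies: it assumes a sample covariance matrix, finds each component by power iteration followed by deflation (a simple substitute for the eigen- or singular value decomposition that statistical packages normally use), and runs on a small table of hypothetical relative codon frequencies. All function names are illustrative.

```python
import math

def center_columns(rows):
    """Subtract each column mean so the data are centred on the origin."""
    n = len(rows)
    means = [sum(column) / n for column in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]

def covariance_matrix(centered):
    """Sample covariance matrix of already-centred data."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(d)] for i in range(d)]

def leading_component(cov, iterations=500):
    """Power iteration: (variance, unit direction) of the largest component."""
    d = len(cov)
    # Non-uniform start: relative frequencies sum to a constant per family, so a
    # uniform start vector could be orthogonal to every direction of variation.
    v = [float(i + 1) for i in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-15:  # no variance left to explain
            return 0.0, [0.0] * d
        v = [x / norm for x in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return variance, v

def principal_components(rows, n_components=2):
    """Return the centred data and the first (variance, direction) pairs."""
    centered = center_columns(rows)
    cov = covariance_matrix(centered)
    d = len(cov)
    components = []
    for _ in range(n_components):
        variance, direction = leading_component(cov)
        components.append((variance, direction))
        # Deflation removes the component just found before the next search.
        cov = [[cov[i][j] - variance * direction[i] * direction[j]
                for j in range(d)] for i in range(d)]
    return centered, components

def project(centered, direction):
    """Coordinate of every row along one principal component."""
    return [sum(x * w for x, w in zip(row, direction)) for row in centered]

# Hypothetical relative codon frequencies: rows are species, columns are codons.
species_by_codon = [
    [0.70, 0.30, 0.55, 0.45],
    [0.65, 0.35, 0.60, 0.40],
    [0.30, 0.70, 0.40, 0.60],
    [0.25, 0.75, 0.45, 0.55],
]
centered_data, found = principal_components(species_by_codon)
x_axis = project(centered_data, found[0][1])  # first principal component
y_axis = project(centered_data, found[1][1])  # second principal component
print(list(zip(x_axis, y_axis)))
```

The pairs printed at the end are the coordinates that would be drawn on the principal plane, one point per species.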

2.4.2 Self organizing maps

Principal component analysis has low resolving power when genes from a large number of species are analyzed simultaneously. Hence the self organizing map, a neural network algorithm with higher resolving power, can be used to analyze codon usage bias patterns. The self organizing map is a form of artificial neural network that was first proposed in the 1970s (von der Malsburg, 1973; Kohonen, 1982) and works on the principle of unsupervised learning. It can be used for clustering and visualising the trends inherent to the problem. It maps a set of high dimensional input vectors onto a low-dimensional (one or two dimensional) grid through vector quantization (Kohonen et al., 1996). The self organizing map uses a neighborhood function and preserves the topological properties of the input space (Kohonen et al., 1996). It is useful for visualizing low-dimensional views of high-dimensional data. Self organizing maps have proved efficient for characterising horizontally transferred genes (Kanaya et al., 2001). A self organizing map operates in two modes: training, in which it constructs the map from input examples by vector quantization, and mapping, in which it classifies new input vectors. The training phase initializes each node's weights, then repeatedly chooses an input vector at random from the training set and presents it to the lattice. Each node is examined to find the one whose weights are most like the input vector; the node with the weight vector closest to the input vector is called the best matching unit. The radius of the neighborhood of the best matching unit is calculated, and the nodes within this radius are considered to be in the best matching unit's neighborhood. Each neighboring node's weights are adjusted to make them more like the input vector. The closer a node is to the best matching unit, the more its weights are altered. The area of the neighborhood shrinks over time by an exponential decay function until it contains just one node, the best matching unit. The learning rate also decays exponentially towards zero over the iteration steps. The effect of learning on a node decreases with its distance from the best matching unit.
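The training steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited work: the lattice is one-dimensional, the influence of the best matching unit falls off as a Gaussian of the grid distance, and the decay constant, initial radius, initial learning rate and input vectors are all hypothetical choices.

```python
import math
import random

def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to x (Euclidean)."""
    def squared_distance(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(weights)), key=lambda i: squared_distance(weights[i]))

def train_step(weights, x, learning_rate, radius):
    """Move the best matching unit and its neighbours towards x; return its index."""
    bmu = best_matching_unit(weights, x)
    for i, w in enumerate(weights):
        grid_distance = abs(i - bmu)  # distance along the 1-D lattice
        if grid_distance > radius:
            continue  # node lies outside the current neighbourhood
        # Assumed Gaussian neighbourhood: closer nodes are altered more.
        influence = math.exp(-grid_distance ** 2 / (2.0 * radius ** 2))
        for k in range(len(w)):
            w[k] += learning_rate * influence * (x[k] - w[k])
    return bmu

def train_som(data, n_nodes=4, n_iterations=200, lr0=0.5, seed=0):
    """Train a 1-D map; radius and learning rate both decay exponentially."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    radius0 = n_nodes / 2.0
    time_constant = n_iterations / 4.0  # assumed decay speed
    for t in range(n_iterations):
        decay = math.exp(-t / time_constant)
        x = rng.choice(data)  # input vector chosen at random
        train_step(weights, x, lr0 * decay, radius0 * decay)
    return weights

# Hypothetical two-dimensional input vectors with values between 0 and 1.
codon_vectors = [[0.10, 0.20], [0.90, 0.80], [0.15, 0.10], [0.85, 0.95]]
trained = train_som(codon_vectors)
# Mapping phase: each input vector is assigned to its best matching unit.
for vector in codon_vectors:
    print(vector, "-> node", best_matching_unit(trained, vector))
```

The last loop corresponds to the mapping mode, in which input vectors are classified by the node they fall on once the weights are fixed.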

2.4.3 Correspondence analysis

Correspondence analysis (Greenacre, 1984) is a method to quantify categorical data by assigning numerical scale values, with certain optimal properties, to the response categories of discrete variables. These scale values have been shown to have interesting geometric properties and provide maps of the relationships between variables. Correspondence analysis produces a map of data that can be represented by a contingency table, in which each row and each column is represented by a point. The working principle is similar to that of principal component analysis (PCA): the total variance of the data is decomposed optimally along the principal axes. The first and the second principal axes account for a large percentage of the total variance, which allows the data to be visualized in two dimensions. Correspondence analysis is based on three basic concepts, viz., (1) the point in multidimensional space, (2) the weight associated with the point and (3) the χ2 distance, which is the distance function between the points. Correspondence analysis projects the points into a low dimensional subspace, usually a two-dimensional plane, that optimally fits the points in terms of χ2 distance, and thereby reduces the dimension. In correspondence analysis, the total variance is measured by the inertia, which is the Pearson χ2 statistic calculated on the cross-tabulation divided by the total sample size. The row profile points are projected onto the best-fitting plane, and their coordinates with respect to the principal axes of the space are the principal coordinates. Each principal axis accounts for a certain amount of the total inertia (principal inertia), usually expressed as a percentage of the total. The two different ways of mapping the columns along with the rows are the asymmetric map and the symmetric map. In the asymmetric map, the row profiles are depicted by principal coordinates and the column points by the projections of unit profile vectors onto the same space, with the drawback that the column points are spread out more than the row points. The symmetric map represents both the row points and the column points in principal coordinates, although the two sets of points are projections from different spaces.
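The three basic concepts (profiles, weights and χ2 distance) and the definition of inertia can be checked numerically with a short sketch. It deliberately stops before the projection onto principal axes, which needs a singular value decomposition, and only computes the quantities that the map is built from. The contingency table is hypothetical and the function names are illustrative.

```python
import math

def correspondence_summary(table):
    """Row masses, column masses, row profiles and total inertia of a table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    row_masses = [r / n for r in row_totals]  # weight of each row point
    col_masses = [c / n for c in col_totals]  # average (centroid) row profile
    row_profiles = [[v / r for v in row] for row, r in zip(table, row_totals)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Total inertia is the Pearson chi-square statistic divided by the sample size.
    return row_masses, col_masses, row_profiles, chi2 / n

def chi2_distance(profile_a, profile_b, col_masses):
    """Chi-square distance: squared differences are weighted by 1 / column mass."""
    return math.sqrt(sum((a - b) ** 2 / c
                         for a, b, c in zip(profile_a, profile_b, col_masses)))

# Hypothetical counts: rows are species, columns are three synonymous codons.
codon_table = [
    [40, 30, 30],
    [20, 50, 30],
    [25, 25, 50],
]
masses, centroid, profiles, inertia = correspondence_summary(codon_table)
# The same inertia is the mass-weighted squared distance of the rows to the centroid.
weighted = sum(m * chi2_distance(p, centroid, centroid) ** 2
               for m, p in zip(masses, profiles))
print(inertia, weighted)
```

The two printed values agree, which is why the inertia can be read as the spread of the row profiles around their average profile.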

2.4.4 Heat maps

The clustered heat map is a popular graphical representation that displays large amounts of data compactly in a small space and makes the coherent patterns in the data visible. Heat maps can reveal both the row and the column hierarchical cluster structure of a data matrix. A heat map is made up of a rectangular tiling in which each tile is shaded on a color scale according to the value of the corresponding element in the data matrix. Clustered heat maps have been used since 1997 (Weinstein et al., 1997) for displaying, in two dimensions, the patterns of messenger RNA, microRNA and protein expression, DNA copy number, DNA methylation, metabolite concentration (Brauer et al., 2006), drug activity etc. They have also been useful for the expression analyses of microarray data (Eisen et al., 1998) for different organisms and for various diseases (Weinstein et al., 1997; Wang et al., 2006). In heat maps displaying gene expression data, the color of each tile reflects the expression of the RNA or protein in the sample, viz., red for higher expression and either green or blue for low expression.

The rows and columns are rearranged so that similar rows and similar columns lie near each other in the display. The cluster relationships are indicated by dendrograms along the horizontal and vertical margins, and the color patches indicate the functional relationship between the genes and the samples. However, heat maps may not be able to show complex patterns of nonlinear relationship in the samples, and the bifurcation of the cluster tree has to be specified. The functional ordering of the axes, the coherent patterns generated and the meaning revealed by the clustered heat map depend on the choice of the preprocessing algorithm (type of background subtraction, data normalization, data filtering etc.), which minimizes noise while keeping the meaningful signal; the clustering algorithm (average linkage, complete linkage, or centroid-based), which decides the grouping of the data; the distance metric (Euclidean or correlation), which defines the measure of similarity; and the color scheme (linear, logarithmic, quantile, two-color, three-color etc.), which emphasizes the patterns to be visualized (Weinstein, 2008). The patterns also differ according to whether relative or absolute data are used to create a "difference" heat map. Hence different kinds of heat maps can be generated from the same experiment, each having its own visual meaning, and it is therefore important to specify the parameters when interpreting heat maps (Weinstein, 2008).
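The effect of the distance metric on which rows end up next to each other can be seen in a small sketch. It assumes that the correlation distance is defined as one minus the Pearson correlation, which is a common convention but only one of several. The three rows are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance; sensitive to the magnitude of the values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """One minus the Pearson correlation; sensitive only to the shape of the profile."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    numerator = sum(x * y for x, y in zip(da, db))
    denominator = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return 1.0 - numerator / denominator

# Hypothetical rows of a data matrix.
low_rising = [1.0, 2.0, 3.0]
high_rising = [10.0, 20.0, 30.0]  # same trend as low_rising, larger magnitude
low_falling = [3.0, 2.0, 1.0]     # similar magnitude, opposite trend

for name, metric in [("Euclidean", euclidean_distance),
                     ("correlation", correlation_distance)]:
    print(name,
          "rising vs rising:", metric(low_rising, high_rising),
          "rising vs falling:", metric(low_rising, low_falling))
```

Under the Euclidean metric the two low-valued rows are the closest pair, whereas under the correlation metric the two rising rows are, so the same data would be ordered differently on the heat map depending on the metric specified.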

2.5 χ2 test

Pearson's χ2 test is a univariate test used for performing tests of goodness of fit and tests of independence (Agresti, 2002; Thompson, 2009). The χ2 test of goodness of fit is used for evaluating whether an observed frequency distribution differs from a theoretical frequency distribution. The test of independence assesses whether paired observations on two variables, expressed in a contingency table, are independent of each other.

2.5.1 Test of goodness of fit

The χ2 test of goodness of fit is used for testing the validity of an assumed distribution. It evaluates the null hypothesis, H0, which states that the data follow the assumed distribution, against the alternative hypothesis, Ha, which states that the data do not come from the assumed distribution. To test the hypotheses, the χ2 test statistic is calculated using the equation

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

where χ2 = test statistic, which asymptotically approaches a χ2 distribution; Oi = observed frequency; Ei = expected frequency, given by the null hypothesis; n = number of possible outcomes of each event. The value of the test statistic is compared with the χ2 distribution to calculate the p-value. The p-value is the standard used for deciding whether the null hypothesis is to be rejected; it represents the probability of obtaining, by chance alone when the null hypothesis is true, a deviation of the observed frequencies from the expected frequencies at least as large as the one observed. Usually a p-value of 0.05 or less leads to the rejection of the null hypothesis, indicating that the observed data do not fit the assumed distribution.
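As a worked illustration of the goodness of fit calculation, the sketch below tests whether four synonymous codons are used equally often. The counts are hypothetical, the expected frequencies assume uniform usage, and the degrees of freedom are taken as n - 1, which holds when no parameters are estimated from the data. The p-value is obtained from the upper tail of the χ2 distribution using the standard recurrence for the regularized incomplete gamma function, so that no external library is needed.

```python
import math

def chi2_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all outcomes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_survival(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom >= x)."""
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)              # exact value for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))   # exact value for df = 1
    # Each pass of the recurrence raises the degrees of freedom by two.
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Hypothetical counts of four synonymous codons of one amino acid.
observed = [10, 20, 30, 40]
# Null hypothesis: all four codons are used equally often.
expected = [sum(observed) / len(observed)] * len(observed)

statistic = chi2_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1
p_value = chi2_survival(statistic, degrees_of_freedom)
print(statistic, degrees_of_freedom, p_value)
if p_value <= 0.05:
    print("Reject H0: codon usage deviates from the assumed distribution")
```

With these counts the p-value falls well below 0.05, so the hypothesis of equal usage would be rejected for this toy example.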

2.5.2 Test of independence

The χ2 test of independence is used to test whether two outcomes of a single observation are statistically independent. The null hypothesis, H0, is that the two outcomes are statistically independent, and the alternative hypothesis, Ha, is that they are associated. The outcomes are arranged in a two-way contingency table with r rows and c columns. The value of the test statistic is calculated by

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$

where Oi,j and Ei,j are the observed and expected frequencies in the cell at row i and column j.

The χ2 test of independence evaluates whether the variables within a contingency table are independent, i.e., not associated. The χ2 statistic is calculated by taking, for every cell, the squared difference between the observed and the expected frequency divided by the expected frequency, and summing over all cells. The observed frequency is the number of observations present in each cell. As the null hypothesis assumes that the two variables are independent of each other, the expected frequencies are calculated using the multiplication rule of probability, according to which the probability of the occurrence of two independent events X and Y is the product of the individual probabilities of X and Y. The degrees of freedom are the number of values in the calculation that are free to vary. The degrees of freedom for a contingency table are the product of the numbers obtained by subtracting 1 from the number of rows and from the number of columns, i.e., (number of rows-1) × (number of columns-1). The value of the χ2 statistic is compared with the appropriate χ2 distribution. As the degrees of freedom increase, the χ2 value required to reject the null hypothesis increases. For the χ2 test of independence, a p-value less than or equal to 0.05 is commonly interpreted as justification for rejecting the null hypothesis (that the row variable is unrelated to the column variable). The alternative hypothesis in the test of independence states that the row and column variables are associated.
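The calculation of the expected frequencies, the test statistic and the degrees of freedom can be sketched for a small contingency table. The counts are hypothetical, and the critical value 3.841 is the standard tabulated χ2 value for one degree of freedom at the 0.05 level.

```python
def expected_frequencies(table):
    """Expected count per cell under independence: row total x column total / n."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi2_independence(table):
    """Return (statistic, degrees of freedom, expected frequencies)."""
    expected = expected_frequencies(table)
    statistic = sum((o - e) ** 2 / e
                    for observed_row, expected_row in zip(table, expected)
                    for o, e in zip(observed_row, expected_row))
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom, expected

# Hypothetical counts: rows are overlapping and non-overlapping regions,
# columns are two synonymous codons.
codon_counts = [
    [30, 10],
    [10, 30],
]
statistic, degrees_of_freedom, expected = chi2_independence(codon_counts)
print(expected)
print(statistic, degrees_of_freedom)

CRITICAL_VALUE_DF1 = 3.841  # tabulated chi-square value for df = 1, alpha = 0.05
if degrees_of_freedom == 1 and statistic > CRITICAL_VALUE_DF1:
    print("Reject H0: codon choice is associated with the region")
```

Each expected count is the product of the two marginal proportions multiplied by the sample size, which is the multiplication rule for independent events applied cell by cell.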

2.5.3 Residual analysis of the χ2 test

The null hypothesis of the Pearson's χ2 test of independence, which tests the independence of the row and column variables of an I × J contingency table, indicates an association between the row and the column variables if it is rejected. A significant Pearson's χ2 test can be followed up by residual analysis to assess where the association lies. Residual analyses are usually performed with standardized residuals, each of which is the difference between the observed and expected values in a cell divided by a standard error for the difference. Standardized residuals help to find the direction and strength of the association. A large standardized residual provides evidence of association in that cell. A standardized residual with a positive value indicates that the cell was overrepresented in the actual sample compared with the expected frequency, i.e., the observed number for this category was greater than expected. A standardized residual with a negative value indicates that the cell was underrepresented in the actual sample compared with the expected frequency, i.e., the observed number for this category was smaller than expected.
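A minimal sketch of the residual calculation is given below. It assumes the form of the standardized residual in which the standard error of a cell is the square root of the expected frequency multiplied by (1 - row proportion) and (1 - column proportion). The counts are hypothetical, and the cut-off of about 2 in absolute value used to flag a cell is a common rule of thumb rather than a fixed standard.

```python
import math

def standardized_residuals(table):
    """(observed - expected) / standard error for every cell of the table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    residuals = []
    for i, row in enumerate(table):
        residual_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            standard_error = math.sqrt(
                expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n))
            residual_row.append((observed - expected) / standard_error)
        residuals.append(residual_row)
    return residuals

# Hypothetical counts: rows are overlapping and non-overlapping regions,
# columns are two synonymous codons.
codon_counts = [
    [30, 10],
    [10, 30],
]
residuals = standardized_residuals(codon_counts)
for row in residuals:
    for value in row:
        if abs(value) <= 2:
            label = "no clear deviation"
        elif value > 0:
            label = "overrepresented"
        else:
            label = "underrepresented"
        print(round(value, 3), label)
```

The sign of each residual gives the direction of the deviation in that cell and its magnitude gives the strength, so the cells that drive a significant test can be identified one by one.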
