Reading a paper about gene evolution, I see that they do phylogenetic analysis for bacteria using protein sequences. They take the method from another paper.

I suspect that amino-acid sequences are more stable than nucleotide sequences, due to the presence of synonymous substitutions. But is this stability required between closely related species? Doesn't it make the analysis less powerful? Does it make it more reliable? In other words, what is the advantage of using amino-acid sequences rather than nucleotide sequences for phylogenetic analysis?

2 Answers

In general, many sequence alignment programs can use multiple substitution models, distinguishing between nucleotides, amino acids, and codons. A protein sequence has functional information that is not directly visible in the nucleotide sequence.

The papers you link deal with horizontal gene transfer (HGT), where a gene is passed to a more distant organism. Different species have different codon usage biases, i.e. the translation efficiency differs between codons. On one hand, this means that HGT is more likely to occur between species with similar codon usage. On the other hand, "codon usage of horizontally transferred genes approaches the host usage over time." Thus, at the nucleotide level, the phylogenetic signal gets lost because of the evolutionary pressure on translation efficiency, while at the protein level there is more conservation.

Because several codons can code for the same amino acid, the amino-acid sequence is usually more conserved than the nucleotide sequence.
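As a rough illustration of this point, here is a minimal Python sketch. It assumes the standard genetic code, and the two short sequences (`seq_a`, `seq_b`) are made up for the example rather than taken from any real gene. They differ at several nucleotide positions yet translate to the same peptide.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first, second, third base).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA string, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical toy sequences: same peptide, different synonymous codons.
seq_a = "ATGCTGGCCAAAGGT"
seq_b = "ATGTTAGCTAAGGGC"

nucleotide_identity = identity(seq_a, seq_b)
protein_identity = identity(translate(seq_a), translate(seq_b))

print(f"nucleotide identity: {nucleotide_identity:.2f}")
print(f"protein identity:    {protein_identity:.2f}")
```

In this toy case the nucleotide alignment shows five differences while the protein alignment shows none, which is exactly the information that is lost (or the noise that is filtered out, depending on the evolutionary scale) when working with amino acids.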

For small-scale studies, the higher variability of nucleotide data provides useful characters for establishing relationships between closely related organisms that might not be differentiated at the amino-acid level.

Over long evolutionary distances, the nucleotide signal tends to be erased by multiple substitutions at the same site. A more annoying feature is that genomes tend to have a preferred nucleotide composition. A site undergoing substitutions will have an increased probability of displaying the preferred nucleotide, particularly if the substitution has no effect at the amino-acid level (synonymous substitution). This mainly affects third codon positions because, as you can see in the genetic code, this is where most codon families vary. Distant species may happen to share the same composition preference. The sites that are free to vary will then tend to display the same nucleotide in both species as the number of substitutions increases. This can induce phylogeny reconstruction errors, especially if the models of nucleotide evolution are not sophisticated enough.
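The claim about third codon positions can be checked directly against the genetic code. The sketch below is only an illustration, assuming the standard genetic code and treating all single-nucleotide changes as equally likely (which real genomes do not do). For each codon position, it counts which fraction of single-nucleotide changes in sense codons leaves the amino acid unchanged.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first, second, third base).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_fraction_by_position():
    """Return, for codon positions 1-3, the fraction of single-base changes
    in sense codons that do not change the encoded amino acid."""
    synonymous = [0, 0, 0]
    total = [0, 0, 0]
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # skip stop codons as starting points
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                total[pos] += 1
                # A change to a stop codon counts as non-synonymous.
                if CODON_TABLE[mutant] == aa:
                    synonymous[pos] += 1
    return [s / t for s, t in zip(synonymous, total)]


fractions = synonymous_fraction_by_position()
for position, fraction in enumerate(fractions, start=1):
    print(f"codon position {position}: {fraction:.1%} of changes are synonymous")
```

Most third-position changes are synonymous, a few first-position changes are (in the leucine and arginine families), and no second-position change is. This is why compositional bias accumulates mostly at third positions.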

The larger the evolutionary scale, the higher the chance that such misleading features will appear in the nucleotide data. This makes amino-acid data better suited to large-scale studies. Ultimately, though, one can hope that better models will make it possible to use the signal present in the nucleotide data to its fullest without being misled too much. Using amino-acid data amounts to discarding some of the information present.

It should be noted that some methods use codon evolution models instead of nucleotide or amino-acid models: all the signal is kept, but the model can incorporate the knowledge that some codons are more likely to change into one another because they are synonymous.