Thursday, February 14, 2013

Marine Klee-diagrams (1)

Paul Klee - Ancient Sound

I have touched on the topic before. While methods for phylogenetic visualization have typically been
developed for small phylogenetic trees, new methods are required as efforts to
resolve the tree of life proceed and sequence datasets grow. Many common software tools for visualising
small phylogenetic trees already exist.
These tools lay out trees in a two-dimensional Euclidean space and are useful
for visualising trees of up to a few hundred nodes. Some software tools (e.g. FigTree, HyperTree) have increased the number of depictable nodes by using 2D hyperbolic
space, or by visualising trees in 3D hyperbolic space (Walrus).

However, the larger datasets become,
the more difficult it will be to provide a panoramic view of the entire
dataset. Indicator vectors of sequences visualized in so-called Klee-diagrams
have the potential to overcome this limitation of tree methods. I would like to show a few examples here on the blog to demonstrate the power of Klee-diagrams for displaying larger DNA Barcode datasets.

For my Klee-diagrams I utilized a mathematical approach to comparative analysis of nucleotide sequences using digital transformation in vector space. Essentially, DNA data are transformed into vectors. A distinguishing vector that is indicative of a specific group of organisms can then be calculated from the transformed DNA sequence information (the number n of members in such a group can be defined). These so-called indicator vectors can be constructed for different taxonomic levels or other interesting groupings. All this is
implemented in a MATLAB routine available at Mark Stoeckle's barcode site.
Matrices of correlations among the indicator vectors can be displayed as
false-color maps (Klee-diagrams) using the MATLAB graph functions. Note that the input could also be any other set of values, e.g. a distance matrix.
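
To make the idea a bit more concrete, here is a minimal Python sketch of the general pipeline: sequences are turned into vectors, one vector is built per group, and the groups are correlated. It is not the MATLAB routine mentioned above. I am assuming a simple one-hot encoding and I use the plain group average as a simplified stand-in for a real indicator vector. The function names and the toy sequences are made up for illustration.

```python
from collections import defaultdict
from math import sqrt

BASES = "ACGT"


def encode(seq):
    """One-hot encode an aligned DNA sequence: four numbers (A, C, G, T) per site."""
    vec = []
    for base in seq.upper():
        # Gaps or ambiguous bases simply become four zeros.
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def group_vectors(seqs, labels):
    """Average the encoded sequences of each group.

    Assumption: the group mean is only a simplified stand-in
    for a properly calculated indicator vector.
    """
    grouped = defaultdict(list)
    for seq, label in zip(seqs, labels):
        grouped[label].append(encode(seq))
    return {
        label: [sum(column) / len(vecs) for column in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def pearson(x, y):
    """Pearson correlation of two equally long vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denom


def correlation_matrix(vectors, order):
    """Correlate every group vector with every other, in the given order."""
    return [[pearson(vectors[a], vectors[b]) for b in order] for a in order]


# Toy alignment: two hypothetical species with two sequences each.
seqs = ["ACGTAC", "ACGTAT", "TTGCAA", "TTGCAG"]
labels = ["sp1", "sp1", "sp2", "sp2"]

vectors = group_vectors(seqs, labels)
order = ["sp1", "sp2"]
matrix = correlation_matrix(vectors, order)
```

The resulting `matrix` is what would be drawn as a false-color map, with the diagonal showing each group correlated with itself.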

While the order of sequences in an alignment does not affect the actual
calculations, it is useful for the resulting Klee-diagram to
arrange the sequences so that their order approximates evolutionary relationships. Therefore, I ordered the data according to the topology of neighbor-joining trees (here constructed with MEGA 5). The re-ordering of my alignments was conducted
with a customized Tree Parser routine.
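
The re-ordering step can be sketched roughly as follows. This is not my Tree Parser routine, just a small Python illustration under the assumption that the tree is available as a Newick string with unquoted leaf names. The tree, the sequence names, and the function names are hypothetical.

```python
import re


def leaf_order(newick):
    """Return leaf names in the left-to-right order they appear in a Newick string.

    A leaf name directly follows '(' or ','. Labels of internal nodes follow ')'
    and are therefore skipped. Quoted names are not handled in this sketch.
    """
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick)


def reorder(records, newick):
    """Sort (name, sequence) records so they follow the leaf order of the tree."""
    rank = {name: i for i, name in enumerate(leaf_order(newick))}
    return sorted(records, key=lambda record: rank[record[0]])


# Hypothetical tree and alignment records.
tree = "((sp3:0.1,sp1:0.2):0.05,(sp2:0.3,sp4:0.1):0.02);"
records = [("sp1", "ACGT"), ("sp2", "TTGC"), ("sp3", "ACGA"), ("sp4", "TTGA")]

ordered = reorder(records, tree)
```

With the records in tree order, closely related groups end up next to each other, so their high correlations form blocks along the diagonal of the diagram.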

Over the last couple of years I dedicated a large amount of my time to marine DNA Barcoding, so it seems logical to use some of the data we collected over 4 years to showcase the Klee-diagrams. By the way, all figure captions are hyperlinked and the figures are available via figshare.

The figure shows a Klee-diagram that was constructed using marine COI barcodes
publicly available on BOLD. Blocks of high
correlation on the diagonal reflect affinities within groups of species,
corresponding to taxonomic divisions.

Major marine groups are clearly separated in the diagram. While COI
usually fails to resolve intermediate taxonomy, it performs surprisingly well at
resolving the major marine phyla in this dataset. Rapidly evolving sites appear
saturated, while more constrained sites are sufficiently variable to be
phylogenetically useful. Thus, it could be argued that the levels of divergence
examined here fall, for the most part, into windows
in which rapidly evolving sites are already saturated but slowly evolving sites are
variable enough to provide phylogenetic signal on two levels.

Klee-diagrams utilizing only one gene fragment cannot replace in-depth
phylogenetic multi-gene analysis, but it is conceivable that heat-map based
visualization can overcome the inadequacies of large-scale trees.
Topologies generated through complex multi-gene algorithms could be translated
into such diagrams as well.

An advantage of the
method is its scalability. The figure below depicts a comparison
of two groups of marine invertebrates – echinoderms and polychaetes – based on
the COI gene. DNA Barcoding has been shown to be an effective, accurate and
useful method of species diagnosis for all five classes of Echinodermata. In addition, our Klee-diagram reveals
discontinuities corresponding to higher-level taxonomic divisions (left diagram). Furthermore,
some areas of high correlation are indicative of species groups that exhibit
low barcode divergence due to rather recent speciation events.

The diagram on the right shows that COI is not able to resolve the major groups
within the Polychaeta. Traditionally, 18S rRNA has been used to provide
phylogenies that resolve the divisions within the polychaetes. Many species thought to have broad distributions have turned out to be complexes of allied species, which often reflects the
limitations of conventional taxonomy rather than actual cosmopolitanism. Also, polychaetes
in general are thought to be paraphyletic, and the lack of distinctness in the
diagram might reflect an overall unresolved taxonomy. However, it needs to be
pointed out that the method used to calculate the indicator vectors is based on
the rather arbitrary grouping by species identifications which could mask true
diversity patterns in some cases.