Bioinformatics Research Group

Computer Science, The University of Hong Kong

Project Description

MetaCluster is an unsupervised binning method for metagenomic
sequences. Existing binning methods based on sequence similarity and
sequence composition markers rely heavily on the reference genomes of
known microorganisms and phylogenetic markers. MetaCluster, by
contrast, is an integrated binning method based on an unsupervised
top-down separation and bottom-up merging strategy. It can bin
metagenomic sequencing datasets with complex mixtures of species
abundance ratios, ranging from exactly equal to extremely unbalanced,
with consistently higher accuracy than other recently reported
methods.
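
The top-down separation and bottom-up merging strategy described above
can be illustrated with a minimal sketch. It is not MetaCluster's actual
implementation: the dinucleotide (k = 2) frequency profiles, the
Euclidean distance, k-means for the separation step, centroid-based
merging, and every function and parameter name (`kmer_profile`,
`top_down_split`, `bottom_up_merge`, `bin_reads`, `n_clusters`,
`threshold`) are assumptions made purely for illustration.

```python
import math
from itertools import product


def kmer_profile(read, k=2):
    """Return the normalized k-mer frequency vector of one uppercase read."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    for start in range(len(read) - k + 1):
        slot = index.get(read[start:start + k])  # None for k-mers with N etc.
        if slot is not None:
            counts[slot] += 1
    total = sum(counts) or 1  # avoid division by zero for very short reads
    return [c / total for c in counts]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def top_down_split(profiles, n_clusters, n_iter=10):
    """Separation step (assumed): over-segment with k-means.

    Centers are seeded by farthest-point selection so the result is
    deterministic. Returns clusters as lists of read indices.
    """
    centers = [profiles[0]]
    while len(centers) < n_clusters:
        centers.append(
            max(profiles, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    groups = []
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for i, p in enumerate(profiles):
            nearest = min(range(len(centers)),
                          key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(i)
        # Recompute each center; an empty cluster keeps its old center.
        centers = [centroid([profiles[i] for i in g]) if g else centers[c]
                   for c, g in enumerate(groups)]
    return [g for g in groups if g]


def bottom_up_merge(profiles, groups, threshold):
    """Merging step (assumed): join the two closest clusters repeatedly
    until no pair of centroids is within `threshold` of each other."""
    groups = [list(g) for g in groups]
    while len(groups) > 1:
        cents = [centroid([profiles[i] for i in g]) for g in groups]
        d, a, b = min((math.dist(cents[a], cents[b]), a, b)
                      for a in range(len(groups))
                      for b in range(a + 1, len(groups)))
        if d > threshold:
            break
        groups[a].extend(groups.pop(b))  # b > a, so index a stays valid
    return groups


def bin_reads(reads, n_clusters=4, threshold=0.25, k=2):
    """Bin reads without reference genomes; returns lists of read indices."""
    profiles = [kmer_profile(r, k) for r in reads]
    groups = top_down_split(profiles, n_clusters)
    return bottom_up_merge(profiles, groups, threshold)
```

Calling `bin_reads(reads)` on a list of read strings returns one list of
read indices per bin. Over-segmenting first and merging afterwards lets
the data, rather than a fixed parameter, decide the final number of
bins. Real metagenomic reads would likely need longer k-mers and a
tuned threshold.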

Limited by laboratory techniques, traditional microorganism
research usually focuses on a single species. This
significantly limits in-depth analysis of the intricate
biological processes within complex microbial communities.
With the rapid development of genome sequencing techniques,
traditional research methods based on the isolation and
cultivation of microorganisms are gradually being replaced by
metagenomics,
also known as environmental genomics. The first step, which is
also the major bottleneck of metagenomic data analysis, is the
identification and taxonomic characterization of the DNA
fragments (reads) resulting from sequencing a sample of mixed
species. This step is usually referred to as “binning”.
Existing binning methods based on sequence similarity and
sequence composition markers rely heavily on the reference
genomes of known microorganisms and phylogenetic markers.
Due to the limited availability of reference genomes and the bias
and instability of markers, these methods may not be applicable
in all cases. Few unsupervised binning methods have been reported,
and their unsupervised nature makes it extremely difficult
to annotate the resulting clusters with taxonomic labels.
In this paper, we present MetaCluster 2.0, an unsupervised
binning method that can bin metagenomic sequencing
datasets with high accuracy, identify unknown genomes,
and annotate them with proper taxonomic labels. MetaCluster 2.0
also runs at least 30 times faster than existing binning
algorithms.