Abstract

Next-generation sequencing has made it possible to perform differential gene expression
studies in non-model organisms. For these studies, the need for a reference genome
is circumvented by performing de novo assembly on the RNA-seq data. However, transcriptome assembly produces a multitude
of contigs, which must be clustered into genes prior to differential gene expression
detection. Here we present Corset, a method that hierarchically clusters contigs using
shared reads and expression, then summarizes read counts to clusters, ready for statistical
testing. Using a range of metrics, we demonstrate that Corset outperforms alternative
methods. Corset is available from https://code.google.com/p/corset-project/.
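
As a rough illustration of clustering contigs by shared reads and then summarizing read counts per cluster, the following is a minimal sketch and not Corset's actual implementation. The distance definition, the `threshold` value, the greedy merging scheme, and all function names are assumptions made for this example, and the expression-based part of the clustering is omitted entirely.

```python
from collections import defaultdict
from itertools import combinations


def shared_read_distance(reads_a, reads_b):
    """Assumed distance: 1 - shared reads / reads in the smaller set."""
    smaller = min(len(reads_a), len(reads_b))
    if smaller == 0:
        return 1.0
    return 1.0 - len(reads_a & reads_b) / smaller


def cluster_contigs(read_to_contigs, threshold=0.3):
    """Greedy agglomerative clustering of contigs that share reads.

    `threshold` is a hypothetical cut-off chosen for illustration.
    """
    # Invert the multi-mapping: contig -> set of reads aligned to it.
    contig_reads = defaultdict(set)
    for read, contigs in read_to_contigs.items():
        for contig in contigs:
            contig_reads[contig].add(read)

    # Start with one cluster per contig, keyed by its set of contig names.
    clusters = {frozenset([c]): set(r) for c, r in contig_reads.items()}

    while len(clusters) > 1:
        # Find the closest pair of clusters (sorted for determinism).
        keys = sorted(clusters, key=sorted)
        (a, b), dist = min(
            (((a, b), shared_read_distance(clusters[a], clusters[b]))
             for a, b in combinations(keys, 2)),
            key=lambda item: item[1],
        )
        if dist >= threshold:
            break  # no remaining pair is close enough to merge
        # Merge the pair; the merged cluster holds the union of reads.
        clusters[a | b] = clusters.pop(a) | clusters.pop(b)
    return clusters


def summarize_counts(clusters):
    """Count each read once per cluster it maps to."""
    return {"+".join(sorted(c)): len(reads) for c, reads in clusters.items()}


# Toy input: reads r1 and r2 multi-map to contigs c1 and c2.
read_to_contigs = {
    "r1": ["c1", "c2"],
    "r2": ["c1", "c2"],
    "r3": ["c1"],
    "r4": ["c3"],
    "r5": ["c3"],
}
counts = summarize_counts(cluster_contigs(read_to_contigs))
print(counts)
```

In this toy input, c1 and c2 share all of c2's reads and are merged, while c3 shares none and stays separate. The resulting per-cluster counts are the kind of table that would then be passed to a count-based statistical test.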