With the command-line interface of CLC Assembly Cell, you can easily include its functionality in scripts and other next-generation sequencing workflows. It is easy to install on a desktop computer or a larger compute cluster.

The CLC Assembly Cell is intended for molecular biology applications. This product is not intended for the diagnosis, prevention, or treatment of a disease.

Benchmarking

We compared the performance of the industry-standard HGAP1 pipeline, run on a high-performance computer, to that of a De Novo Assembly workflow in CLC Assembly Cell. Please note that our De Novo Assembly pipeline was run on a standard laptop for this comparison.

Performance

The latest version of CLC Assembly Cell introduces tools for error correction and de novo assembly of raw PacBio reads. High-quality assemblies can be generated in a fraction of the time needed by leading alternatives. CLC Assembly Cell consumes less than 10 percent of the memory used by alternative solutions while completing the assembly faster.

Applications

CLC Assembly Cell is accelerated through advanced algorithm implementations that use SIMD instructions to parallelize compute-intensive parts of the algorithms, making the software one of the fastest and most accurate packages for NGS data analysis on the market.

Features in CLC Assembly Cell

Read mapping
• Read mapping of Illumina, Pacific Biosciences, Ion Torrent, SOLiD, and 454 sequencing data
• Native support for color space data
• Support for both short-read and long-read mapping
• Support for both gapped and ungapped alignments when doing short-read mapping
• Support for mapping of paired-end reads

De novo assembly
• De novo assembly of Illumina, Pacific Biosciences, Ion Torrent, and 454 sequencing data
• Support for both short-read and long-read assembly
• Support for de novo assembly of paired-end data
• Building scaffolds from paired-end data

Other analyses
• Fast analysis of raw data, including reporting
• Option of joining data from different sources into the same analysis (including data generated by different sequencing technologies)
• Extraction of data from parts of an assembly, for example extracting contigs and reads from a region of interest, or excluding data from a specific sequencing lane suspected of not being of acceptable quality
• Removal of duplicate reads
• Quality trimming
• Variant detection (simple SNP detection)
• Support for the input file formats FASTA, SFF, GenBank, csfasta, and scarf
• A number of output options, including tables with assembly information (see PacBio benchmark data)

Services

Cluster support

Multiple CLC Assembly Cell instances can be run in parallel on a multi-node cluster.
In practice, almost every cluster is set up differently, so we do not provide an off-the-shelf solution that is guaranteed to work on your compute cluster. Instead, we provide an example Perl script that is free to download, free to use, and free to modify.

Job node distribution for CLC Assembly Cell

The script cluster_schedule distributes the jobs defined in the schedule file across a number of nodes. An example could be the distribution of CLC Assembly Cell reference assembly jobs. This requires an installation of CLC Assembly Cell on each node, and the best performance is achieved if the reference sequence is stored locally on each node.

Each job is a list of commands that cluster_schedule will run in order on one node. If one of the commands in a job fails (its exit code is not zero), no further commands in that job are executed and the job is considered failed. If all commands in a job complete successfully (all exit codes are zero), the job is a success.
The nodes the jobs are run on can be defined on the command line or in the schedule file. Nodes defined on the command line replace all nodes defined in the schedule file.
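The script itself is written in Perl; as a rough sketch of the job semantics described above (commands run in order, the first nonzero exit code fails the job and skips the rest), the core loop could look like the following Python. The `run_job` name and the in-line schedule are hypothetical illustrations, not taken from the actual cluster_schedule script.

```python
import subprocess

def run_job(commands, node=None):
    """Run one job's commands in order; stop at the first failure.

    cluster_schedule executes each command on its node via ssh; in this
    sketch, when node is None, commands run locally so the example is
    self-contained.
    """
    for cmd in commands:
        argv = ["ssh", node, cmd] if node else ["sh", "-c", cmd]
        if subprocess.run(argv).returncode != 0:
            return False   # nonzero exit: skip remaining commands, job failed
    return True            # all exit codes were zero: job succeeded

# Hypothetical schedule: each job is an ordered list of shell commands.
jobs = [
    ["echo prepare", "echo assemble"],        # all succeed -> job succeeds
    ["true", "false", "echo never reached"],  # stops at `false` -> job fails
]
print([run_job(job) for job in jobs])  # [True, False]
```

In the real setup, each command would be prefixed with `ssh <node>` as in the `node` branch above, which is why passwordless ssh authentication to every node is required.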

Each job is run on one node, and each command is executed on that node using ssh.
Therefore, to use cluster_schedule, make sure that all nodes are set up for automatic (passwordless) ssh authentication.