
Welcome to the CD-HIT Project Main Page

News (September 2009): The CD-HIT web server is now available.
Use it to run cd-hit or to download pre-calculated clusters.

CD-HIT stands for Cluster Database at High Identity with Tolerance.
The program (cd-hit) takes a FASTA-format sequence database as input and produces a set of 'non-redundant' (nr) representative sequences as output.
In addition, cd-hit outputs a cluster file that lists, for each nr representative, the member sequences ('groupies') grouped under it.
The idea is to reduce the overall size of the database without removing sequence information, by removing only 'redundant' (highly similar) sequences.
This is why the resulting database is called non-redundant (nr).
Essentially, cd-hit produces a set of closely related protein families from a given FASTA sequence database.

CD-HIT uses a 'longest sequence first' list removal algorithm: sequences are processed from longest to shortest, and any sequence whose identity to an already retained representative is above a chosen threshold is removed as redundant.
Additionally, the algorithm implements a very fast heuristic to find high-identity segments between sequences, and so can avoid many costly full alignments.
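
The 'longest sequence first' idea can be illustrated with a minimal Python sketch. This is not the cd-hit implementation: the function names (identity, greedy_cluster), the 0.9 default threshold, and the use of difflib as a stand-in similarity measure are all assumptions made for illustration, and the fast heuristic mentioned above is not modelled at all.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Stand-in similarity (an assumption, not cd-hit's alignment):
    characters in difflib's matching blocks / length of the shorter sequence."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs, threshold=0.9):
    """Greedy longest-first clustering.

    seqs: dict mapping sequence name -> sequence string.
    Returns a dict mapping each representative name -> list of member names
    (the representative itself is always the first member).
    """
    # Process sequences from longest to shortest.
    order = sorted(seqs, key=lambda name: len(seqs[name]), reverse=True)
    clusters = {}
    for name in order:
        for rep in clusters:
            # Join the first representative that is similar enough.
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No representative matched: this sequence starts a new cluster.
            clusters[name] = [name]
    return clusters
```

In this sketch the keys of the returned dict play the role of the nr representative set, and the member lists play the role of the cluster file. Comparing every sequence against every representative, as done here, is exactly the cost that a fast pre-filter is meant to reduce.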

With recent developments, the cd-hit package now offers new programs for clustering DNA sequences and for comparing two databases. It also has many new options for controlling clustering.

Bugs

There are a number of outstanding bugs in the current implementation. We are always looking for hard-working and enthusiastic volunteers (people like Luc Ducazu) to shoot these problems down.

Sub Projects

The CD-HIT project provides a number of opportunities for interesting research activities. If one of these sub-projects interests you, why not join up and take part? We are especially keen to work closely with bioinformatics MSc students on their MSc projects.