
KAT provides a suite of tools that, through the use of k-mer counts, help the user address or identify issues such as determining sequencing completeness for assembly, assessing sequencing bias, identifying contaminants, validating genomic assemblies and filtering content. KAT is geared primarily to work with high-coverage genomic reads from Illumina devices, although it can work with any FASTA or FASTQ sequence file.

At its core, KAT exploits the concept of k-mer spectra (histograms plotting the number of distinct k-mers at each frequency). By studying the properties of k-mer spectra it’s possible to discover important information about data quality (level of errors, sequencing biases, completeness of sequencing coverage and potential contamination) and genomic complexity (size, karyotype, levels of heterozygosity and repeat content). Further information can be gleaned through pairwise comparison of spectra, making KAT useful for WGS library comparisons and assembly validation.
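To make the idea of a k-mer spectrum concrete, here is a minimal Python sketch that counts k-mers in a couple of toy reads and then tallies how many distinct k-mers occur at each frequency. It is an illustration only, not KAT's implementation: the function names `count_kmers` and `kmer_spectrum` and the example reads are made up for this sketch, it holds everything in an in-memory dictionary, and it does not merge reverse-complement (canonical) k-mers.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer (substring of length k) across all sequences.

    Hypothetical helper for illustration; real tools use far more
    memory-efficient counting than a Python Counter.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k along the sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def kmer_spectrum(counts):
    """Map each frequency to the number of distinct k-mers seen that often."""
    return dict(sorted(Counter(counts.values()).items()))


# Two toy "reads" (invented data, far too small to be biologically meaningful).
reads = ["ACGTACGT", "CGTACGTA"]
counts = count_kmers(reads, k=4)
spectrum = kmer_spectrum(counts)
print(spectrum)  # {2: 2, 3: 2}
```

Here two distinct 4-mers occur twice and two occur three times. On real high-coverage data the same histogram would typically show a large number of low-frequency k-mers, many of them caused by sequencing errors, alongside one or more peaks near the sequencing coverage.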

K-mer counting itself, a critical element of all KAT tools, is performed by an integrated and modified version of Jellyfish2’s counting method (http://www.genome.umd.edu/jellyfish.html). We selected Jellyfish for this task because it supports large values of k and is one of the fastest k-mer counting programs currently available.

KAT supports Unix, Linux and Mac systems. Windows, with something like Cygwin, may work but hasn’t been
tested. KAT needs a minimum of 8GB RAM, which is enough to process small to medium-sized datasets.
Large datasets will require more RAM (potentially a lot more); the actual amount of
memory required depends on the size of the genomes to be processed, the k-mer size
selected and the size of your datasets.

We owe a big acknowledgment to all the TGAC staff who have been bored eternally
with k-mers; you have all been incredibly patient and supportive with us.

Thanks to Mario Caccamo, Sarah Ayling, Federica Di Palma and David Swarbreck for
all the support, feedback and encouragement.

Thanks to Richard Leggett, Daniel Zerbino and Zamin Iqbal for all the interesting
discussions, comments and input.

Thanks to Dan Sargent for the use of his P. micrantha datasets for tests, and
for their inclusion as figures in this document.

Thanks to all the early adopters of KAT who provided invaluable feedback
on the tool in its early stages: Paul Bailey, Jose De Vega, Rocio Enriquez-Gasca,
Marco Ferrarini and Dharanya Sampath. Thanks also, more recently, to those from further afield
who have contributed on GitHub.

A big thanks to the author of Jellyfish, Guillaume Marcais. Jellyfish is a fantastic
piece of software and is critical to enabling KAT to do what it does in an efficient
and timely fashion.

Last but not least, a very special thanks to the lab guys in their white coats, trying to make
sense of all our comments, giving us better data each day and trying to get into
our heads all the complex explanations for the biases and extra variability we
were finding day after day.