This page answers the questions we receive most frequently at CATH and is the best place to start looking if you have a question about the CATH resource.

Please note that these documentation pages are still in their infancy, so some questions do not yet have answers. Where a question is listed without an answer, we know it is important and will document the answer as soon as we can.

The CATH database is a hierarchical domain classification of protein structures in the Protein Data Bank. Protein structures are classified using a combination of automated and manual procedures. There are four major levels in this hierarchy: Class, Architecture, Topology and Homologous superfamily.

For any given structure classified in the database, CATH gives you information on the structure and function of that protein. The evolutionary relationships involving the structure of interest and other proteins in the database can also be determined.

CATH also gives an overall view of the known protein structure universe to date. You can find which folds and superfamilies are the most populated, for example, and which structures are rare in nature.

Maintaining the CATH database is very much a team effort. Most of the members of the Orengo group have helped with the manual curation of the database and some have developed algorithms to aid with the automated aspects of maintaining and updating it.

Ian Sillitoe is the CATH Manager. Tony Lewis was a Research Assistant in the group and was heavily involved in the development of the new CATH update protocol as part of CATH v3.0. He is still involved in maintaining and updating CATH in an ongoing consultancy capacity. Natalie Dawson is the CATH curator and a Research Associate in the group. Sayoni Das is a Research Assistant in the group.

CATH is a tree-like, hierarchical classification that starts off at the tree “trunk” by clustering protein domains into broad categories (e.g. C, or class, where domains are clustered solely based on their general secondary structure content). As the hierarchy moves away from the “trunk” to the “branches”, more stringent clustering criteria are applied to provide clusters of domains with finer granularity of similarity.

| Depth | Letter | Name | Clustering criteria |
| --- | --- | --- | --- |
| 1 | C | Class | Secondary structure content |
| 2 | A | Architecture | General spatial arrangement of secondary structures |
| 3 | T | Topology | Spatial arrangement and connectivity of secondary structures (fold) |
| 4 | H | Homologous Superfamily | Manual curation of evidence of evolutionary relationship (at least two criteria from sequence/structure/function must be observed) |
| 5 | S | Sequence Family (S35) | >= 35% sequence similarity |
| 6 | O | Orthologous Family (S60) * | >= 60% sequence similarity |
| 7 | L | “Like” domain (S95) * | >= 95% sequence similarity |
| 8 | I | Identical domain (S100) | 100% sequence similarity |
| 9 | D | Domain counter | Unique domains |

* We are aware that the names “Orthologous” and “Like” are by no means perfect descriptions of the clustering criteria that they represent. However, we find it useful to provide some kind of label for these clusters and (quite frankly) these are the best we could come up with.

CATH is a hierarchical classification that clusters protein structures at differing levels of similarity. The first level, Class, clusters proteins based on their general secondary structure content and is represented by the first number in the CATH code (the 'C' column in the table below).

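| Domain | CATH code | C | A | T | H | S | O | L | I | D |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1nr3A00 | 3.30.1190.10.1.1.1.1.1 | 3 | 30 | 1190 | 10 | 1 | 1 | 1 | 1 | 1 |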

A more detailed explanation of the numbering used in the sequence clusters (SOLID levels) can be found in this blog entry.
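
As an illustration of how the dotted CATH code lines up with the level letters in the table above, here is a minimal Python sketch. The function names `parse_cath_code` and `truncate_cath_code` are hypothetical helpers written for this example and are not part of any CATH software; the sketch assumes only that a code is a dot-separated list of integers in C, A, T, H, S, O, L, I, D order.

```python
from typing import Dict

# Level letters in hierarchy order, as listed in the table above.
LEVELS = "CATHSOLID"


def parse_cath_code(code: str) -> Dict[str, int]:
    """Map each number in a dotted CATH code to its level letter."""
    parts = code.split(".")
    if not 1 <= len(parts) <= len(LEVELS):
        raise ValueError(f"expected 1 to {len(LEVELS)} numbers, got {len(parts)}")
    # int() raises ValueError if any part is not a whole number.
    return {letter: int(part) for letter, part in zip(LEVELS, parts)}


def truncate_cath_code(code: str, depth: int) -> str:
    """Cut a CATH code at the given depth (1 = Class ... 9 = Domain counter)."""
    parts = code.split(".")
    if not 1 <= depth <= len(parts):
        raise ValueError(f"depth must be between 1 and {len(parts)}")
    return ".".join(parts[:depth])


levels = parse_cath_code("3.30.1190.10.1.1.1.1.1")
superfamily = truncate_cath_code("3.30.1190.10.1.1.1.1.1", 4)
print(levels["T"])   # Topology number of domain 1nr3A00
print(superfamily)   # code cut at the Homologous Superfamily level
```

Truncating at depth 4 gives the part of the code shared by every domain in the same homologous superfamily, while shorter codes are accepted by `parse_cath_code` so that a bare superfamily code can be parsed as well.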

For a particular CATH version, for example 3.2.0, the first number indicates the most recent major CATH database release (i.e. version 3.0.0), whilst the second number indicates a minor release. Version 3.2.0 is therefore the second update of the major CATH release 3.0.0. The third number is used for internal purposes.
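
The version scheme above can be sketched in a few lines of Python. `CathVersion` and `parse_cath_version` are hypothetical names used only for this illustration, and the sketch assumes a version is always written as three dot-separated integers.

```python
from typing import NamedTuple


class CathVersion(NamedTuple):
    major: int     # most recent major database release
    minor: int     # minor release (update) of that major release
    internal: int  # used for internal purposes


def parse_cath_version(version: str) -> CathVersion:
    """Split a version string such as '3.2.0' into its three numbers."""
    parts = version.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected three numbers, got {version!r}")
    major, minor, internal = (int(part) for part in parts)
    return CathVersion(major, minor, internal)


version = parse_cath_version("3.2.0")
# The major release that this version updates.
base_release = f"{version.major}.0.0"
print(version.minor, base_release)
```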

A domain identifier is assigned to every classified domain in the CATH database. It consists of a 4-character PDB code, for example 1kcm, followed by the chain name, denoted by a letter, and a two-digit domain number. If there is only one chain, it will be assigned the letter A in the same way as the first chain in a multi-chain structure. If there is only one domain in the chain then 00 is used for the domain number. The structure 1kcm has only a single domain in a single chain; the domain identifier will therefore be 1kcmA00.

The two-digit domain number was introduced because protein structures with more than nine domains began to appear. As experimental techniques for solving crystal structures have improved, more structures with a large number of separate domains have been determined.
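
To make the domain identifier format concrete, the following Python sketch splits an identifier into its three parts. `DomainId` and `parse_domain_id` are hypothetical names for this example. The pattern accepts any single letter or digit as the chain name, which is an assumption that is slightly more permissive than the description above.

```python
import re
from typing import NamedTuple


class DomainId(NamedTuple):
    pdb_code: str       # 4-character PDB code, e.g. '1kcm'
    chain: str          # chain name, e.g. 'A'
    domain_number: int  # two-digit domain number; 0 means the only domain in the chain


# Assumption: the chain name is a single alphanumeric character.
_DOMAIN_ID = re.compile(r"([0-9A-Za-z]{4})([0-9A-Za-z])([0-9]{2})")


def parse_domain_id(domain_id: str) -> DomainId:
    """Split a CATH domain identifier such as '1kcmA00' into its parts."""
    match = _DOMAIN_ID.fullmatch(domain_id)
    if match is None:
        raise ValueError(f"not a valid domain identifier: {domain_id!r}")
    pdb_code, chain, number = match.groups()
    return DomainId(pdb_code, chain, int(number))


domain = parse_domain_id("1kcmA00")
print(domain.pdb_code, domain.chain, domain.domain_number)
```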

The answer to this is to use the CATH webservices. However, the webservices are undergoing a major revamp and are still in testing. We will update this section when they move to production.