An atlas of the thioredoxin
fold class reveals the complexity of function-enabling
adaptations

Background
The thioredoxin (Trx) fold class is huge and diverse.
Assessment of the variation in the catalytic machinery of
Trx fold proteins is essential for understanding their
functional diversity and for predicting the functions of
the many uncharacterized members of the fold class.

Methodology/Principal Findings
The proteins of the Trx fold class retain common features,
including variations on a dithiol CxxC active site motif,
that lead to delivery of function. We use protein
similarity networks to guide an analysis of how
structural and sequence motifs track with catalytic
function and taxonomic categories for 4,082 representative
sequences spanning the known superfamilies of the Trx fold.
Domain structure in the fold class is varied and modular,
with 2.8% of sequences containing more than one Trx fold
domain. Most member proteins are bacterial. The fold class
exhibits many modifications to the CxxC active site motif
— only 56.8% of proteins have both cysteines, and no
functional groupings have absolute conservation of the
expected catalytic motif. Only a small fraction of Trx
fold sequences have been functionally characterized.

Conclusions/Significance
This work provides a global view of the complex
distribution of domains and catalytic machinery throughout
the fold class, showing that each superfamily contains
remnants of the CxxC active site. The unifying context
provided by this work can guide the comparison of members
of different Trx fold superfamilies to gain insight into
their structure-function relationships, illustrated here
with the thioredoxins and peroxiredoxins.

A. Structure-similarity network, containing 159
structures that are a maximum of 60% identical (by sequence)
that span the Trx fold class. Similarity is defined by FAST
scores better than a score of 4.5; edges at this limiting score
represent alignments with a median of 2.75Å RMSD across 72
aligned positions. Each node is colored by a PFAM
Thioredoxin-like Clan family if the chain sequence is a member
of that family. Nodes with thick red borders and bold labels
denote chains present in the hierarchical clustering tree in D.
Labels like 1ON4_A denote PDB ID 1ON4, chain A. B.
Structure similarity network containing the same structures as
in A, shown at the more stringent threshold of 7.5. Edges at
this limiting score correspond to alignments with a median of
2.45Å RMSD across 89 aligned positions. Nodes are colored as in
A. C. Structure similarity network containing the 105
structures from the large connected cluster in B, displayed at
a FAST score cutoff of 12.0; edges at this limiting score
represent alignments with a median of 2.21Å RMSD across 102
aligned positions. Nodes are colored as in A. D.
Complete linkage hierarchical clustering tree based on pairwise
FAST scores for 15 representative structures singled out in the
networks in A-C, with PDB IDs in bold, and associated SWISSPROT
sequence IDs in plain text.
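The thresholding used for panels A-C can be illustrated with a short sketch. It is not the authors' implementation: it assumes that higher FAST scores indicate better alignments, that an edge is kept whenever the pairwise score meets the cutoff, and it uses made-up chain names and scores rather than data from the paper.

```python
from collections import defaultdict

def threshold_network(pair_scores, cutoff):
    """Keep only pairs whose score is at least `cutoff`; return an adjacency map."""
    adjacency = defaultdict(set)
    for (a, b), score in pair_scores.items():
        adjacency[a]  # touch both keys so isolated nodes still appear
        adjacency[b]
        if score >= cutoff:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def connected_clusters(adjacency):
    """Return connected components as a list of sorted node lists."""
    seen, clusters = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters

# Hypothetical pairwise scores for four made-up chains (not data from the paper).
toy_scores = {
    ("chainA", "chainB"): 13.0,
    ("chainB", "chainC"): 8.0,
    ("chainC", "chainD"): 5.0,
    ("chainA", "chainD"): 2.0,
}

# The three cutoffs quoted in the caption.
for cutoff in (4.5, 7.5, 12.0):
    print(cutoff, connected_clusters(threshold_network(toy_scores, cutoff)))
```

Raising the cutoff removes the weakest edges first, so a single connected cluster at the most permissive threshold breaks into smaller groups at the more stringent ones.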

Sequence similarity network, containing 4,082 representative
sequences that are a maximum of 40% identical that span the Trx
fold class. Similarity is defined by pairwise BLAST alignments
better than an E-value of 1×10⁻¹²; edges at this threshold
represent alignments with a median 30% identity over 120
residues, while the rest of the edges represent better
alignments. Each node is colored by the sequence's SWISSPROT
family classification, if available; sequences that are not
classified in SWISSPROT are colored grey. Large nodes represent
sequences that are at least 40% identical to the 159 structures
in Fig. 3. The sequences associated with the 15 representative
structures in Fig. 3C are labeled using bold text and white
arrows. The general locations of other sequences representing
different superfamilies are noted using italicized text.
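The E-value filtering described in this caption can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: it assumes the BLAST hits are available in the standard 12-column tabular layout, keeps the best hit for each unordered sequence pair, and uses made-up sequence IDs and values.

```python
import csv
import io
from statistics import median

# Column order of the standard 12-column BLAST tabular output.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def load_edges(handle, max_evalue=1e-12):
    """Return the best hit per sequence pair with E-value <= max_evalue."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        hit = dict(zip(COLUMNS, row))
        if hit["qseqid"] == hit["sseqid"]:
            continue  # ignore self-hits
        evalue = float(hit["evalue"])
        if evalue > max_evalue:
            continue  # alignment is worse than the threshold
        pair = tuple(sorted((hit["qseqid"], hit["sseqid"])))
        if pair not in best or evalue < best[pair]["evalue"]:
            best[pair] = {
                "evalue": evalue,
                "pident": float(hit["pident"]),
                "length": int(hit["length"]),
            }
    return best

# Hypothetical hits (not data from the paper).
TOY_HITS = "\n".join([
    "seq1\tseq2\t35.0\t130\t80\t2\t1\t130\t5\t134\t1e-20\t90.0",
    "seq2\tseq1\t35.0\t130\t80\t2\t5\t134\t1\t130\t2e-20\t89.0",
    "seq1\tseq3\t28.0\t110\t75\t3\t4\t113\t2\t111\t1e-13\t60.0",
    "seq2\tseq3\t25.0\t90\t65\t1\t10\t99\t8\t97\t1e-6\t40.0",
    "seq1\tseq1\t100.0\t140\t0\t0\t1\t140\t1\t140\t1e-90\t280.0",
])

edges = load_edges(io.StringIO(TOY_HITS))
print(sorted(edges))
# Summary over all retained edges in the toy data.
print(median(e["pident"] for e in edges.values()),
      median(e["length"] for e in edges.values()))
```

Keeping one edge per unordered pair avoids counting the reciprocal hit twice; other choices, such as requiring both directions to pass the threshold, would be equally reasonable.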

A. Sequence similarity network, containing 4,082
representative sequences that are a maximum of 40% identical
that span the Trx fold class. Similarity is defined by pairwise
BLAST alignments better than an E-value of 1×10⁻¹²; edges at
this threshold represent alignments with a median 30% identity
over 120 residues, while the rest of the edges represent better
alignments. Nodes are colored by the number of PFAM
Thioredoxin-like Clan family domains occurring within the
sequence; with the exception of H. influenzae Prx 5 -- labeled
(iii) -- and the monothiol glutaredoxins -- labeled (ii) --
these domains are typically duplications of the same domain,
such as the PDI-type enzymes (iv), which can contain two to
four thioredoxin domains, or the few DSBA-like enzymes (i)
which contain up to three DSBA-like domains. Large nodes
represent sequences that are at least 40% identical to the 159
structures in Fig. 3. The sequences associated with the 15
representative structures in Fig. 3C are labeled using bold text
and white arrows. The locations of other sequences
representing different superfamilies are noted using italicized
text. B. Domain structures for example sequences from
the groups labeled (i)-(iv); some domains are shorter than
expected and this is denoted by a gradient that fades to white.
The sequences are identified by their UNIPROT sequence IDs.

A. 4,082 representative sequences that are a maximum of
40% identical and span the Trx fold class, binned according to
their membership in PFAM families within the Thioredoxin-like
Clan. B. All 29,206 sequences in the Trx fold class.

Fig. S5.
There is good correspondence between the structure and
sequence-based Trx fold class networks

The three views of the structure-based network from Fig. 3 are
repeated in A-C, and panel D contains a sequence-based network
derived from the amino acid sequences in the 159 structure
chains. A. Structure similarity network, containing 159
structures that are a maximum of 60% identical (by sequence)
that span the Trx fold class. Similarity is defined by FAST
scores better than a score of 4.5; edges at this threshold
represent alignments with a median of 2.75Å RMSD across 72
aligned positions, while the rest of the edges represent better
alignments. Each node is colored by a PFAM Thioredoxin-like
Clan family if the chain sequence is a member. Nodes with thick
white borders and bold labels denote chains present in the
hierarchical clustering tree in Fig. 3D. Labels like 1ON4_A
denote PDB ID 1ON4, chain A. B. Structure similarity
network containing the same structures as in A, shown at the
more stringent threshold of 7.5. Edges at this threshold
correspond to alignments with a median of 2.45Å RMSD across 89
aligned positions. Nodes are colored as in A. C.
Structure similarity network containing the 105 structures from
the large connected cluster in B, displayed at a FAST score
cutoff of 12.0; edges at this threshold represent alignments
with a median of 2.21Å RMSD across 102 aligned positions. Nodes
are colored as in A. D. Sequence similarity network,
containing 159 chain sequences from A-C. Similarity is defined
by pairwise BLAST alignments better than an E-value of 1×10⁻⁵;
edges at this threshold represent alignments with a median 27%
identity over 84 residues, while the rest of the edges
represent better alignments.

Fig. S6.
Use of some members of the Trx fold class is restricted to
taxonomic subsets

Here, the sequence similarity network from Fig. 4, containing
4,082 sequences, is colored by the species kingdom (Metazoa,
Fungi, Viridiplantae) or superkingdom (Bacteria, Eukaryota,
Archaea). Note that Eukaryota includes all eukaryotic species
without a more specific kingdom, and is primarily associated
with protozoan parasites. Large nodes represent sequences that
are associated with the structures from Fig. 3. Blue letter
labels correspond to sequence groups in Fig. 5.

Structure-similarity network, containing 159 structures that are
a maximum of 60% identical (by sequence) that span the Trx fold
class. Similarity is defined by FAST scores better than a score of
4.5; edges at this threshold represent alignments with a median of
2.75Å RMSD across 72 aligned positions, while the rest of the edges
represent better alignments.

Load in Cytoscape using File: Import network

Attribute groups: PDB chain; Associated SwissProt sequence info;
Annotation of PDB chain sequence (PFAM Trx Clan HMMs)

Attribute: Description

PDB chain

ID: PDB ID_Chain ID
pdb_chainID: PDB ID_Chain ID
pdbID: PDB ID
comment: PDB structure "title"
exp_method: NMR or X-ray
resolution: if X-ray, resolution; 0.0 if NMR
year: date of deposition in the PDB
het_name: list of non-standard residues (often ligands)
chain_seq: amino acid sequence of PDB chain
chain_seq_length: length of 'chain_seq'
chain_seq_range: seq indices corresponding to SwissProt sequence
chains: list of chain IDs found in entire PDB structure
inClustTree: yes if in clustering tree in Fig. 1D

Associated SwissProt sequence info

sp_ID: associated SwissProt sequence ID
chain_seq_range: SwissProt sequence indices indicating coverage by chain seq
AC: UniProt accession
DE: UniProt definition line
species: species
taxID: NCBI taxonomy ID
SPFamily: SwissProt classification from SIMILARITY line if exists
SPSequence: Full length SwissProt sequence assoc w/ sp_ID
dbsource: sequence is in SwissProt (sp) or TrEMBL (tr) database
seq_length: SwissProt sequence length
strucIDs: colon-separated list of PDB IDs associated with this SwissProt ID in the PDB database
struc_ct: number of PDB IDs associated with this SwissProt ID in the PDB database

Structure-similarity network, containing 159 structures that are
a maximum of 60% identical (by sequence) that span the Trx fold
class. ... same structures as in A, shown at the more stringent
threshold of 7.5. Edges at this threshold correspond to alignments
with a median of 2.45Å RMSD across 89 aligned positions.

Structure-similarity network containing the 105 structures from
the large connected cluster in B, displayed at a FAST score cutoff
of 12.0; edges at this threshold represent alignments with a median
of 2.21Å RMSD across 102 aligned positions.

Sequence-similarity
network, containing 4,082 sequences that are a maximum of 40%
identical that span the Trx fold class. Similarity is defined by
pairwise BLAST alignments better than an E-value of
1×10⁻¹²; edges at this threshold represent alignments
with a median 30% identity over 120 residues, while the rest of the
edges represent better alignments.

1. Unzip file using gunzip (or double-click)

2. Load the xgmml file into Cytoscape using File: Import network

Note: This file is very large and will take at least a few minutes to load
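For readers who want to inspect the node attributes outside Cytoscape, the sketch below reads an XGMML document with the Python standard library. It assumes the usual XGMML layout (node elements carrying att children with name and value, and edge elements with source and target); the toy document, node IDs, and the file name in the final comment are hypothetical.

```python
import gzip
import io
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def read_xgmml(handle):
    """Collect node attributes and edges from an XGMML stream."""
    nodes, edges = {}, []
    for _event, elem in ET.iterparse(handle, events=("end",)):
        kind = local_name(elem.tag)
        if kind == "node":
            nodes[elem.get("id")] = {
                att.get("name"): att.get("value")
                for att in elem
                if local_name(att.tag) == "att"
            }
            elem.clear()  # release memory; the real file is large
        elif kind == "edge":
            edges.append((elem.get("source"), elem.get("target")))
            elem.clear()
    return nodes, edges

# Hypothetical two-node network used only to exercise the parser.
TOY_XGMML = b"""<?xml version="1.0"?>
<graph label="toy" xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="n1" label="seqA"><att name="catType" value="CxxC" type="string"/></node>
  <node id="n2" label="seqB"><att name="catType" value="cxxC" type="string"/></node>
  <edge source="n1" target="n2" label="seqA (pp) seqB"/>
</graph>"""

nodes, edges = read_xgmml(io.BytesIO(TOY_XGMML))
print(nodes["n1"]["catType"], len(edges))

# For the gzipped download (hypothetical file name):
# with gzip.open("network.xgmml.gz", "rb") as handle:
#     nodes, edges = read_xgmml(handle)
```

Streaming with iterparse and clearing each element after use keeps memory use modest, which matters for a file that takes minutes to load in Cytoscape.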

Attribute groups: General; Domains and PFAM Trx Clan HMMs;
Structures; Sequence motifs; Taxonomic

Attribute: Description

General

ID: UniProt:SwissProt/TrEMBL ID
name: UniProt:SwissProt/TrEMBL ID
AC: UniProt accession
DE: UniProt definition line
SPFamily: SwissProt classification from SIMILARITY line if exists; Fig S2 coloring

colon-separated list of PDB IDs associated with this sequence in the PDB database

struc_ct: number of PDB IDs associated with this sequence in the PDB database
idsFrom60nrPdbNet: List of PDB chains with >= 40% identity to this sequence in the 159-structure net from Fig 1 (i.e., chain sequences are also < 60% identity to each other)
nStrucsFrom60nrPdbNet: = count(idsFrom60nrPdbNet)
60nrPDB_ID: PDB ID for representative of this sequence in structure-based network (Fig. 1) (i.e., member of the >= 40% identity sequence cluster containing the sequence associated with that PDB structure chain -- nearest by sequence from 'idsFrom60nrPdbNet')

Sequence motifs

catType: CxxC, Cxxc, cxxC, loopC_C3, other; Fig. 6 coloring
C0: Amino acid at Cxxc position
X1: Amino acid at cXxc position
X2: Amino acid at cxXc position
C3: Amino acid at cxxC position
CXXC: 'C0'+'X1'+'X2'+'C3'
loopC: Amino acid at Cxxxc position
cPorR: is there a Pro or Arg at the N-term of the third beta strand? Fig. 7 coloring
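The motif attributes above can be related to one another with a small sketch. The classification rules here are assumptions inferred from the attribute names, not rules stated in the source: in particular, loopC is taken to be the residue immediately before C0, and loopC_C3 is taken to mean a cysteine at loopC and at C3 but not at C0. The function names are hypothetical.

```python
def motif_positions(sequence, c0_index):
    """Return the loopC, C0, X1, X2, and C3 residues around a motif start.

    `c0_index` is the 0-based index of the C0 position; loopC is assumed
    to be the residue immediately before it.
    """
    return {
        "loopC": sequence[c0_index - 1],
        "C0": sequence[c0_index],
        "X1": sequence[c0_index + 1],
        "X2": sequence[c0_index + 2],
        "C3": sequence[c0_index + 3],
    }

def classify_cat_type(loop_c, c0, c3):
    """Assign a catType label under the assumed rules described above."""
    has_c0 = c0 == "C"
    has_c3 = c3 == "C"
    if has_c0 and has_c3:
        return "CxxC"
    if has_c0:
        return "Cxxc"
    if loop_c == "C" and has_c3:
        return "loopC_C3"  # assumed precedence over plain cxxC
    if has_c3:
        return "cxxC"
    return "other"

# Toy fragment containing the classic thioredoxin WCGPC active site.
pos = motif_positions("AWCGPCK", c0_index=2)
cxxc = pos["C0"] + pos["X1"] + pos["X2"] + pos["C3"]  # the CXXC attribute
print(cxxc, classify_cat_type(pos["loopC"], pos["C0"], pos["C3"]))
```

Under these assumptions, a sequence with serine at C0 and cysteine at C3 would be labeled cxxC unless the residue before C0 is also a cysteine.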

Sequence-similarity network, containing 159 chain sequences from
Fig 1 A-C. Similarity is defined by pairwise BLAST alignments
better than an E-value of 1×10⁻⁵; edges at this threshold represent
alignments with a median 27% identity over 84 residues, while the
rest of the edges represent better alignments.