Options

Query and background samples

Composition Profiler provides an easy way to visually investigate bias
in amino acid composition between two sets of protein sequences. A set
of proteins under study (the query sample) can be analyzed
against a representative sample of proteins from the organism under
study, or a group of proteins with a contrasting functional annotation
(the background sample), which provides a suitable background
amino acid distribution.

There is no theoretical limit on the number of sequences that can be
given as input. Composition Profiler treats differences in amino acid
composition as "signals". Because the p-value is a function of both
sample size and signal strength, a sample that is too small for its
inherent signal will yield differences with p-values above the statistical
significance threshold, and these will be discarded as spurious. With small
sample sizes, only the strongest signals will be identified.

For example, if a sample consisting of only one sequence, AAAAAAAAAA,
were analyzed against SwissProt, the difference would be statistically
significant, because 100% A in the sample versus 7.89% A in SwissProt
is a very strong signal (roughly 12-fold enrichment). A larger sample
size allows more subtle signals to be identified.
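
The arithmetic behind this example can be sketched as follows. The function and variable names are illustrative, and the two ways of expressing the enrichment (a plain ratio, and a fractional difference relative to the background) are assumptions about how the "12 fold" figure might be derived; this page does not state the formula.

```python
from collections import Counter

def composition(sequences):
    """Return the percentage of each residue over all sequences pooled together."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in counts.items()}

query = composition(["AAAAAAAAAA"])
background_ala = 7.89  # % Ala in SwissProt, taken from the table of standard datasets

# Two common ways to express the size of the signal (both are assumptions here).
fold = query["A"] / background_ala
fractional_difference = (query["A"] - background_ala) / background_ala

print(f"Ala in query: {query['A']:.1f}%")
print(f"ratio: {fold:.2f}, fractional difference: {fractional_difference:.2f}")
```

Both values round to approximately 12, which is consistent with the figure quoted above.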

Background distribution

Alternatively, the query sample can be analyzed against one of the standard protein datasets:

SwissProt 51 (Bairoch et al., 2005), closest to the distribution of amino acids in nature among the four datasets

PDB Select 25 (Berman et al., 2000), a subset of structures from the Protein Data Bank with less than 25% sequence identity, biased towards the composition of proteins amenable to crystallization studies

Surface residues determined by the Molecular Surface Package over a sample of PDB structures of monomeric proteins, suitable for analyzing phenomena on protein surfaces, such as binding

To expedite the calculations, the amino acid compositions of the standard datasets have been pre-computed as means and standard deviations over 100,000 bootstrap iterations.

Residue    SwissProt (%)   PDB S25 (%)    Surface Residues (%)   DisProt (%)
Ala (A)    7.89 ± 0.05     7.70 ± 0.08    6.03 ± 0.13            8.10 ± 0.35
Arg (R)    5.40 ± 0.04     4.93 ± 0.06    6.56 ± 0.13            4.82 ± 0.23
Asn (N)    4.13 ± 0.04     4.58 ± 0.06    6.23 ± 0.15            3.82 ± 0.27
Asp (D)    5.35 ± 0.03     5.83 ± 0.05    8.18 ± 0.10            5.80 ± 0.30
Cys (C)    1.50 ± 0.02     1.74 ± 0.05    0.78 ± 0.04            0.80 ± 0.08
Gln (Q)    3.95 ± 0.03     3.95 ± 0.05    5.21 ± 0.09            5.27 ± 0.37
Glu (E)    6.67 ± 0.04     6.65 ± 0.07    8.70 ± 0.17            9.89 ± 0.61
Gly (G)    6.96 ± 0.04     7.16 ± 0.07    7.06 ± 0.11            7.41 ± 0.40
His (H)    2.29 ± 0.02     2.41 ± 0.04    2.60 ± 0.06            1.93 ± 0.11
Ile (I)    5.90 ± 0.04     5.61 ± 0.06    2.77 ± 0.07            3.24 ± 0.13
Leu (L)    9.65 ± 0.04     8.68 ± 0.08    5.11 ± 0.08            6.22 ± 0.25
Lys (K)    5.92 ± 0.05     6.37 ± 0.08    9.75 ± 0.16            7.85 ± 0.45
Met (M)    2.38 ± 0.02     2.22 ± 0.04    1.13 ± 0.04            1.87 ± 0.10
Phe (F)    3.96 ± 0.03     3.98 ± 0.04    2.38 ± 0.05            2.44 ± 0.13
Pro (P)    4.83 ± 0.03     4.57 ± 0.05    5.63 ± 0.10            8.11 ± 0.63
Ser (S)    6.83 ± 0.04     6.19 ± 0.06    6.87 ± 0.13            8.65 ± 0.43
Thr (T)    5.41 ± 0.02     5.63 ± 0.05    6.08 ± 0.11            5.56 ± 0.24
Trp (W)    1.13 ± 0.01     1.44 ± 0.03    1.33 ± 0.05            0.67 ± 0.06
Tyr (Y)    3.03 ± 0.02     3.50 ± 0.04    3.58 ± 0.08            2.13 ± 0.15
Val (V)    6.73 ± 0.03     6.72 ± 0.06    4.01 ± 0.06            5.41 ± 0.44

To illustrate the importance of choosing an appropriate background distribution,
we generated composition profiles of PDB Select 25 (A), the surface residues
dataset (B) and DisProt (C) against SwissProt:

All three graphs have the same y-axis scale, the same order of amino acids
and the same color-coding scheme (Vihinen flexibility), which allows direct
visual comparison of the amino acid biases in the three datasets.

Discovery

Alpha value

The statistical significance of a specific enrichment or depletion is estimated using a two-sample t-test between two sequences of binary indicator variables, one sequence for each sample. An enrichment or depletion is statistically significant when its p-value (the lowest significance level at which the null hypothesis, that the same underlying Gaussian distribution generated both samples, can be rejected) is lower than or equal to a user-specified statistical significance (alpha) value.
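
As a rough illustration of the test described above, the sketch below builds one 0/1 indicator per residue position for a chosen amino acid and computes a pooled-variance two-sample t statistic. It is not the program's own code: the function names and toy sequences are made up, the choice of the pooled-variance form is an assumption, and the p-value uses a normal approximation to the t distribution, which is only reasonable for large samples.

```python
import math
from statistics import mean, variance

def indicator(sequences, residue):
    """One 0/1 value per residue position: 1 where the position holds `residue`."""
    return [1 if aa == residue else 0 for seq in sequences for aa in seq]

def t_statistic(x, y):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    t = (mean(x) - mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))
    return t, df

def approx_two_sided_p(t):
    """Normal approximation to the two-sided p-value (assumption: large df)."""
    return math.erfc(abs(t) / math.sqrt(2))

# Toy data, invented for illustration only.
query = ["AAAAGAAKAA", "AALAAAAAEA"]
background = ["MKTLLVAGSD", "GEHRNPQWYF", "ACDIKLMSTV"]

t, df = t_statistic(indicator(query, "A"), indicator(background, "A"))
p = approx_two_sided_p(t)
alpha = 0.05
print(f"t = {t:.2f}, df = {df}, significant: {p <= alpha}")
```

A positive t indicates enrichment of the residue in the query sample and a negative t indicates depletion.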

Bonferroni correction

A correction applied to the alpha value when multiple hypotheses are tested:
the alpha value is divided by the number of hypotheses. See (Weisstein) for
details.
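
A minimal sketch of the correction, assuming one hypothesis per standard amino acid (20 tests). The number of tests the program actually uses is not stated here, and the p-values below are invented for illustration.

```python
def bonferroni_alpha(alpha, n_tests):
    """Bonferroni-corrected per-test significance threshold."""
    return alpha / n_tests

alpha = 0.05
n_tests = 20  # assumption: one test per standard amino acid
corrected = bonferroni_alpha(alpha, n_tests)

# Made-up p-values for three residues.
p_values = {"A": 0.0004, "L": 0.01, "K": 0.03}
significant = [aa for aa, p in p_values.items() if p <= corrected]
print(corrected, significant)
```

In this toy case all three residues pass the uncorrected threshold, but only one passes the corrected threshold.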

Number of bootstrap iterations

In the context of calculating composition differences, bootstrapping
is used for non-parametric estimation of the confidence intervals for
the reported amino acid compositions. More precisely, reported
compositions are means of pseudo-replicate compositions, and
confidence intervals are standard deviations of pseudo-replicate
compositions.
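
The idea can be sketched as follows for a single residue. This is an illustration under stated assumptions rather than the program's implementation: whole sequences are resampled with replacement (resampling individual residues would be another option), and the function names and toy sequences are hypothetical.

```python
import random
from statistics import mean, stdev

def residue_percent(sequences, residue):
    """Percentage of `residue` over all sequences pooled together."""
    pooled = "".join(sequences)
    return 100.0 * pooled.count(residue) / len(pooled)

def bootstrap_composition(sequences, residue, iterations=1000, seed=0):
    """Mean and standard deviation of the residue percentage over pseudo-replicates."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(iterations):
        # Assumption: each pseudo-replicate redraws whole sequences with replacement.
        resampled = rng.choices(sequences, k=len(sequences))
        replicates.append(residue_percent(resampled, residue))
    return mean(replicates), stdev(replicates)

sample = ["AAKL", "GAVA", "MKTA", "AAAA"]  # toy sequences
ala_mean, ala_sd = bootstrap_composition(sample, "A", iterations=500)
print(f"Ala: {ala_mean:.2f} ± {ala_sd:.2f} %")
```

The reported value is the mean over pseudo-replicates, and the spread is their standard deviation, matching the "mean ± standard deviation" form used for the pre-computed datasets.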

In the context of relative entropy calculations, bootstrapping is used
to estimate the statistical significance of the observed relative
entropy. In each iteration, random samples of the two starting samples
are created, relative entropy between the random samples is computed
and compared to the observed relative entropy.
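
One way to picture this procedure is the sketch below. It makes several assumptions that this page does not confirm: relative entropy is computed as the Kullback-Leibler divergence in bits over the 20 standard amino acids, a pseudocount avoids division by zero, and the random samples are produced by shuffling the pooled sequences and splitting them back into two groups of the original sizes. All names and sequences are illustrative.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def distribution(sequences, pseudocount=1.0):
    """Residue frequencies over the 20 standard amino acids (pseudocount is an assumption)."""
    counts = Counter("".join(sequences))
    total = sum(counts[aa] for aa in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return [(counts[aa] + pseudocount) / total for aa in AMINO_ACIDS]

def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

def entropy_p_value(query, background, iterations=1000, seed=0):
    """Observed relative entropy and the fraction of random splits that reach it."""
    rng = random.Random(seed)
    observed = relative_entropy(distribution(query), distribution(background))
    pooled = list(query) + list(background)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(pooled)  # assumption: random samples come from reshuffled pooled data
        a, b = pooled[:len(query)], pooled[len(query):]
        if relative_entropy(distribution(a), distribution(b)) >= observed:
            hits += 1
    return observed, hits / iterations

query = ["AAAAAKAA"] * 6
background = ["MKTLVGSD", "GEHRNPQW", "ACDIKLMS"] * 2
observed, p_value = entropy_p_value(query, background, iterations=1000)
print(f"relative entropy = {observed:.3f} bits, p = {p_value}")
```

With this scheme the smallest non-zero p-value that can be reported is 1 / iterations, so a result of zero only means that the p-value lies below that resolution.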

The observed relative entropy is independent of the number of bootstrap
iterations and is relatively fast to compute. The bulk of the calculation
time goes towards determining the statistical significance, and for large
datasets this time may be considerable. We therefore advise starting the
comparison with a small number of iterations to get a rough estimate of
the p-value, and increasing the number of iterations for comparisons whose
p-value falls below the threshold of 1 / number of iterations.

Here we provide an example of running times for the relative entropy
calculations between the alpha MoRF dataset and a sample of proteins
from SwissProt (both datasets are provided as part of the program
distribution).

Iterations      10    100    1,000    10,000    100,000
Running time    1s    3s     23s      3m 40s    36m 38s

In principle, if we disregard the short initialization period when amino
acids are counted and the counts stored, the running time grows linearly
with the number of iterations.
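
The linear growth can be checked against the example timings above with a short calculation. The interpolation to 50,000 iterations is an illustrative estimate, not a measured value, and timings will differ on other hardware and datasets.

```python
# Example running times from the table above, converted to seconds.
timings = {
    10: 1,
    100: 3,
    1_000: 23,
    10_000: 3 * 60 + 40,
    100_000: 36 * 60 + 38,
}

# Cost per iteration; the small runs are dominated by the initialization period.
per_iteration = {n: t / n for n, t in timings.items()}

# Slope between the two largest runs, where initialization is negligible.
slope = (timings[100_000] - timings[10_000]) / (100_000 - 10_000)

# Hypothetical estimate for a run size that was not measured.
estimate_50k = timings[10_000] + slope * (50_000 - 10_000)

for n, cost in per_iteration.items():
    print(f"{n:>7} iterations: {cost * 1000:.1f} ms per iteration")
print(f"slope: {slope * 1000:.1f} ms per iteration")
print(f"estimated time for 50,000 iterations: {estimate_50k / 60:.1f} min")
```

The per-iteration cost settles at roughly the same value for the three largest runs, which is what linear growth predicts once the initialization cost becomes negligible.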