STR detection supplemental material:
------------------------------------
To detect STRs we wrote our own algorithm, implemented in C.
It detects STRs of minimal length L and periodicity P.
The algorithm compares the letter at position i with the letters at
positions i+k*P (k an integer) that lie inside the window of length L, as illustrated in Table I.
Table I. Detection of STR with P=3 and L=15
    C  A  A  C  A  A  C  A  A  C  A  T  C  A  A
C   1        1        1        1        1        =  5
A      1        1        1        1        1     =  5
A         1        1        1       -1        1  =  3
C   1        1        1        1        1        =  5
A      1        1        1        1        1     =  5
A         1        1        1       -1        1  =  3
C   1        1        1        1        1        =  5
A      1        1        1        1        1     =  5
A         1        1        1       -1        1  =  3
C   1        1        1        1        1        =  5
A      1        1        1        1        1     =  5
T        -1       -1       -1        1       -1  = -3  <= mismatch
C   1        1        1        1        1        =  5
A      1        1        1        1        1     =  5
A         1        1        1       -1        1  =  3
The matrix values are +1 when the letters match and -1 otherwise. This way the sum is
positive only when a letter agrees with the majority of the letters it is compared with.
A negative sum points to letters that are not in the consensus; these are called
mismatch letters. We say we have an STR of periodicity P, length L and mismatch window W
when there is no more than 1 mismatch in any window of length W along its sequence.
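The scoring of Table I and the mismatch-window rule can be sketched in C as follows. This is a minimal illustration and not the code used for the study: the names phase_score and within_mismatch_limit are hypothetical, and counting a tied sum of 0 as a mismatch is an assumption that the text leaves open.

```c
#include <stddef.h>

/* Sum of +1/-1 scores for the letter at index i of the window
 * seq[0..len-1]: +1 for each position i+k*p inside the window that
 * holds the same letter, -1 otherwise. The letter is also compared
 * with itself, as in Table I. Assumes p > 0 and i < len. */
int phase_score(const char *seq, size_t len, size_t p, size_t i)
{
    int sum = 0;
    for (size_t j = i % p; j < len; j += p)
        sum += (seq[j] == seq[i]) ? 1 : -1;
    return sum;
}

/* Returns 1 if no window of w consecutive letters holds more than one
 * mismatch letter, 0 otherwise.
 * Assumption: a sum <= 0 (including ties) marks a mismatch letter. */
int within_mismatch_limit(const char *seq, size_t len, size_t p, size_t w)
{
    size_t last = len;              /* previous mismatch; len = none yet */
    for (size_t i = 0; i < len; i++) {
        if (phase_score(seq, len, p, i) <= 0) {
            if (last != len && i - last < w)
                return 0;           /* two mismatches fit in one window */
            last = i;
        }
    }
    return 1;
}
```

For the sequence of Table I with P=3, phase_score gives 5 for the first C and -3 for the T (index 11 when counting from 0), matching the row sums of the table.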
When an STR of length L is found, it is extended by adding the next letters at the right
end of the STR and recomputing the sums until the mismatch level is exceeded. At this
point the STR is stored and the left end of the STR is moved forward until
another STR is found or the length is L again. The process continues along the
whole sequence.
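The right-extension step can be sketched as below. This is a naive illustration under stated assumptions and not the original implementation: accepts and extend_right are hypothetical names, the window is rescored from scratch after each added letter, a tied sum of 0 is counted as a mismatch, and the subsequent advance of the left end is left out.

```c
#include <stddef.h>

/* Hypothetical acceptance test: 1 if seq[0..len-1], scored with period
 * p, has at most one mismatch letter in any window of w letters.
 * Assumption: a sum <= 0 (including ties) marks a mismatch letter. */
static int accepts(const char *seq, size_t len, size_t p, size_t w)
{
    size_t last = len;                  /* len means "no mismatch yet" */
    for (size_t i = 0; i < len; i++) {
        int sum = 0;
        for (size_t j = i % p; j < len; j += p)
            sum += (seq[j] == seq[i]) ? 1 : -1;
        if (sum <= 0) {
            if (last != len && i - last < w)
                return 0;
            last = i;
        }
    }
    return 1;
}

/* Returns the length of the STR that begins at `start` in seq[0..n-1],
 * or 0 if the first min_len letters do not form an STR. The candidate
 * grows one letter at a time until the mismatch level is exceeded or
 * the sequence ends. Assumes p > 0. */
size_t extend_right(const char *seq, size_t n, size_t start,
                    size_t min_len, size_t p, size_t w)
{
    if (start > n || min_len > n - start ||
        !accepts(seq + start, min_len, p, w))
        return 0;
    size_t len = min_len;
    while (start + len < n && accepts(seq + start, len + 1, p, w))
        len++;
    return len;
}
```

Rescoring the whole window at every step is quadratic in the STR length; it is written this way only to keep the sketch short.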
Two or more STRs may overlap, as shown in this example where P=3, L=12 and W=10.
seq:    ATGCAACAACAACTACAAGAAGAAGAACAACAACAACAATTG
str1:   ---CAACAACAACTACAA------------------------
str2:   --------------ACAAGAAGAAGAACAA------------
str3:   ----------------------AAGAACAACAACAACAA---
merged: ---CAACAACAACTACAAGAAGAAGAACAACAACAACAA---
In this case the STRs are merged into a single STR domain.
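Merging overlapping STRs into domains amounts to merging overlapping intervals. The sketch below assumes that the hits are stored as half-open intervals sorted by start position and that only strictly overlapping hits are merged; the StrHit type and the merge_hits function are hypothetical names introduced for illustration.

```c
#include <stddef.h>

/* Half-open interval [start, end) on the sequence (hypothetical type). */
typedef struct {
    size_t start;
    size_t end;
} StrHit;

/* Merges overlapping hits in place. The hits must be sorted by start.
 * Returns the number of merged domains left at the front of the array. */
size_t merge_hits(StrHit *hits, size_t n)
{
    if (n == 0)
        return 0;
    size_t out = 0;                       /* index of the current domain */
    for (size_t i = 1; i < n; i++) {
        if (hits[i].start < hits[out].end) {
            /* Overlap: extend the current domain if needed. */
            if (hits[i].end > hits[out].end)
                hits[out].end = hits[i].end;
        } else {
            hits[++out] = hits[i];        /* start a new domain */
        }
    }
    return out + 1;
}
```

With the three STRs of the example, counted from 0 as [3,18), [14,30) and [22,39), the result is the single domain [3,39), which is 36 letters long like the merged sequence shown.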
STR patterns are filtered for symmetry to avoid counting an STR of periodicity P=3 in the
P=6 category: a repeat such as CAACAA is periodic at both P=3 and P=6.
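The text does not spell out how the symmetry filter works. One plausible reading, sketched here as an assumption, is to test whether the repeat unit of length P is itself a repetition of a shorter unit; primitive_period is a hypothetical helper, and it handles only exact units with no mismatch.

```c
#include <stddef.h>

/* Returns the smallest d dividing p such that unit[0..p-1] consists of
 * unit[0..d-1] repeated p/d times. Returns p when the unit cannot be
 * reduced to a shorter period. */
size_t primitive_period(const char *unit, size_t p)
{
    for (size_t d = 1; d < p; d++) {
        if (p % d != 0)
            continue;                 /* d must divide p */
        size_t i = d;
        while (i < p && unit[i] == unit[i - d])
            i++;
        if (i == p)
            return d;                 /* unit repeats with period d */
    }
    return p;
}
```

Under this reading, an STR found with P=6 would be kept in the P=6 category only when primitive_period returns 6 for its unit: CAACAA reduces to 3 and would be filtered out, while CAAGAA would be kept.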
Significance threshold
----------------------
Since short STRs can occur by chance, we establish the significance threshold by comparing
the number of STRs of periodicity P and exact length L found in a randomized genome against
the number found in the real genome. The randomized genome is produced by randomly permuting
all the letters of the real genome.
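A random permutation that preserves the letter composition can be produced with a Fisher-Yates shuffle. The sketch below is one way to do it and is not taken from the original software: shuffle_genome and next_u64 are hypothetical names, and the small xorshift generator is an assumption, used here because the standard rand() may return values no larger than 32767, which is too small a range for genome-sized arrays.

```c
#include <stddef.h>
#include <stdint.h>

/* xorshift64 pseudo-random generator; *state must be nonzero. */
static uint64_t next_u64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Fisher-Yates shuffle: permutes genome[0..n-1] in place, so the
 * number of each letter is unchanged. The modulo step has a tiny bias
 * that is negligible when n is far below 2^64. */
void shuffle_genome(char *genome, size_t n, uint64_t *state)
{
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)(next_u64(state) % i);   /* 0 <= j < i */
        char tmp = genome[i - 1];
        genome[i - 1] = genome[j];
        genome[j] = tmp;
    }
}
```

The caller keeps the generator state, so the same nonzero seed reproduces the same randomized genome.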
As illustrated in Table II, when the STR length is short the number of
STRs found in the two sets is similar. As the length increases, the count in the random set
drops faster than the count in the real set. We set the significance cutoff at the length where
the ratio of random STRs to real STRs falls below 0.05, i.e. any given STR of that length in
the real set has less than a 5% probability of having occurred by chance.
Table II.
Number of pure STRs (no mismatch) of periodicity P=3 and exact length L in both a
randomized genome and the real genome of Candida albicans (ca19).
The randomness column is the ratio random/real.
 L   random     real   randomness
-----------------------------------
 6   131429   152179   0.86
 7    39909    56351   0.71
 8    12216    25047   0.49
 9     3285     9485   0.35
10     1033     4574   0.23
11      311     3124   0.10
12       84     1481   0.06
13       30      994   0.03  <= cutoff (randomness < 5%)
On the other hand, a genome with no genuine STRs will contain chance STRs in numbers similar
to those of the randomized set. In such cases our software finds no acceptable threshold level,
which reflects the actual situation.
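The cutoff rule can be sketched as a scan over the per-length counts. This is a minimal illustration: the StrCount type and the significance_cutoff function are hypothetical, and taking the first length whose ratio falls below the limit is an assumption about how the rule is applied.

```c
#include <stddef.h>

/* Counts of STRs of one exact length (hypothetical record type). */
typedef struct {
    int  length;
    long n_random;   /* count in the randomized genome */
    long n_real;     /* count in the real genome */
} StrCount;

/* Returns the first length whose random/real ratio is below alpha, or
 * 0 when no length qualifies. Rows must be sorted by increasing length. */
int significance_cutoff(const StrCount *rows, size_t n, double alpha)
{
    for (size_t i = 0; i < n; i++) {
        if (rows[i].n_real > 0 &&
            (double)rows[i].n_random / (double)rows[i].n_real < alpha)
            return rows[i].length;
    }
    return 0;        /* no acceptable threshold found */
}
```

Applied to the counts of Table II with alpha = 0.05, the scan stops at L=13 (30/994 is about 0.03), while L=12 is rejected (84/1481 is about 0.06). A return value of 0 corresponds to the case where no acceptable threshold exists.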
This method can be extended to establish the threshold STR length when one mismatch is tolerated per
window W. The overall threshold levels for C. albicans are summarized in Table III.
Table III.
P=3 L=16 W=16 EW=1 PN=99.266911 Z0=6002 ZR=44
P=3 L=18 W=15 EW=1 PN=99.724074 Z0=4349 ZR=12
P=3 L=19 W=14 EW=1 PN=99.802059 Z0=3789 ZR=7
P=3 L=19 W=13 EW=1 PN=99.808722 Z0=3921 ZR=7
P=3 L=18 W=12 EW=1 PN=99.582229 Z0=4907 ZR=20
P=3 L=18 W=11 EW=1 PN=99.536581 Z0=5071 ZR=23
P=3 L=19 W=10 EW=1 PN=99.712710 Z0=4351 ZR=12
P=3 L=18 W=9 EW=1 PN=99.419643 Z0=5600 ZR=32
P=3 L=22 W=8 EW=1 PN=99.925981 Z0=2702 ZR=2
P=3 L=22 W=7 EW=1 PN=99.894068 Z0=2832 ZR=3
P=3 L=21 W=6 EW=1 PN=99.728132 Z0=4230 ZR=11
P=3 L=23 W=5 EW=1 PN=99.815385 Z0=3250 ZR=6
P=3 L=24 W=4 EW=1 PN=99.656239 Z0=2909 ZR=10
The set of all STRs of periodicity P, with no mismatch or with 1 mismatch per window W where W > P,
that satisfy the above threshold levels is called the STR95 set.