Let's say I have a set $S = \{s_1, \dots, s_i, \dots, s_P\}$ of $P$ identical strings over a $k$-letter alphabet, each of length $|s_i| = L$. I pick a string in $S$ uniformly at random, pick a position in that string uniformly at random, and substitute the character there with another one. I do this $N$ times.

For $N \gg 1$, all $s_i$ will approximate random sequences. But what is the average Hamming distance between any two strings as a function of $N$?

For each character change, the average distance (for small $N$) goes up by $2/P$ with high probability. When $N$ gets to be a significant fraction of $LP$, the average distance probably approaches its limit with some form of exponential decay, again with high probability. For how big an $N$ do you need to know this? Also with what confidence level? Gerhard "Probably Sure It's Mostly Correct" Paseman, 2011.05.24
– Gerhard Paseman, May 25 '11 at 6:53

Gerhard, thanks for your comment! I'm interested in $N$ up to the point where the average Hamming distance between any two sequences is between $0.1L$ and $0.5L$.
– user14324, May 25 '11 at 7:09

A qualitative understanding of what a plot of Average Hamming Distance vs. "N" looks like is really what I'm after. I'm not really hopeful that anyone will be able to provide an explicit function...
– user14324, May 25 '11 at 7:11

You can plot the results provided by Prof. Israel. I believe they will reflect my remarks above. If you do some simulations of your model (which is only slightly different from the one answered), I will be surprised if you see any qualitative difference for $k > 2$. Even for $k=2$ I suspect the simulations will resemble the formula's prediction, so Prof. Israel's answer will still give you the qualitative understanding you mention. Gerhard "Mostly Sure It's Probably Correct" Paseman, 2011.05.25
– Gerhard Paseman, May 25 '11 at 7:47

1 Answer

At each move, I assume you choose one of the character positions in one of the strings (with equal probabilities for all), and replace the character in that position by a randomly chosen character (with equal probabilities for all - note that this allows the possibility that the new character is the same as the old one).
Let $X(n)$ be the event that after $n$ moves, the $i$'th character in string $j_1$ is the same as the $i$'th character in string $j_2$ (with $j_1 \ne j_2$). If the position chosen in move $n$ was anything other than the $i$'th character of string $j_1$ or $j_2$, which happens with probability $1 - \frac{2}{PL}$, then $X(n)$ holds exactly when $X(n-1)$ does; if it was one of those two positions, $X(n)$ holds with conditional probability $1/k$. Thus ${\rm P}(X(n)) = (1 - \frac{2}{PL}) {\rm P}(X(n-1)) + \frac{2}{PLk}$ with ${\rm P}(X(0)) = 1$. The solution of this recurrence is
${\rm P}(X(n)) = \frac{1}{k} + \frac{k-1}{k} \left( 1-\frac{2}{PL} \right)^n$.
The expected Hamming distance between strings $j_1$ and $j_2$ after $n$ moves is
$L (1 - {\rm P}(X(n)))$.
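To see what the curve looks like, here is a minimal Python sketch that evaluates the closed form above and compares it with a direct simulation of the same model. The function names and the parameter values are only illustrative assumptions, and characters are represented as the integers $0, \dots, k-1$.

```python
import random

def expected_hamming(n, P, L, k):
    """Closed form L * (1 - P(X(n))) for the model in which the new
    character may coincide with the old one."""
    p_same = 1.0 / k + (k - 1.0) / k * (1.0 - 2.0 / (P * L)) ** n
    return L * (1.0 - p_same)

def simulate_mean_hamming(n, P, L, k, trials=200, seed=0):
    """Monte Carlo estimate of the mean Hamming distance between
    strings 0 and 1 after n moves (illustrative helper, not from the answer)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        strings = [[0] * L for _ in range(P)]  # P identical strings
        for _ in range(n):
            j = rng.randrange(P)               # uniformly chosen string
            i = rng.randrange(L)               # uniformly chosen position
            strings[j][i] = rng.randrange(k)   # new character may equal the old one
        total += sum(a != b for a, b in zip(strings[0], strings[1]))
    return total / trials

if __name__ == "__main__":
    P, L, k = 4, 20, 4
    for n in (0, 10, 50, 100, 400):
        print(n, expected_hamming(n, P, L, k), simulate_mean_hamming(n, P, L, k))
```

The printed values rise roughly linearly (slope about $2(k-1)/(kP)$ per move) for small $n$ and then saturate towards $L(1 - 1/k)$.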

(added in response to unknown(yahoo)'s further question): if the new character must be different from the existing one at that position, the recurrence becomes ${\rm P}(X(n)) = (1 - \frac{2}{PL}) {\rm P}(X(n-1)) + \frac{2}{PL} \frac{1-{\rm P}(X(n-1))}{k-1}$, and the solution is ${\rm P}(X(n)) = \frac{1}{k} + \frac{k-1}{k} \left(1 - \frac{2k}{PL(k-1)}\right)^n$. Again the expected Hamming distance after $n$ moves is $L (1 - {\rm P}(X(n)))$.
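The forced-change variant can be checked the same way. The following is a rough Python sketch under the same assumptions as before (hypothetical function names, characters encoded as $0, \dots, k-1$, and $k \ge 2$); the only change in the simulation is that the replacement is drawn from the $k-1$ characters other than the current one.

```python
import random

def expected_hamming_forced(n, P, L, k):
    """Closed form L * (1 - P(X(n))) when the new character must differ
    from the one currently at the chosen position (requires k >= 2)."""
    rate = 2.0 * k / (P * L * (k - 1.0))
    p_same = 1.0 / k + (k - 1.0) / k * (1.0 - rate) ** n
    return L * (1.0 - p_same)

def simulate_forced(n, P, L, k, trials=200, seed=0):
    """Monte Carlo estimate of the mean Hamming distance between
    strings 0 and 1 under the forced-change rule (illustrative helper)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        strings = [[0] * L for _ in range(P)]
        for _ in range(n):
            j = rng.randrange(P)
            i = rng.randrange(L)
            # Draw uniformly from the k-1 characters other than the current one:
            # pick from 0..k-2 and skip over the current value.
            new = rng.randrange(k - 1)
            if new >= strings[j][i]:
                new += 1
            strings[j][i] = new
        total += sum(a != b for a, b in zip(strings[0], strings[1]))
    return total / trials

if __name__ == "__main__":
    P, L, k = 4, 20, 4
    for n in (1, 10, 50, 100, 400):
        print(n, expected_hamming_forced(n, P, L, k), simulate_forced(n, P, L, k))
```

After a single move the closed form gives exactly $2/P$, since the one changed character is guaranteed to differ from the original.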

...and once again I underestimate the people here. =)
– user14324, May 25 '11 at 7:24

@Robert Israel, pressing my luck: regarding "with equal probabilities for all - note that this allows the possibility that the new character is the same as the old one", what if we enforce a rule that a substitution must not reselect the character that already exists at the randomly selected string position?
– user14324, May 25 '11 at 7:28

Another way to get this is that the strings are identical in a position if neither has been altered there, which happens with probability $p = (1-2/(PL))^n$, or if at least one has been altered and you got lucky, probability $(1-p)/k$, total $p+(1-p)/k$. The average Hamming distance can be computed from these linearly.
– Douglas Zare, May 25 '11 at 7:33
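
Spelling out the comment above as a short check, with $p = (1-2/(PL))^n$ and the model in which the new character may equal the old one, the two cases add up to the closed form in the answer:

$p + \frac{1-p}{k} = \frac{1}{k} + \frac{k-1}{k}\, p = \frac{1}{k} + \frac{k-1}{k} \left( 1-\frac{2}{PL} \right)^n$,

so by linearity over the $L$ positions the expected Hamming distance is $L \left(1 - p - \frac{1-p}{k}\right) = L \, \frac{k-1}{k} \, (1 - p)$.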