I'm working with the two-letter alphabet $\{0,1\}$ and with generalized (scattered) subwords, i.e. the letters don't need to be adjacent; for example, $|01010|_{00} = 3$.
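To make the definition concrete, here is a minimal sketch of how such counts can be computed with a small dynamic program. The function name `count_subword` is just an illustrative choice, not something from a library.

```python
def count_subword(word: str, sub: str) -> int:
    """Number of ways `sub` occurs in `word` as a scattered subword."""
    # ways[j] = number of occurrences of sub[:j] in the part of `word` read so far
    ways = [1] + [0] * len(sub)
    for letter in word:
        # Go right to left so each letter of `word` is used at most once
        # per occurrence.
        for j in range(len(sub), 0, -1):
            if sub[j - 1] == letter:
                ways[j] += ways[j - 1]
    return ways[len(sub)]


print(count_subword("01010", "00"))  # 3, matching the example above
```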

For example, the two words $u=1001$ and $v=0110$ contain each subword of length $\leq 2$ (1, 0, 10, 01, 11 and 00) the same number of times, and both have length 4, which happens to be the shortest length at which two distinct words can agree in this way for $k=2$. By brute force I've found the minimal lengths $2,4,7,12,16,22$ for $k=1,2,3,4,5,6$ respectively.
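For reference, a brute-force search along these lines might look like the sketch below. It is not my original code; the names `spectrum` and `min_collision_length` and the cap `max_n` are made up for illustration. It enumerates all binary words of each length and stops at the first length where two distinct words share all counts of subwords of length $\leq k$.

```python
from itertools import product


def spectrum(word, k):
    """Counts of all scattered subwords of length <= k, as a hashable tuple."""
    counts = {"": 1}  # the empty subword occurs exactly once
    for letter in word:
        # Snapshot the current counts, then extend every subword shorter
        # than k by the new letter.
        for sub, c in list(counts.items()):
            if len(sub) < k:
                counts[sub + letter] = counts.get(sub + letter, 0) + c
    return tuple(sorted(counts.items()))


def min_collision_length(k, max_n=12):
    """Smallest n <= max_n with two distinct words sharing a spectrum, else None."""
    for n in range(1, max_n + 1):
        seen = set()
        for word in product("01", repeat=n):
            sig = spectrum(word, k)
            if sig in seen:
                return n
            seen.add(sig)
    return None


for k in (1, 2, 3):
    print(k, min_collision_length(k))
```

The search is exponential in $n$, so it is only practical for the first few values of $k$; `max_n` has to be raised by hand for larger $k$.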

What I'm basically looking for is this: for two words $u,v$ with $|u|=|v|=n$, how large must $k$ be (as a function of $n$) so that if $u$ and $v$ agree on all subwords of length $\leq k$, then $u=v$?

I'm aiming for an $\mathcal{O}(\log n)$ bound, so I've tried looking for an inductive proof on $k$ in which the length of the word is multiplied by a constant at each step.

Also, as a side question: experimentally, I've found that if two words agree on all subwords of length exactly $k$, they also agree on all subwords of length $\leq k$, but I can't quite find a proof or a counterexample.

@Ehsan I'm trying to understand your definition. Why don't the words 1001 and 0110 agree on all subwords of length 2? I get counts of 1,1,2, and 2 for "00", "11", "01" and "10".
– Byron Schmuland, Jun 18 '12 at 18:41

@ByronSchmuland Sorry, found a bug in my code. |w|=4 for k=2, |w|=7 for k=3 and |w|=12 for k=4. Sorry about the confusion, fixed question.
– Ehsan Kia, Jun 18 '12 at 19:12

The last question is simple. The set of subwords can be generated recursively. Given a word of length $n$, there exist $n$ subwords of length $n-1$ each generated by removing one symbol of the word. Therefore the set of subwords with length $< k$ depends only on the set of subwords of length $k$, and not on the word itself.
– Karolis Juodelė, Jun 18 '12 at 19:34

@KarolisJuodelė: This establishes equivalence as a set. To get multiset equivalence, we are also using regularity in the other direction: each occurrence of a subword of length $k-1$ extends to exactly $n-k+1$ occurrences of subwords of length $k$, so the multiplicity of each $(k-1)$-subword is uniquely determined from the $k$-subwords.
– Erick Wong, Jun 21 '12 at 18:29
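
Spelled out, the double count in the last comment can be written as follows. The deletion count $d(y,x)$ is notation introduced here for illustration, not something used in the comments: it is the number of positions $i$ such that deleting the $i$-th letter of $y$ leaves $x$. For a word $w$ of length $n$ and a word $x$ of length $k-1$, count the pairs $S\subset T\subseteq\{1,\dots,n\}$ with $|S|=k-1$, $|T|=k$ and $w$ restricted to $S$ equal to $x$, first by $S$ and then by $T$:
$$(n-k+1)\,|w|_x \;=\; \sum_{|y|=k} d(y,x)\,|w|_y .$$
Since $n-k+1>0$ whenever $k\le n$, the counts $|w|_x$ for $|x|=k-1$ are determined by the counts $|w|_y$ for $|y|=k$, and repeating the argument gives all lengths $\le k$.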

1 Answer

This is a natural problem in combinatorics on words, and it has already been studied.

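We don't actually know a good asymptotic equivalent for $k(n)$, the least $k$ for which every word of length $n$ is determined by its subwords of length $\leq k$ (the thresholds $1,3,6,11,\dots$ are the largest $n$ with $k(n)=1,2,3,4,\dots$). We have the upper bound $k(n)=O(\sqrt n)$, or more precisely: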
$$k(n)\le 5+\left\lfloor\tfrac{16}{7}\sqrt{n}\right\rfloor\qquad\text{(Krasikov1997)}$$
$$k(n)\le 3+2\left\lfloor\sqrt{n\log 2}\right\rfloor\qquad\text{(Krasikov2000)}$$
(Krasikov2000) does not appear to be online (it is cited in (Ligeti2007)), and for some reason this better upper bound is not cited in (Dudik2002), so caveat emptor.

There is an easy $\Omega(\log n)$ lower bound from considering e.g. Thue-Morse words and their complements, but it is not sharp. In fact, $k(n)$ grows faster than any polynomial in $\log n$, so the $O(\log n)$ bound you were hoping for cannot hold:
$$\log k(n)=\Omega(\sqrt{\log n})$$
See (Dudik2002).
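
As a sanity check on the easy lower bound, the following sketch compares the Thue-Morse prefix of length $2^k$ with its complement. The helper names `spectrum`, `thue_morse` and `complement` are illustrative choices. If the two words agree on all subwords of length $\leq k$, as expected, then $k(2^k)>k$, which is where the $\Omega(\log n)$ comes from.

```python
def spectrum(word, k):
    """Counts of all scattered subwords of length <= k, as a hashable tuple."""
    counts = {"": 1}  # the empty subword occurs exactly once
    for letter in word:
        # Snapshot the current counts, then extend every subword shorter
        # than k by the new letter.
        for sub, c in list(counts.items()):
            if len(sub) < k:
                counts[sub + letter] = counts.get(sub + letter, 0) + c
    return tuple(sorted(counts.items()))


def thue_morse(k):
    """Prefix of length 2**k of the Thue-Morse word (parity of binary digit sums)."""
    return "".join(str(bin(i).count("1") % 2) for i in range(2 ** k))


def complement(word):
    """Swap the letters 0 and 1."""
    return word.translate(str.maketrans("01", "10"))


for k in range(1, 5):
    t = thue_morse(k)
    # Prints k, the word length, and whether t and its complement agree up to length k.
    print(k, len(t), spectrum(t, k) == spectrum(complement(t), k))
```

For $k=2$ this is exactly the pair $0110$, $1001$ from the question.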

In a way these two bounds can be reconciled to give the admittedly vague estimate
$$\log \log k(n)=\Theta(\log \log n)$$
but this is still precise enough to rule out logarithmic growth $k(n)=\Theta(\log n)$, for which $\log\log k(n)$ would be $\Theta(\log \log \log n)$.
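
For completeness, here is how the two bounds combine, with $C$ and $c$ standing for unspecified positive constants (they are placeholders, not values taken from the cited papers). The upper bound gives
$$k(n)\le C\sqrt n \;\Longrightarrow\; \log k(n)\le \tfrac12\log n+\log C \;\Longrightarrow\; \log\log k(n)\le \log\log n+O(1),$$
and the lower bound gives
$$\log k(n)\ge c\sqrt{\log n} \;\Longrightarrow\; \log\log k(n)\ge \tfrac12\log\log n+\log c.$$
By contrast, $k(n)=\Theta(\log n)$ would force $\log\log k(n)=\log\log\log n+o(1)$, which is eventually smaller than $\tfrac12\log\log n+\log c$.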