Computer Science > Data Structures and Algorithms

Title:
The streaming $k$-mismatch problem

Abstract: We consider the problem of approximate pattern matching in a stream. In the
streaming $k$-mismatch problem, we must compute all Hamming distances between a
pattern of length $n$ and successive $n$-length substrings of a longer text, as
long as the Hamming distance is at most $k$. The twin challenges of streaming
pattern matching derive from the need to achieve small, typically sublinear,
working space while also guaranteeing a fast running time for every
arriving symbol of the text.
As a preliminary step, we first give a deterministic $O(k(\log \frac{n}{k} +
\log |\Sigma|))$-bit encoding of all the alignments with Hamming distance at
most $k$ between a pattern and a text of length $O(n)$. (We denote the input
alphabet by the symbol $\Sigma$.) This result, which provides a solution for a
seemingly difficult combinatorial problem, may be of independent interest. We
then go on to give an $O(k\log^3 n\log\frac{n}{k})$-time streaming algorithm
for the streaming $k$-mismatch problem that uses only
$O(k\log{n}\log\frac{n}{k})$ bits of space. The space usage is within
logarithmic factors of optimal and is approximately a factor of $k$ improvement
over the previous record [Clifford et al., SODA 2016].
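
To make the problem statement concrete, the following is a minimal reference sketch of the output that a streaming $k$-mismatch algorithm must produce. It is a naive baseline and not the algorithm from the paper: it keeps the whole text in memory and spends up to $O(n)$ time per alignment, which is exactly what the streaming setting is meant to avoid. The function name `k_mismatch_naive` and the convention of reporting each alignment by the index of its last text symbol are assumptions made for illustration.

```python
from typing import Iterator, Sequence, Tuple


def k_mismatch_naive(
    pattern: Sequence, text: Sequence, k: int
) -> Iterator[Tuple[int, int]]:
    """Yield (end_index, hamming_distance) for every alignment whose
    Hamming distance to the pattern is at most k.

    Naive baseline (hypothetical helper, not the paper's method):
    stores the full text and rescans each window from scratch.
    """
    n = len(pattern)
    # 'end' is the index of the most recently "arrived" text symbol.
    for end in range(n - 1, len(text)):
        start = end - n + 1
        mismatches = 0
        for j in range(n):
            if pattern[j] != text[start + j]:
                mismatches += 1
                if mismatches > k:
                    break  # distance exceeds k, so nothing is reported
        if mismatches <= k:
            yield end, mismatches


if __name__ == "__main__":
    for end, dist in k_mismatch_naive("abc", "abcabdxbc", k=1):
        print(end, dist)
```

Alignments whose distance exceeds $k$ are silently skipped, matching the requirement that distances be reported only "as long as the Hamming distance is at most $k$".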