4
Two metric spaces: Edit Distance
Let Σ be a set of symbols and Σ^n the set of finite sequences (strings, or n-tuples) of characters from Σ.
Edit operations on an element of Σ^n are the following:
  adding a character
  deleting a character
  replacing a character
For x, y in Σ^n, ed(x, y) is the minimum number of edit operations needed to transform x into y.
Then (Σ^n, ed) is a metric space.

5
Two metric spaces: The Ulam metric of dimension n
Let Σ be as before, but let P_n be the set of strings of n distinct characters from Σ, where n = |Σ|.
For x, y in P_n, define UL(x, y) to be the number of character moves needed to transform x into y.
(P_n, UL) is a metric space.
The above definition is limited: we need pairs of strings with different characters, so we let n < |Σ| and use ed instead of UL.
We can see that for all x, y: UL(x, y) ≤ ed(x, y) ≤ 2 UL(x, y)
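As a small illustration of UL, the sketch below computes the number of character moves between two permutations of the same symbols. It relies on a standard fact that is not stated on the slide: the minimum number of moves equals n minus the length of the longest common subsequence, which for permutations reduces to a longest increasing subsequence after relabeling. The function name `ulam_distance` is chosen here for illustration only.

```python
from bisect import bisect_left


def ulam_distance(x, y):
    """Number of character moves turning permutation x into permutation y.

    Assumption (not from the slides): for permutations of the same symbol
    set, this number equals n - LCS(x, y).
    """
    if len(set(x)) != len(x) or sorted(x) != sorted(y):
        raise ValueError("x and y must be permutations of the same symbols")
    # Relabel each symbol of x by its position in y; a common subsequence
    # of x and y then becomes an increasing subsequence of seq.
    pos_in_y = {c: i for i, c in enumerate(y)}
    seq = [pos_in_y[c] for c in x]
    # Patience-sorting computation of the LIS length in O(n log n).
    tails = []
    for v in seq:
        k = bisect_left(tails, v)
        if k == len(tails):
            tails.append(v)
        else:
            tails[k] = v
    return len(x) - len(tails)
```

For example, `ulam_distance("abcd", "bcda")` is 1, since moving `a` to the end suffices.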

6
Embeddings
An embedding of a metric space (X, d) into a target metric space (Y, m) is a mapping f : X → Y such that there are real numbers C, s with, for all x, y in X,
d(x, y) ≤ s∙m(f(x), f(y)) ≤ C∙d(x, y)
The minimum C that satisfies the above inequality for some s is called the distortion of the embedding f.
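For a finite point set the distortion can be computed directly from the definition. Writing r = m(f(x), f(y)) / d(x, y) for each pair, the best scale is s = 1 / min r, which gives C = max r / min r. The following is a minimal sketch under the assumptions that the points are distinct and f is injective (so no ratio is zero); the name `distortion` and its argument order are hypothetical.

```python
from itertools import combinations


def distortion(points, d, m, f):
    """Smallest C with d(x, y) <= s*m(f(x), f(y)) <= C*d(x, y) for some s.

    Assumes distinct points and an injective f, so every ratio is positive.
    """
    ratios = [m(f(x), f(y)) / d(x, y) for x, y in combinations(points, 2)]
    # Choosing s = 1 / min(ratios) makes the left inequality tight,
    # and then C must cover the largest ratio.
    return max(ratios) / min(ratios)
```

For instance, with points 0, 1, 3 on the line and f(x) = x², the pairwise ratios are 1, 3 and 4, so the distortion is 4, while a pure scaling such as f(x) = 2x has distortion 1.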

7
Edit distance algorithm in O(n^2)
If LCS(x, y) is the length of the longest common subsequence of x and y, where x, y are strings of length n, then
n – LCS(x, y) ≤ ed(x, y) ≤ 2(n – LCS(x, y))
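The quadratic-time computation is the classical dynamic program; the slide does not spell it out, so the version below is a standard textbook sketch rather than the presenter's own code. It also computes the LCS length so that the inequality above can be checked on small inputs.

```python
def edit_distance(x, y):
    """Minimum number of insertions, deletions and replacements, O(|x||y|)."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                # delete cx
                           cur[j - 1] + 1,             # insert cy
                           prev[j - 1] + (cx != cy)))  # replace or match
        prev = cur
    return prev[-1]


def lcs_length(x, y):
    """Length of the longest common subsequence of x and y, O(|x||y|)."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

For two strings of equal length n, `n - lcs_length(x, y) <= edit_distance(x, y) <= 2 * (n - lcs_length(x, y))` should hold.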

8
Theorem
For every n, the Ulam metric of dimension n can be embedded into ℓ_1^{O(|Σ|^2)} with distortion O(log n).
Let n be an integer, which we may suppose is a power of 2, and let m = |Σ|, so that we can take Σ = {1, 2, …, m}. The embedding is the following:

9
The embedding
The embedding is f : P_n → ℓ_1^{\binom{m}{2}}.
Associate every coordinate of the target space with a distinct pair {a, b}, where a, b are in Σ and a ≠ b. Every permutation p in P_n receives the following coordinates in the new space:
f(p)_{a,b} = 1/(p^{-1}(b) – p^{-1}(a)), if a, b appear in p,
f(p)_{a,b} = 0, if they don’t.
The proof is given by the following two lemmas.
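The coordinates can be computed directly from the definition. The sketch below is only an illustration: it fixes the orientation of each pair as a < b (an assumption, since the pair {a, b} is unordered in the definition), uses 1-based positions for p^{-1}, and introduces the hypothetical helper names `embed` and `l1_distance`.

```python
from itertools import combinations


def embed(p, alphabet):
    """Map a string p of distinct symbols to its pair-indexed coordinates.

    Assumption: each pair is oriented as (a, b) with a < b.
    """
    pos = {c: i for i, c in enumerate(p, 1)}  # p^{-1}, 1-based positions
    coords = {}
    for a, b in combinations(sorted(alphabet), 2):
        if a in pos and b in pos:
            coords[(a, b)] = 1.0 / (pos[b] - pos[a])
        else:
            coords[(a, b)] = 0.0  # a or b does not appear in p
    return coords


def l1_distance(u, v):
    """l_1 distance between two embeddings over the same alphabet."""
    return sum(abs(u[k] - v[k]) for k in u)
```

For Σ = {1, 2, 3}, the permutations 1 2 3 and 2 1 3 differ in all three coordinates, and their ℓ_1 distance is 2 + 0.5 + 0.5 = 3.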

10
Lemma 1 - Expansion
Let p and q be permutations of length n. Then,
║f(p) – f(q)║_1 ≤ O(log n)∙ed(p, q)
Proof: First notice that f can be extended to strings of length less than n. So we only need to show that the inequality holds for the case ed(x, y) = 1, where x has size n and y has size n – 1. Also, we will treat a substitution as a character deletion followed by an insertion.

12
Definitions needed
LIS(p): the longest increasing subsequence of p.
breakpoint: a position i in [k-1] such that p[i] > p[i+1], for p of length k.
b(p): the number of breakpoints in p.
p_0, p_1 are a partition of p if they are disjoint and every x of p appears in p_0 or p_1.
block: a pair of positions {2i – 1, 2i}.
A partition p_0, p_1 is block-balanced if it also splits every block, with one element in each part.
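Two of these definitions are easy to check mechanically. In the sketch below, a partition is represented by the set of 1-based positions assigned to p_0 (the remaining positions go to p_1); this representation and the function names are assumptions made for illustration.

```python
def breakpoints(p):
    """b(p): number of positions i with p[i] > p[i+1]."""
    return sum(1 for i in range(len(p) - 1) if p[i] > p[i + 1])


def is_block_balanced(n, positions0):
    """True if every block {2i-1, 2i} has exactly one position in p_0.

    n is the (even) length of p; positions0 holds the 1-based positions
    assigned to p_0, and all other positions belong to p_1.
    """
    return all(((2 * i - 1) in positions0) != ((2 * i) in positions0)
               for i in range(1, n // 2 + 1))
```

For example, `breakpoints([1, 3, 2, 4])` is 1, and for n = 4 the choice p_0 = {1, 4} is block-balanced while p_0 = {1, 2} is not.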

14
Argument points
We will try to augment LIS(p_0) with points from p_1.
If j is a position in p_0, let j' denote the position such that {j', j} is a block.
If j is in LIS(p_0), then j' is a candidate.
#candidates = LIS(p_0)
LIS(p_0) can always be augmented by LIS(p_0) – 2b(p).
Every breakpoint can only be blamed for at most 2 candidates.

20
Bounded-occurrence strings
(B_{n,t}, ed) embeds with distortion t into the Ulam metric of dimension n over an extended alphabet of size t|Σ|. Consequently, it embeds into ℓ_1 with distortion O(t log n).
Just substitute each a in Σ with a_1, a_2, …, a_t, extending Σ to Σ’ of size t|Σ|.
Substitute the j-th occurrence of a in x with a_j to obtain f(x).
ed(x, y) ≤ ed(f(x), f(y)) ≤ t ed(x, y) follows.
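The substitution step can be sketched in a few lines. Here the extended symbol a_j is represented as the pair (a, j), which is a representation chosen for this illustration; the name `label_occurrences` is likewise hypothetical.

```python
from collections import Counter


def label_occurrences(x):
    """Replace the j-th occurrence of each symbol a in x by the pair (a, j)."""
    seen = Counter()
    out = []
    for a in x:
        seen[a] += 1           # this is the seen[a]-th occurrence of a
        out.append((a, seen[a]))
    return out
```

The result always consists of distinct symbols, so it can be treated as a string over Σ’ in the Ulam setting; for example, `label_occurrences("abab")` gives `[("a", 1), ("b", 1), ("a", 2), ("b", 2)]`.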

21
Sketching t-non-repetitive strings
For every k, there exists a polynomial-time sketching algorithm that solves the k vs Ω(k t log n) gap edit distance problem on t-non-repetitive strings of length n, using sketches of size O(1).
We use the following:
For all k and ε > 0, there exists a polynomial-time sketching algorithm that solves the k vs (1+ε)k gap Hamming distance problem on binary strings of length n, using a sketch of size O(1/ε^2).

22
Sketching t-non-repetitive strings
Convert ℓ_1 into the Hamming metric:
  Round each coordinate to a multiple of 1/(Cn^2) for sufficiently large C > 0 (distortion increases by 2).
  Convert this to an element of the Hamming space…
Use the sketching algorithm for Hamming distance.

23
Locally non-repetitive strings
For every t and every k, there exists an embedding f of the (t, 180tk)-non-repetitive strings into ℓ_1 such that for every two strings x, y,
Ω(min{k, ed(x, y)/(t log(tk))}) ≤ ║f(x) – f(y)║_1 ≤ ed(x, y)
…    Proof    …

26
Resulting…
For every t, k, there exists a polynomial-time sketching algorithm that solves the k vs Ω(t k log k) gap edit distance problem for (t, 180tk)-non-repetitive strings using sketches of size O(1).
This improves a previous result and gives a sketching algorithm for the Ulam metric for this gap (with t = 1).

27
Embed(x) (of the Ulam metric)
(x is the inverse of the permutation; if a is not in the permutation, then x[a] = 0)
A[1..m][1..m]: array of real;
i, j: int
Begin
  for i := 1 to m do
    for j := 1 to i – 1 do
      if x[i]*x[j] <> 0 then
        A[j, i] := 1/(x[i] – x[j])
      else
        A[j, i] := 0;
  output(A);
End.