"Every solution will only lead to new problems."

Saturday, 1. November 2008

In the first part of this series I showed a naïve algorithm for the Damerau-Levenshtein distance which needs O(m*n) space. In the last post I improved the algorithm to use only O(m+n) space. This time I will show a more functional implementation which uses only immutable F#-Lists and works still in O(m+n) space. This version doesn’t need any mutable data.

Last time I showed a naïve implementation of the Damerau-Levenshtein-Distance in F# that needs O(m*n) space. This is really bad if we want to compute the edit distance of large sequences (e.g. DNA sequences). If we look at the algorithm we can easily see that only the last two lines of the (n*m)-matrix are used. This observation leads to a improvement where we compute the distance with only 3 additional arrays of size min(n,m).

Friday, 31. October 2008

Today I am publishing an algorithm for calculating the Damerau-Levenshtein distance in F#. The Levenshtein distance is a metric that allows to measure the amount of difference between two sequences and shows how many edit operations (insert, delete, substitution) are needed to transform one sequence into the other. The Damerau-Levenshtein distance allows the transposition of two characters as an operation. It is often used for spelling corrections or to measure the variation (“edit distance”) between DNA sequences.

let damerauLevenshtein(a:'a array) (b:'a array) =

let init i j =

if j = 0 then i

elif i = 0 then j else 0

let n = a.Length + 1

let m = b.Length + 1

let d = Array2.init n m init

for i in [1..a.Length] do

for j in [1..b.Length] do

let cost =

if a.[i-1] = b.[j-1] then 0 else 1

let deletion = d.[i-1, j] + 1

let insertion = d.[i,j-1] + 1

let substitution = d.[i-1,j-1] + cost

d.[i, j] <-

deletion

|> min insertion

|> min substitution

if i > 1 && j > 1 && a.[i-1] = b.[j-2] &&

a.[i-2] = b.[j-1] then

let transposition = d.[i-2,j-2] + cost

d.[i, j] <- min d.[i,j] transposition

d.[a.Length, b.Length]

This naïve implementation needs quadratic space (O(m*n)). Since the algorithm is used to calculate the edit distance of large DNA sequences this is extremly bad. Next time I will show how we can get linear space (O(m+n)) for the algorithm.