11.2. Dynamic Time Warping

One of the earliest approaches to isolated word speech
recognition was to store a prototypical version of each word (called
a template) in the vocabulary and compare incoming speech with each
word, taking the closest match. This presents two problems: what
form do the templates take and how are they compared to incoming
signals.

The simplest form for a template is a sequence of feature
vectors -- that is the same form as the incoming speech. We will
assume this kind of template for the remainder of this discussion.
The template is a single utterance of the word selected to be
typical by some process; for example, by choosing the template which
best matches a cohort of training utterances.

Comparing the template with incoming speech might be achieved
via a pairwise comparison of the feature vectors in each. The total
distance between the sequences would be the sum or the mean of the
individual distances between feature vectors. The problem with this
approach is that if a constant window spacing is used, the lengths
of the input and stored sequences is unlikely to be the
same. Moreover, within a word, there will be variation in the length
of individual phonemes: Cassidy might be
uttered with a long /A/ and short final /i/ or with a short /A/ and
long /i/. The matching process needs to compensate for length
differences and take account of the non-linear nature of the length
differences within the words.

The Dynamic Time Warping algorithm achieves this goal; it
finds an optimal match between two sequences of feature vectors
which allows for streached and compressed sections of the
sequence. The paper by Sakoe and Chiba (Dynamic Programming
Algorithm Optimisation for Spoken Word Recognition) gives a detailed
description of the algorithm; I will summarise it here.

11.2.1. The DTW Grid

We can arrange the two sequences of observations on the sides
of a grid (Figure 11.1) with the unknown sequence
on the bottom (six observations in the example) and the stored
template up the left hand side (eight observations). Both
sequences start on the bottom left of the grid. Inside each cell
we can place a distance measure comparing the corresponding
elements of the two sequences.

Figure 11.1. An example DTW grid

To find the best match between these two sequences we can
find a path through the grid which minimises the total distance
between them. The path shown in blue in Figure 11.1
gives an example. Here, the first and second elements of each
sequence match together while the third element of the input also
matches best against the second element of the stored pattern.
This corresponds to a section of the stored pattern being stretched
in the input. Similarly, the fourth element of the input matches
both the second and third elements of the stored sequence: here a
section of the stored sequence has been compressed in the input
sequence. Once an overall best path has been found the total
distance between the two sequences can be calculated for this
stored template.

The procedure for computing this overall distance measure is
to find all possible routes through the grid and for each one of
these compute the overall distance. The overall distance is given
in Sakoe and Chiba, Equation 5, as the minimum of the sum of the
distances between individual elements on the path divided by the
sum of the warping function (which will be discussed later). The
division is to make paths of different lengths comparable.

It should be apparent that for any reasonably sized
sequences, the number of possible paths through the grid will be
very large. In addition, many of the distance measures could be
avoided since the first element of the input is unlikely to match
the last element of the template for example. The DTW algorithm is
designed to exploit some observations about the likely solution to
make the comparison between sequences more efficient.

11.2.2. Optimisations

The major optimisations to the DTW algorithm arise from
observations on the nature of good paths through the grid. These
are outlined in Sakoe and Chiba and can be summarised as:

Monotonic condition: the path will
not turn back on itself, both the i and j indexes either stay
the same or increase, they never decrease.

Continuity condition: The path
advances one step at a time. Both i and j can only increase by
1 on each step along the path.

Boundary condition: the path starts
at the bottom left and ends at the top right.

Adjustment window condition: a good
path is unlikely to wander very far from the diagonal. The
distance that the path is allowed to wander is the window length
r.

Slope constraint condition: The path
should not be too steep or too shallow. This prevents very short
sequences matching very long ones. The condition is expressed as
a ratio n/m where m is the number of steps in the x direction
and m is the number in the y direction. After m steps in x you
must make a step in y and vice versa.

By applying these observations we can restrict the moves that
can be made from any point in the path and so restrict the number
of paths that need to be considered. For example, with a slope
constraint of P=1, if a path has already moved one square up it
must next move either diagonally or to the right.

The power of the DTW algorithm goes beyond these observations
though. Instead of finding all possible routes through the grid
which satisfy these constraints, the DTW algorithm works by keeping
track of the cost of the best path to each point in the grid.
During the match process we have no idea which path is the lowest
cost path; but this can be traced back when we reach the end point.

11.2.3. The Weighting Function

A path through the grid is written in the paper as
F=c(1),c(2)...c(K), the generalised element of
the path is c(k) and this consists of a pair
of coordinates in the i (input) and
j (stored) directions. The i coordinate of the
kth path element is i(k).

The weighting function w(k) introduced
into the overall distance measure is used to normalise for the path
length. Two alternate weighting functions are presented: symmetric
and asymmetric. Both functions are derived from the distance
travelled (in grid units) in the last step of the path. The
symmetric form combines the i and j directions while the asymmetric
form uses just the i direction.

If the path has just made a diagonal step then i and j both
increase by 1 and the symmetric w(k) = 1+1 = 2; the asymmetric w(k)
= 1; The sum of this function over the length of the path gives a
measure of how long the path is. This is used in normalising the
overall distance measures.