Data smoothing is one of those processes that is easy to
implement with a glib formula, but has much more profound implications than most
users realise. In the following we assume that we start off with a set of
numbers, x_k, that have resulted from sampling some process in
the real world, such as temperature, and that the interval between samples is T.

Technically, data smoothing is a form of low-pass
filtering, which means that it blocks out the high frequency components (short
wiggles) in order to emphasise the low frequency ones (longer trends). There are
two popular forms: (a) the running mean (or moving average) and (b) the
exponentially weighted average. They are both implemented by means of efficient
recursive formulae, which in the usual notation are

(a)  y_k = y_{k-1} + (x_k - x_{k-n}) / n

(b)  y_k = b·y_{k-1} + (1 - b)·x_k

In each case, from an original sequence of numbers, x_k,
a new smoothed sequence, y_k, is formed. In (a), as each new
number is added into the average, the one that entered n samples earlier is dropped,
so each new number is the average of the last n old numbers. In (b) a
fraction of the next number in the old sequence is added to the complementary
fraction of the last number in the new sequence, which means that each number
from the old sequence has less and less influence as it recedes into the past.
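
As a minimal sketch (in Python, assuming the notation above, and treating samples
before the start of the record as zero), the two recursive updates might look like
this:

    def running_mean(x, n):
        """Running mean of width n via y[k] = y[k-1] + (x[k] - x[k-n]) / n.

        Samples before the start of the sequence are treated as zero, so the
        first n outputs ramp up from zero (the start-up transient discussed below).
        """
        y = []
        s = 0.0                    # sum of the current window
        for k, xk in enumerate(x):
            s += xk
            if k >= n:
                s -= x[k - n]      # drop the sample leaving the window
            y.append(s / n)
        return y

    def exp_weighted(x, b):
        """Exponentially weighted average via y[k] = b*y[k-1] + (1 - b)*x[k]."""
        y = []
        prev = 0.0                 # starting from zero also gives a start-up transient
        for xk in x:
            prev = b * prev + (1 - b) * xk
            y.append(prev)
        return y

Either way, each output point costs only a couple of arithmetic operations, however
heavy the smoothing.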

SPECIAL NOTE: It is NOT necessary to
recalculate a complete average for each new point. It is surprising how often
this is done. Even the Mathcad® statistical tutorial falls into this trap. In
the running mean, smoothing a sequence of length L then results in (n-1)L
unnecessary calculations; with strong smoothing of a long sequence (say n = 100
over L = 10,000 points) that is nearly a million superfluous additions, resulting
in long calculation times.
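
For contrast, a hypothetical non-recursive version of the running mean, which
re-sums the whole window at every output point, gives the same result as the
recursive sketch above but does roughly n times the work:

    def running_mean_naive(x, n):
        """Running mean computed by re-summing the window at every point:
        about (n - 1) extra additions per output compared with the recursive form."""
        return [sum(x[max(0, k - n + 1):k + 1]) / n for k in range(len(x))]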

Each of these methods has one parameter that must be
chosen. The value of n determines how many numbers from the old sequence
are averaged to produce each point in the new sequence. The value of b
determines the effective time constant of the filter (actually –T/ln(b), so that,
for example, b = 0.9 gives a time constant of about 9.5 sample intervals).

Complications

1. Transient response

It is one of the implications of the uncertainty principle
that, when we take a finite block of a process to represent the whole of it,
there are unavoidable errors. In this case they take the form of the transient
response. You can demonstrate the transient response by putting the step
function test sequence (x_k = 1, 1, 1, 1, 1, …) into each formula. As this sequence is
already smooth, the ideal output should be the same as the input, but the running
mean ramps up to the value 1 over n samples, while the second formula
produces an exponential rise towards the value 1, never quite getting there. Thus
the running mean has the advantage that its transient response is finite in
length, though the errors in the exponential weighting formula become negligible
after a couple of time constants.
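
Using the sketch functions above, the effect is easy to check on a short step
sequence (n = 5 and b = 0.8 are purely illustrative choices):

    step = [1.0] * 20

    print(running_mean(step, 5))    # 0.2, 0.4, 0.6, 0.8, 1.0, 1.0, ... reaches 1 after n samples
    print(exp_weighted(step, 0.8))  # 0.2, 0.36, 0.488, 0.59, ... approaches 1 but never reaches it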

Various methods are used to overcome this problem in the
running mean without discarding the first n output points. One is to
taper the average, so that the first output point is an average of one, the
second an average of two etc. up to the nth point. This means
that the beginning of the output sequence is relatively unsmoothed, which can be
misleading. Another slightly better method is to precalculate the average of the
input sequence and pack n numbers equal to this value into the front of
the sequence. In either case, it is not desirable to make deductions from the
first n smoothed points.
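
As a sketch of the first (tapered) method, using the same recursive sum but dividing
by the number of samples actually included so far:

    def running_mean_tapered(x, n):
        """Running mean with a tapered start-up: the k-th output is the average of
        min(k + 1, n) samples, so there is no ramp from zero, but the first n - 1
        points are less heavily smoothed than the rest."""
        y = []
        s = 0.0
        for k, xk in enumerate(x):
            s += xk
            if k >= n:
                s -= x[k - n]
            y.append(s / min(k + 1, n))
        return y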

2. Frequency response

The frequency response of the running mean formula is
actually rather complicated, taking the form of what is known as a sinc
function. This goes through a number of zeroes and a number of maxima as the
frequency increases. For example, take the response for n=5 and n=8.
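
A small numerical sketch of that response (assuming NumPy, and the standard gain
expression |sin(πfnT) / (n·sin(πfT))| for an n-point running mean with sample
interval T):

    import numpy as np

    def running_mean_gain(f, n, T=1.0):
        """Gain of an n-point running mean at the frequencies in the array f.

        With T = 1 the frequencies are in cycles per sample and 0.5 is the
        Nyquist frequency.
        """
        w = np.pi * f * T
        num = np.abs(np.sin(n * w))
        den = n * np.abs(np.sin(w))
        gain = np.ones_like(w)           # the ratio tends to 1 as f approaches 0
        ok = ~np.isclose(den, 0.0)
        gain[ok] = num[ok] / den[ok]
        return gain

    f = np.linspace(0.0, 0.5, 1001)
    for n in (5, 8):
        g = running_mean_gain(f, n)
        sidelobe = g[f > 1.0 / n].max()  # largest gain beyond the first zero at f = 1/(nT)
        print(f"n={n}: zeros at f = 1/{n}, 2/{n}, ...; largest gain beyond the first zero is about {sidelobe:.2f}")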

We can see that some interfering frequencies can be
completely eliminated; yet a higher frequency is only reduced by a factor of 5.
Thus we have to be very careful about identifying apparent periodicities in data
smoothed by this method.

The frequency response of the exponentially weighted average falls away smoothly
and monotonically (from 1 at zero frequency to (1-b)/(1+b) at the Nyquist
frequency), so it does not have these problems.

3. Phase response

The running mean is what is known technically as a linear
phase filter, which means that, though all the frequency components are treated
with different gains, they are all delayed by the same length of time (here
(n-1)T/2). The exponentially weighted average does not have this property, so
there is an extra form of distortion of the shape of the sequence.

Discussion

Data smoothing is a very useful technique for emphasising
apparent slow trends in sequences of data. We have to be very careful, however,
not to push it too far, especially in trying to identify periodicities in the
data. We must also avoid giving too much credence to variations at the beginning
(or the end!) of the smoothed sequence. Given these provisos, both the exponentially weighted
average and the running mean are effective and can be implemented by means of
efficient recursive formulae, though surprisingly often extremely inefficient
non-recursive forms are applied. These simple examples are of real-time (or
one-sided) filters, which only use present and past values, a necessary
constraint in many important applications. There are many more elaborate
methods, which require a much higher level of precaution.

The illustrations are condensed from Laboratory
online computing, a very old (1975) and forgotten book by the author.