The Human “Earscape” model is a tool
that can be used to optimise the digitisation of audio using the range and
varying sensitivity of human hearing. Humans can only hear sounds from
approximately 20Hz to 20kHz, allowing the omission of frequencies outside of
this range. Additionally, the sensitivity over this range is non-uniform, with
sensitivity dropping off towards the ends of this range. The construction of
human ears makes them especially sensitive to frequencies around 2-5kHz,
covering a large portion of human speech. This correspondence between frequency
and sensitivity, as well as the bounds of human hearing, forms the Human
“Earscape” model. We can use this model to choose our quantisation levels for
different areas of the frequency spectrum to match human hearing, using more
data to represent areas that humans are more sensitive to and less where
sensitivity is lower. This technique of using unequally spaced quantization
levels is used in the A-law and μ-law companding algorithms and dramatically
reduces bandwidth, and hence bitrate, without a significantly perceptible degradation
of quality in the audio (usually speech in this case).
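
As a rough illustration of unequally spaced quantization levels, the sketch
below implements the standard μ-law companding curve on samples normalised to
[-1, 1]. Note that μ-law spaces its levels by amplitude rather than by
frequency. The function names and the simple rounding scheme are assumptions
made for this example and are not taken from any particular codec.

```python
import math

MU = 255  # value used for 8-bit mu-law telephony


def mu_law_compress(x, mu=MU):
    """Map a sample in [-1, 1] onto a logarithmic scale in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantise(x, bits=8, mu=MU):
    """Compress, round to a uniform grid, then expand again.

    Illustrative only: the grid has 2**bits - 1 codes, symmetric about zero.
    """
    steps = 2 ** (bits - 1) - 1  # codes on each side of zero
    code = round(mu_law_compress(x, mu) * steps)
    return mu_law_expand(code / steps, mu)


def level(code, bits=8, mu=MU):
    """Amplitude represented by an integer code (hypothetical helper)."""
    steps = 2 ** (bits - 1) - 1
    return mu_law_expand(code / steps, mu)


# Spacing between neighbouring levels near silence and near full scale.
print("step near zero:", level(1) - level(0))
print("step near full scale:", level(127) - level(126))
```

The levels are packed far more densely around quiet samples than around loud
ones, which is where the saving over uniformly spaced levels comes from.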

Instead of looking at the frequency
range of human hearing and the sensitivity to differing frequencies, the Audio
Noise Masking model focusses on the interaction between neighbouring
frequencies and human perception of these. The construction of the human ear results
in ranges of frequencies stimulating the same nerves. Each band of frequencies
that stimulates the same region of nerves is called a critical band, and the
ability to distinguish multiple frequencies is diminished within each critical
band. In particular, if there is a sufficiently strong signal at a specific
frequency within a critical band, weaker signals at nearby frequencies within
the same critical band are unperceivable. This process is called auditory
masking and is a perceptual weakness that forms the basis for the Audio Noise
Masking model in compression. We can exploit this weakness by observing the
various frequencies within each critical band and removing those which are
unperceivable due to masking. Hence we can significantly reduce the bitrate
without a perceivable degradation of quality by removing inaudible information
and assigning our quantization noise to these areas.
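
The idea of discarding masked components can be sketched with a deliberately
simplified toy model. Here the strongest component in a critical band acts as
the masker, and anything more than a fixed offset below it is treated as
inaudible. The 15 dB offset and the function name are assumptions chosen for
illustration; real psychoacoustic models use thresholds that depend on
frequency distance and signal type.

```python
def drop_masked(components, offset_db=15.0):
    """Remove components assumed to be masked within ONE critical band.

    components: list of (frequency_hz, level_db) pairs.
    offset_db:  hypothetical distance below the masker at which a
                component is treated as inaudible.
    """
    if not components:
        return []
    masker_level = max(level for _, level in components)
    threshold = masker_level - offset_db
    return [(freq, level) for freq, level in components if level >= threshold]


# Four components that fall inside the same critical band around 1 kHz.
band = [(1000, 70.0), (960, 40.0), (1040, 62.0), (1060, 30.0)]
print(drop_masked(band))  # → [(1000, 70.0), (1040, 62.0)]
```

Only the masker and the component within 15 dB of it survive, so two of the
four components no longer need to be encoded.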

Audio Layer I of the MPEG audio
standard begins with the decomposition of the input audio signal into 32 frequency
sub-bands, each around 700Hz wide, together covering the full audible spectrum. This is
achieved by passing the input signal through a filter bank consisting of
multiple band-pass filters, each of which only allows through frequencies
within a certain fixed range. The sub-bands created are designed to mimic the
critical bands of the human auditory system, allowing us to exploit the Audio
Noise Masking model. Each sub-band is quantized and encoded separately according
to this model, with the signal-to-mask ratio being calculated for each. This is
the ratio of signal energy to the masking threshold (the minimum signal
strength required for a sound to be audible in the presence of a masking signal).
This ratio determines the number of quantization levels used.
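
The link between signal-to-mask ratio and quantization levels can be sketched
as follows. Working in decibels, the ratio becomes a difference, and each
extra bit of uniform quantization buys roughly 6.02 dB of signal-to-noise
ratio. The function names and sample figures are made up for this example,
and the actual Layer I encoder allocates bits iteratively from a shared
budget rather than with a single formula.

```python
import math

DB_PER_BIT = 6.02  # approximate SNR gained per bit of uniform quantization


def signal_to_mask_ratio(signal_db, mask_db):
    """Signal-to-mask ratio in dB (a ratio of energies is a difference in dB)."""
    return signal_db - mask_db


def bits_needed(smr_db):
    """Smallest bit count that pushes quantization noise down to the mask."""
    if smr_db <= 0:
        return 0  # the sub-band is fully masked, so nothing needs sending
    return math.ceil(smr_db / DB_PER_BIT)


# Hypothetical (signal level, masking threshold) pairs in dB for three sub-bands.
subbands = {0: (72.0, 40.0), 1: (55.0, 50.0), 2: (30.0, 45.0)}

for index, (signal_db, mask_db) in subbands.items():
    smr = signal_to_mask_ratio(signal_db, mask_db)
    print(index, smr, bits_needed(smr), 2 ** bits_needed(smr))
```

A sub-band whose signal sits well above its masking threshold receives many
levels, while one that lies below the threshold receives none.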

One issue with this process is that the
band-pass filters used to split the input signal are not perfect, introducing
some noise and reducing quality. Additionally, when the audio is decoded, it
must pass through an inverse filter bank that recombines the signal, which again
is a lossy transformation. The sub-bands used here also have constant width
across the audible spectrum, and do not accurately represent the critical bands
of the human ear, which increase in width exponentially with frequency. Hence
these sub-bands do not exactly mirror human auditory behaviour and degrade the
audio quantization. Finally, the band-pass filters are not sharply defined and
so neighbouring sub-bands overlap. This means that a signal at certain
frequencies can affect two bands at once, introducing aliasing artefacts and
degrading quality.

Whilst the compression is
ultimately lossy, the difference in quality between the input and the output is
largely imperceptible to human ears. This is because the degradation in quality
resulting from imperfect splitting and combining transformations is very subtle,
and the quantization noise introduced is allocated to unperceivable parts of
the audio signal. Hence, compared to less sophisticated quantization schemes
(e.g. PCM), an MPEG-1 Layer I output will be of significantly higher quality at
the same bitrate. For this reason, on balance, the division of the input signal
into 32 sub-bands in the MPEG-1 Audio Layer I standard improves the
quantization quality.

The Discrete Cosine Transform used in MPEG/Audio
Layer 3 is a mathematical function applied to the output of the filter bank
used in Layers I & II. This transformation decomposes the input signal into
a sum of component frequencies, representing each with a cosine wave, and
assigning each a coefficient specifying its amplitude. This transformation is
invertible, and both the DCT and its inverse are lossless. This process maps
the input signal from the time domain to the frequency domain, which is the
main purpose of involving DCT in MPEG audio compression. Unlike in the time
domain where a signal’s variation is described over time, the frequency domain
allows us to express a signal purely with respect to frequency. This transformation
hence gives greater insight into the components that make up the signal, and allows
us to perform more advanced compression algorithms on the signal based on human
perception. These use the human psychoacoustic model to filter out perceptually
inaudible parts of the input signal and more closely correspond to human hearing
than those used in the other audio layers. Hence the
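
The decomposition into cosine components, and the fact that it can be undone,
can be checked with a small sketch. This is a plain orthonormal DCT-II and
its inverse written directly from the definition, not the encoder's actual
transform: Layer 3 uses a modified DCT with overlapping windows, which is
not shown. The function names and the 16-sample test frame are assumptions
made for the example.

```python
import math


def dct(frame):
    """Orthonormal DCT-II: time-domain samples -> cosine coefficients."""
    n = len(frame)
    coeffs = []
    for k in range(n):
        total = sum(
            frame[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)
        )
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        coeffs.append(scale * total)
    return coeffs


def idct(coeffs):
    """Inverse of dct (an orthonormal DCT-III)."""
    n = len(coeffs)
    frame = []
    for i in range(n):
        total = coeffs[0] * math.sqrt(1 / n)
        total += sum(
            coeffs[k] * math.sqrt(2 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        frame.append(total)
    return frame


# A frame made of a single cosine component (index 3 of 16).
N = 16
frame = [math.cos(math.pi * (i + 0.5) * 3 / N) for i in range(N)]

coeffs = dct(frame)
restored = idct(coeffs)

print([round(c, 6) for c in coeffs])
print(max(abs(a - b) for a, b in zip(frame, restored)))
```

All of the frame's energy lands in one coefficient, and the round trip
reproduces the input up to floating-point rounding, which is what makes the
frequency-domain view useful as a starting point for perceptual compression.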

One drawback of this compression
technique is that DCT and its inverse are more computationally intensive than
the operations used in the other layers, potentially making encoding, decoding and
manipulating audio in real time slower. However, with modern technology this is
not a concern, with the algorithm running quickly and smoothly on even very
weak systems, and the processing penalty more than justified by the improvement
in audio quality and great reduction in file size.

Similar to the Discrete Cosine
Transform, a Wavelet transform decomposes and represents audio signals as a sum
of wave functions. However, instead of being limited to cosine functions as
with DCT, the Wavelet transform can be performed with almost any family of
functions, including sinusoidal ones. These wavelet functions can be scaled to
different frequencies and also shifted temporally, unlike in DCT, and in
general this process produces a more detailed picture of the signal being
analysed. An optimal wavelet representation can even be chosen for each
individual frame that matches the characteristics of the signal better than a
cosine series. As a result, fewer terms in the sum are required to accurately
encode the frame, reducing the number of non-zero coefficients required. These
coefficients can hence be encoded using fewer bits, reducing the bitrate and
increasing the compression ratio. As a result a Wavelet transform could
dramatically reduce file sizes and increase quality.
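
To make the claim about fewer non-zero coefficients concrete, the sketch below
applies a Haar wavelet transform to a frame that suits it. The Haar wavelet
is chosen here only because it is the simplest to write down; it is an
assumption of this example, and the per-frame search for an optimal wavelet
described above is not shown. The function names are likewise illustrative.

```python
import math

SQRT2 = math.sqrt(2)


def haar_step(samples):
    """One level: scaled pairwise sums (approximation) and differences (detail)."""
    approx = [(samples[i] + samples[i + 1]) / SQRT2 for i in range(0, len(samples), 2)]
    detail = [(samples[i] - samples[i + 1]) / SQRT2 for i in range(0, len(samples), 2)]
    return approx, detail


def haar_transform(samples):
    """Full Haar decomposition; len(samples) must be a power of two."""
    approx = list(samples)
    details = []
    while len(approx) > 1:
        approx, detail = haar_step(approx)
        details = detail + details  # coarser levels go in front
    return approx + details


def haar_inverse(coeffs):
    """Rebuild the samples from the output of haar_transform."""
    approx = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        detail = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        approx = [
            value
            for a, d in zip(approx, detail)
            for value in ((a + d) / SQRT2, (a - d) / SQRT2)
        ]
    return approx


# A frame with one abrupt change, which the Haar wavelet matches well.
frame = [4.0, 4.0, 4.0, 4.0, -2.0, -2.0, -2.0, -2.0]
coeffs = haar_transform(frame)

print([round(c, 3) for c in coeffs])  # only the first two are non-zero
print(haar_inverse(coeffs))
```

Eight samples are described by two non-zero coefficients, so the remaining
six cost almost nothing to encode. A frame with a different shape would
favour a different wavelet.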

However, in order to achieve this we need
to perform analysis on each frame to determine the optimal wavelet, and this is
computationally intensive. Encoding the audio can therefore take a long time as
this optimisation must be performed, although decoding speed is comparable to
that of DCT. We would be able to make use of human perception in our compression
in a similar fashion to that using DCT. However, we would need to adapt the
psychoacoustic model to fit the wavelet representation to achieve this.

In 2000 the JPEG-2000 standard
launched incorporating a wavelet transform and, as with audio
compression, this promised better compression and improved image quality.
However, the standard has been a commercial failure, and even today few
consumers and manufacturers have adopted the standard and many applications do not
support it. This is widely attributed to developers being
reluctant to incorporate the new standard while also supporting the original,
the dominance of basic JPEG, and slower processing. This case study gives

However, many audio standards are in use,
so no single format is as dominant, which leaves room for a new format. Longer
processing time is less of an issue with today's computers, and compression is
becoming very important for audio, e.g. for streaming, so there is potential
for more support. Overall the benefits outweigh the
drawbacks, making the replacement of DCT with a Wavelet transform feasible.