This thesis presents a fully pipelined and parameterised parallel hardware implementation of a large vocabulary, user-independent and continuous speech recognition system for use in mobile applications. Algorithm acceleration is achieved by realising in hardware the most time-consuming components of the speech recognition system. By adopting a parallel solution, the necessary calculations can be completed in a sufficiently short elapsed time for embedded target systems.
Sphinx 3 is identified as an appropriate speech recognition system for this work and is profiled to determine the most time-consuming parts of the code. Because these parts rely on floating-point calculations, which are poorly suited to high-performance, low-power execution on embedded systems, they have been converted to scaled integer operations. Tests on the AN4, RM1 and TIMIT speech databases verify that the scaled integer version of the speech recognition system achieves a word error rate similar to that of the original floating-point version, while taking less than 8% of the calculation time used by the original version.
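As a rough illustration of what a scaled integer conversion can look like, the sketch below evaluates a diagonal-covariance log-Gaussian score once in floating point and once with integers only. It is a minimal sketch under stated assumptions, not the conversion used in this work: the 10-bit scale factor, the function names and the sample values are all hypothetical, and neither routine is taken from Sphinx 3.

```python
# Hypothetical fixed-point format: values are stored as integers scaled by 2**10.
SCALE_BITS = 10
SCALE = 1 << SCALE_BITS


def to_fixed(value):
    """Convert a float to a scaled integer (assumed Q10 format)."""
    return int(round(value * SCALE))


def log_gauss_float(x, mean, half_inv_var, log_const):
    """Reference score: log_const - sum((x - mean)^2 * 1/(2*var))."""
    return log_const - sum(
        (xi - mi) ** 2 * pi for xi, mi, pi in zip(x, mean, half_inv_var)
    )


def log_gauss_fixed(xq, meanq, half_inv_varq, log_constq):
    """Same score using only integer subtract, multiply and shift."""
    acc = log_constq
    for xi, mi, pi in zip(xq, meanq, half_inv_varq):
        diff = xi - mi                       # scaled by 2**10
        sq = (diff * diff) >> SCALE_BITS     # rescale product back to 2**10
        acc -= (sq * pi) >> SCALE_BITS       # rescale again after weighting
    return acc


# Illustrative values only; they are not drawn from any acoustic model.
x = [0.5, -1.25, 2.0]
mean = [0.25, -1.0, 1.5]
half_inv_var = [0.5, 1.0, 2.0]
log_const = -3.0

score_float = log_gauss_float(x, mean, half_inv_var, log_const)
score_fixed = log_gauss_fixed(
    [to_fixed(v) for v in x],
    [to_fixed(v) for v in mean],
    [to_fixed(v) for v in half_inv_var],
    to_fixed(log_const),
)
print(score_float, score_fixed / SCALE)
```

The integer version needs only subtraction, multiplication and shifts, which is the kind of arithmetic that maps readily onto embedded processors and parallel hardware units. The choice of scale factor trades precision against word width and would have to be validated against recognition accuracy.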
The scaled integer version of the speech recognition system is redesigned in VHDL for parallel implementation in electronic hardware. The designs of a calculation module and a data module are described. Both modules can be configured according to the number of parallel units, and the data module can also be configured according to the total numbers of feature vectors and senones used in the speech representation. The hardware designs are synthesised to a range of FPGAs, and the results show that the larger Virtex-7 devices are capable of holding several thousand senones, which is sufficient for most recognition tasks. Hardware designs with different numbers of parallel calculation units are simulated at both the behavioural level and the platform-based level, and the resulting implementations are able to operate in real time. The results show that the hardware implementation, even with only one calculation unit, can perform the same calculations almost 80 times faster than a modern embedded microprocessor, even when operating at only one fifth of the clock frequency. With larger numbers of parallel calculation units, the whole design can operate at even lower clock frequencies, saving power while maintaining a rapid calculation speed. The hardware designs are also implemented on a physical system having both an FPGA and a microprocessor board to demonstrate the operational capabilities of a full system.

Description:

A Doctoral Thesis. Submitted in partial fulfillment of the requirements for the award of Doctor of Philosophy of Loughborough University.