Dr Philip Jackson

Biography

I am interested in what you can do with acoustical signals, including speech, music and the everyday sounds all around us. Through research on projects such as Nephthys, Columbo, BALTHASAR, DANSA, SAVEE, DynamicFaces, QESTRAL, POSZ, UDRC2 and S3A, I have contributed to active noise control for aircraft, speech aero-acoustics, source separation and articulatory models for automatic speech recognition, audio-visual emotion classification and visual speech synthesis, including new techniques for spatial audio and personal sound.

I joined CVSSP in 2002 after a UK postdoctoral fellowship at University of Birmingham, with PhD in Electronics & Computer Science from University of Southampton (2000) and MA from Cambridge University Engineering Department (1997). I now have over 100 journal, patent, conference and book publications (Google h-index=12) and serve as associate editor for Computer Speech & Language (Elsevier), and as reviewer for the Journal of the Acoustical Society of America, IEEE/ACM Transactions on Audio, Speech & Language Processing, IEEE Signal Processing Letters, InterSpeech and ICASSP.

Publications

Recent advances in human-computer interaction technology go beyond the successful transfer of data between human and machine by seeking to improve the naturalness and friendliness of user interactions. An important augmentation, and potential source of feedback, comes from recognizing the user's expressed emotion or affect. This chapter presents an overview of research efforts to classify emotion using different modalities: audio, visual and audio-visual combined. Theories of emotion provide a framework for defining emotional categories or classes. The first step, then, in the study of human affect recognition involves the construction of suitable databases. The authors describe fifteen audio, visual and audio-visual data sets, and the types of feature that researchers have used to represent the emotional content. They discuss data-driven methods of feature selection and reduction, which discard noise and irrelevant information to maximize the concentration of useful information. They focus on the popular types of classifier that are used to decide to which emotion class a given example belongs, and methods of fusing information from multiple modalities. Finally, the authors point to some interesting areas for future investigation in this field, and conclude.

This paper presents an investigation of the visual variation on the bilabial plosive consonant /p/ in three coarticulation contexts. The aim is to provide detailed ensemble analysis to assist coarticulation modelling in visual speech synthesis. The underlying dynamics of labeled visual speech units, represented as lip shape, from symmetric VCV utterances, is investigated. Variation in lip dynamics is quantitatively and qualitatively analyzed. This analysis shows that there are statistically significant differences in both the lip shape and trajectory during coarticulation.

Reverberant speech source separation has been of great interest for over a decade, leading to two major approaches. One of them, known as blind source separation (BSS), is based on statistical properties of the signals and mixing process. The other approach, computational auditory scene analysis (CASA), is inspired by the human auditory system and exploits monaural and binaural cues. In this paper these two approaches are studied and compared in more depth.

Recognition of expressed emotion from speech and facial gestures was investigated in experiments on an audio-visual emotional database. A total of 106 audio and 240 visual features were extracted, and features were then selected with the Plus l-Take Away r algorithm based on the Bhattacharyya distance criterion. In the second step, linear transformation methods, principal component analysis (PCA) and linear discriminant analysis (LDA), were applied to the selected features, and Gaussian classifiers were used for classification of emotions. The performance was higher for LDA features than for PCA features. The visual features performed better than audio features, for both PCA and LDA. Across a range of fusion schemes, the audio-visual feature results were close to those of the visual features. The highest recognition rates achieved were 53% with audio features, 98% with visual features, and 98% with audio-visual features selected by Bhattacharyya distance and transformed by LDA.
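As a rough illustration of a Bhattacharyya distance criterion for feature selection, the sketch below scores each feature by the distance between two classes, assuming each feature is modelled as a univariate Gaussian per class. This is a simplification: the abstract does not state the form of the criterion used, and the function names and toy data are hypothetical.

```python
import math

def bhattacharyya_gaussian(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two univariate Gaussians."""
    mean_term = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    var_term = 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2)))
    return mean_term + var_term

def mean_and_variance(values):
    """Sample mean and (biased) variance of a list of numbers."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v

def rank_features(class_a, class_b):
    """Return feature indices sorted from most to least discriminative.

    class_a and class_b are lists of samples; each sample is a list of
    feature values. Every feature must have non-zero variance in both classes.
    """
    scores = []
    for j in range(len(class_a[0])):
        mu_a, var_a = mean_and_variance([s[j] for s in class_a])
        mu_b, var_b = mean_and_variance([s[j] for s in class_b])
        scores.append((bhattacharyya_gaussian(mu_a, var_a, mu_b, var_b), j))
    return [j for _, j in sorted(scores, reverse=True)]

# Toy data (hypothetical): feature 1 separates the classes, feature 0 does not.
class_a = [[0.0, 1.0], [1.0, 1.2], [2.0, 0.8]]
class_b = [[0.1, 5.0], [1.1, 5.2], [1.9, 4.8]]
ranking = rank_features(class_a, class_b)
```

A sequential search such as Plus l-Take Away r would use a score of this kind to decide which features to add or remove at each step.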

We present a novel method for extracting target speech from auditory mixtures using bimodal coherence, which is statistically characterised by a Gaussian mixture model (GMM) in the offline training process, using the robust features obtained from the audio-visual speech. We then adjust the ICA-separated spectral components using the bimodal coherence in the time-frequency domain, to mitigate the scale ambiguities in different frequency bins. We tested our algorithm on the XM2VTS database, and the results show a performance improvement with our proposed algorithm in terms of SIR measurements.

Natural-sounding reproduction of sound over headphones requires accurate estimation of an individual's Head-Related Impulse Responses (HRIRs), capturing details relating to the size and shape of the body, head and ears. A stereo-vision face capture system was used to obtain 3D geometry, which provided surface data for boundary element method (BEM) acoustical simulation. Audio recordings were filtered by the output HRIRs to generate samples for a comparative listening test alongside samples generated with dummy-head HRIRs. Preliminary assessment showed better localization judgements with the personalized HRIRs by the corresponding participant, whereas other listeners performed better with dummy-head HRIRs, which is consistent with expectations for personalized HRIRs. The use of visual measurements for enhancing users' auditory experience merits investigation with additional participants.

Sound zone reproduction facilitates listeners wishing to consume personal audio content within the same acoustic enclosure by filtering loudspeaker signals to create constructive and destructive interference in different spatial regions. Published solutions to the sound zone problem are derived from areas such as sound field synthesis and beamforming. The first contribution of this thesis is a comparative study of multi-point approaches. A new metric of planarity is adopted to analyse the spatial distribution of energy in the target zone, and the well-established metrics of acoustic contrast and control effort are also used. Simulations and experimental results demonstrate the advantages and disadvantages of the approaches. Energy cancellation produces good acoustic contrast but allows very little control over the target sound field; synthesis-derived approaches precisely control the target sound field but produce less contrast.
Motivated by the limitations of the existing optimization methods, the central contribution of this thesis is a proposed optimization cost function "planarity control", which maximizes the acoustic contrast between the zones while controlling sound field planarity by projecting the target zone energy into a spatial domain. Planarity control is shown to achieve good contrast and high target zone planarity over a large frequency range. The method also has potential for reproducing stereophonic material in the context of sound zones.
The remaining contributions consider two further practical concerns. First, judicious choice of the regularization parameter is shown to have a significant effect on the contrast, effort and robustness. Second, attention is given to the problem of optimally positioning the loudspeakers via a numerical framework and objective function.
The simulation and experimental results presented in this thesis represent a significant addition to the literature and will influence the future choices of control methods, regularization and loudspeaker placement for personal audio. Future systems may incorporate 3D rendering and listener tracking.
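The acoustic contrast and control effort metrics named in the abstract above can be sketched as follows, assuming the complex sound pressures at the sampling points of each zone and the complex loudspeaker weights are already available. The definitions follow common usage (ratio of spatially averaged squared pressures, and array energy relative to a single reference source); the exact normalisation used in the thesis is not given here, and the sample values are hypothetical.

```python
import math

def mean_square(pressures):
    """Spatially averaged squared pressure magnitude over a zone."""
    return sum(abs(p) ** 2 for p in pressures) / len(pressures)

def acoustic_contrast_db(bright, dark):
    """Ratio of mean squared pressure in the bright zone to the dark zone, in dB."""
    return 10.0 * math.log10(mean_square(bright) / mean_square(dark))

def control_effort_db(weights, reference_weight):
    """Total array energy relative to one reference source, in dB.

    reference_weight is assumed to be the weight a single loudspeaker
    would need to produce the same bright-zone level.
    """
    array_energy = sum(abs(q) ** 2 for q in weights)
    return 10.0 * math.log10(array_energy / abs(reference_weight) ** 2)

# Hypothetical complex pressures at three points in each zone.
bright_zone = [1 + 0j, 1j, -1 + 0j]
dark_zone = [0.1 + 0j, 0.1j, -0.1 + 0j]
contrast = acoustic_contrast_db(bright_zone, dark_zone)

# Two equally driven loudspeakers compared with a unit reference source.
effort = control_effort_db([1 + 0j, 1j], 1 + 0j)
```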

The aim of the study is to learn the relationship between facial movements and the acoustics of speech sounds. We recorded a database of 3D video of the face, including markers, and corresponding synchronized audio of a single speaker. The database consists of 110 English sentences. These sentences were selected for strong expressive content in the fundamental emotions: Anger, Surprise, Sadness, Happiness, Fear and Disgust. Comparisons are made with the same sentences with neutral expression. Principal component analysis of the marker movements was performed to identify significant modes of variation. The results of this analysis show that there are various characteristic differences between visual features of emotional versus neutral speech. The findings of the current research provide a basis for generating realistic animations of emotional speech for applications such as computer games and films.

Underdetermined reverberant speech separation is a challenging problem in source separation that has received considerable attention in both computational auditory scene analysis (CASA) and blind source separation (BSS). Recent studies suggest that, in general, the performance of frequency domain BSS methods suffers from the permutation problem across frequencies, which worsens in high reverberation, while CASA methods perform less effectively for closely spaced sources. This paper presents a method to address these limitations, based on the combination of binaural and BSS cues for the automatic classification of time-frequency (T-F) units of the speech mixture spectrogram. By modeling the interaural phase difference, the interaural level difference and frequency-bin mixing vectors, we integrate the coherent information for each source within a probabilistic framework. The Expectation Maximization (EM) algorithm is then used iteratively to refine the soft assignment of T-F regions to sources and re-estimate their model parameters. The coherence between the left and right recordings is also calculated to model the precedence effect, which is then incorporated into the algorithm to reduce the effect of reverberation. Binaural room impulse responses for 5 different rooms with various acoustic properties have been used to generate the source images and the mixtures. The proposed method compares favorably with state-of-the-art baseline algorithms by Mandel et al. and Sawada et al., in terms of signal-to-distortion ratio (SDR) of the separated source signals.

An automatic method for identifying critical, dependent and redundant roles in speech articulation is presented. Critical articulators are identified using the Kullback-Leibler divergence between phone-specific and model pdfs, which are initialised to the grand pdfs for each articulator. Model pdfs of critical and dependent articulators, those significantly correlated with the critical ones, are updated accordingly for both 1D and 2D cases, as long as the divergence exceeds the threshold. Those unaffected are termed redundant. Algorithm performance is evaluated on the MOCHA-TIMIT database by comparison with phonetic features. Results are also given for an exhaustive search, and principal component analysis of articulatory fleshpoints. Implications of being able to extract phonetic constraints automatically from articulatory recordings are discussed.
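The divergence test at the heart of this approach can be illustrated for the 1D case. The sketch below assumes Gaussian pdfs and computes the Kullback-Leibler divergence from a phone-specific pdf to the grand (model) pdf for each articulator, then picks the articulator with the largest divergence above a threshold. The direction of the divergence, the threshold value, the articulator names and all numbers are assumptions for illustration, not taken from the paper.

```python
import math

def kl_gaussian(mu_p, var_p, mu_q, var_q):
    """KL divergence D(p || q) between univariate Gaussians p and q."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)

def most_critical(phone_pdfs, grand_pdfs, threshold):
    """Return the articulator whose phone-specific pdf diverges most from
    its grand pdf, or None if no divergence exceeds the threshold.

    Both arguments map articulator name -> (mean, variance).
    """
    best_name, best_div = None, threshold
    for name, (mu_p, var_p) in phone_pdfs.items():
        mu_q, var_q = grand_pdfs[name]
        div = kl_gaussian(mu_p, var_p, mu_q, var_q)
        if div > best_div:
            best_name, best_div = name, div
    return best_name

# Hypothetical normalised articulator statistics for one phone.
grand = {"tongue_tip": (0.0, 1.0), "lower_lip": (0.0, 1.0)}
phone = {"tongue_tip": (1.0, 0.05), "lower_lip": (0.1, 0.9)}
critical = most_critical(phone, grand, threshold=0.5)
```

In this toy case the tongue tip is tightly constrained for the phone (small variance, shifted mean), so it diverges strongly from the grand pdf and is selected.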

In this paper the mixing vector (MV) in the statistical mixing model is compared to the binaural cues represented by interaural level and phase differences (ILD and IPD). It is shown that the MV distributions are quite distinct while binaural models overlap when the sources are close to each other. On the other hand, the binaural cues are more robust to high reverberation than MV models. According to this complementary behavior we introduce a new robust algorithm for stereo speech separation which considers both additive and convolutive noise signals to model the MV and binaural cues in parallel and estimate probabilistic time-frequency masks. The contribution of each cue to the final decision is also adjusted by weighting the log-likelihoods of the cues empirically. Furthermore, the permutation problem of the frequency domain blind source separation (BSS) is addressed by initializing the MVs based on binaural cues. Experiments are performed systematically on determined and underdetermined speech mixtures in five rooms with various acoustic properties including anechoic, highly reverberant, and spatially-diffuse noise conditions. The results in terms of signal-to-distortion-ratio (SDR) confirm the benefits of integrating the MV and binaural cues, as compared with two state-of-the-art baseline algorithms which only use MV or the binaural cues.
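For readers unfamiliar with the binaural cues, the sketch below computes the ILD and IPD for a single time-frequency unit from the complex short-time Fourier transform coefficients of the left and right channels. It is a minimal illustration of the cue definitions only, not of the separation algorithm; the function name and sample values are hypothetical.

```python
import cmath
import math

def interaural_cues(left, right):
    """Return (ILD in dB, IPD in radians) for one time-frequency unit.

    left and right are complex STFT coefficients; right must be non-zero.
    """
    ratio = left / right
    ild_db = 20.0 * math.log10(abs(ratio))
    ipd_rad = cmath.phase(ratio)  # already wrapped to (-pi, pi]
    return ild_db, ipd_rad

# Hypothetical unit: left channel twice as strong and leading by 0.3 rad.
left_coeff = 2.0 * cmath.exp(0.3j)
right_coeff = 1.0 + 0.0j
ild, ipd = interaural_cues(left_coeff, right_coeff)
```

Because the IPD is only known modulo 2π, it becomes ambiguous at higher frequencies for widely spaced microphones, which is one reason to combine it with other cues.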

Audio production is moving toward an object-based approach, where content is represented as audio together with metadata that describe the sound scene. From current object definitions, it would usually be expected that the audio portion of the object is free from interfering sources. This poses a potential problem for object-based capture, if microphones cannot be placed close to a source. This paper investigates the application of microphone array beamforming to separate a mixture into distinct audio objects. Real mixtures recorded by a 48-channel microphone array in reflective rooms were separated, and the results were evaluated using perceptual models in addition to physical measures based on the beam pattern. The effect of interfering objects was reduced by applying the beamforming techniques.

The ability to replicate a plane wave represents an essential element of spatial sound field reproduction. In sound field synthesis, the desired field is often formulated as a plane wave and the error minimized; for other sound field control methods, the energy density or energy ratio is maximized. In all cases and further to the reproduction error, it is informative to characterize how planar the resultant sound field is. This paper presents a method for quantifying a region's acoustic planarity by superdirective beamforming with an array of microphones, which analyzes the azimuthal distribution of impinging waves and hence derives the planarity. Estimates are obtained for a variety of simulated sound field types, tested with respect to array orientation, wavenumber, and number of microphones. A range of microphone configurations is examined. Results are compared with delay-and-sum beamforming, which is equivalent to spatial Fourier decomposition. The superdirective beamformer provides better characterization of sound fields and is effective with a moderate number of omni-directional microphones over a broad frequency range. Practical investigation of planarity estimation in real sound fields is needed to demonstrate its validity as a physical sound field evaluation measure.

In this paper we present a system for localization and separation
of multiple speech sources using phase cues. The novelty of this
method is the use of Random Sample Consensus (RANSAC) approach
to find consistency of interaural phase differences (IPDs)
across the whole frequency range. This approach is inherently free
from phase ambiguity problems and enables all phase data to contribute
to localization. Another property of RANSAC is its robustness
against outliers which enables multiple source localization
with phase data contaminated by reverberation noise. Results of
RANSAC based localization are fed into a mixture model to generate
time-frequency binary masks for separation. System performance
is compared against other well known methods and shows
similar or improved performance in reverberant conditions.
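The idea of using RANSAC to find a delay consistent with IPDs across all frequencies can be sketched for a single source as follows. Each iteration samples one frequency bin, enumerates the delays that could explain its wrapped phase, and counts how many other bins agree. This is a simplified illustration under stated assumptions (one source, a fixed inlier tolerance, synthetic data), not the system described in the paper; all names and parameter values are hypothetical.

```python
import math
import random

TWO_PI = 2.0 * math.pi

def wrap(phi):
    """Wrap a phase in radians to the interval [-pi, pi)."""
    return (phi + math.pi) % TWO_PI - math.pi

def ransac_delay(freqs, ipds, max_delay, n_iter=200, tol=0.3, seed=0):
    """Estimate an interaural delay (seconds) from wrapped IPDs.

    Returns the delay with the most inliers and its inlier count.
    """
    rng = random.Random(seed)
    best_tau, best_count = 0.0, -1
    for _ in range(n_iter):
        i = rng.randrange(len(freqs))
        f, phi = freqs[i], ipds[i]
        # Enumerate the 2*pi ambiguities that keep |tau| within max_delay.
        k_max = int(f * max_delay + 0.5) + 1
        for k in range(-k_max, k_max + 1):
            tau = (phi + TWO_PI * k) / (TWO_PI * f)
            if abs(tau) > max_delay:
                continue
            count = sum(1 for fj, pj in zip(freqs, ipds)
                        if abs(wrap(pj - TWO_PI * fj * tau)) < tol)
            if count > best_count:
                best_tau, best_count = tau, count
    return best_tau, best_count

# Synthetic test data: a 0.3 ms delay with every fifth bin replaced by
# a random phase to mimic reverberation outliers.
true_tau = 3e-4
freqs = [100.0 * (i + 1) for i in range(40)]
ipds = [wrap(TWO_PI * f * true_tau) for f in freqs]
outlier_rng = random.Random(1)
for i in range(0, 40, 5):
    ipds[i] = outlier_rng.uniform(-math.pi, math.pi)

est_tau, n_inliers = ransac_delay(freqs, ipds, max_delay=1e-3)
```

Because candidates are scored on wrapped phase error over all bins, high-frequency bins contribute without any explicit phase unwrapping, and the random-phase bins are simply outvoted.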

Sound field control to create multiple personal audio spaces (sound zones) in a shared listening environment is an active research topic. Typically, sound zones in the literature have aimed to reproduce monophonic audio programme material. The planarity control optimization approach can reproduce sound zones with high levels of acoustic contrast, while constraining the energy flux distribution in the target zone to impinge from a certain range of azimuths. Such a constraint has been shown to reduce problematic self-cancellation artefacts such as uneven sound pressure levels and complex phase patterns within the target zone. Furthermore, multichannel reproduction systems have the potential to reproduce spatial audio content at arbitrary listening positions (although most exclusively target a 'sweet spot'). By designing the planarity control to constrain the impinging energy rather tightly, a sound field approximating a plane-wave can be reproduced for a listener in an arbitrarily-placed target zone. In this study, the application of planarity control for stereo reproduction in the context of a personal audio system was investigated. Four solutions, to provide virtual left and right channels for two audio programmes, were calculated and superposed to achieve the stereo effect in two separate sound zones. The performance was measured in an acoustically treated studio using a 60 channel circular array, and compared against a least-squares pressure matching solution whereby each channel was reproduced as a plane wave field. Results demonstrate that planarity control achieved 6 dB greater mean contrast than the least-squares case over the range 250-2000 Hz. Based on the principal directions of arrival across frequency, planarity control produced azimuthal RMSE of 4.2/4.5 degrees for the left/right channels respectively (least-squares 2.8/3.6 degrees).
Future work should investigate the perceived spatial quality of the implemented system with respect to a reference stereophonic setup.

Reproduction of multiple sound zones, in which personal audio programs may be consumed without the need for headphones, is an active topic in acoustical signal processing. Many approaches to sound zone reproduction do not consider control of the bright zone phase, which may lead to self-cancellation problems if the loudspeakers surround the zones. Conversely, control of the phase in a least-squares sense comes at a cost of decreased level difference between the zones and frequency range of cancellation. Single-zone approaches have considered plane wave reproduction by focusing the sound energy into a point in the wavenumber domain. In this article, a planar bright zone is reproduced via planarity control, which constrains the bright zone energy to impinge from a narrow range of angles via projection into a spatial domain. Simulation results using a circular array surrounding two zones show the method to produce superior contrast to the least-squares approach, and superior planarity to the contrast maximization approach. Practical performance measurements obtained in an acoustically treated room verify the conclusions drawn under free-field conditions.

Recent attention to the problem of controlling multiple loudspeakers to create sound zones has been directed toward practical issues arising from system robustness concerns. In this study, the effects of regularization are analyzed for three representative sound zoning methods. Regularization governs the control effort required to drive the loudspeaker array, via a constraint in each optimization cost function. Simulations show that regularization has a significant effect on the sound zone performance, both under ideal anechoic conditions and when systematic errors are introduced between calculation of the source weights and their application to the system. Results are obtained for speed of sound variations and loudspeaker positioning errors with respect to the source weights calculated. Judicious selection of the regularization parameter is shown to be a primary concern for sound zone system designers: the acoustic contrast can be increased by up to 50 dB with proper regularization in the presence of errors. A frequency-dependent minimum regularization parameter is determined based on the conditioning of the matrix inverse. The regularization parameter can be further increased to improve performance depending on the control effort constraints, expected magnitude of errors, and desired sound field properties of the system.

Spontaneous speech in videos capturing the speaker's mouth provides bimodal information.
Exploiting the relationship between the audio and visual streams, we propose a new visual
voice activity detection (VAD) algorithm, to overcome the vulnerability of conventional
audio VAD techniques in the presence of background interference. First, a novel lip
extraction algorithm combining rotational templates and prior shape constraints with
active contours is introduced. The visual features are then obtained from the extracted
lip region. Second, with the audio voice activity vector used in training, adaboosting is
applied to the visual features, to generate a strong final voice activity classifier by
boosting a set of weak classifiers. We have tested our lip extraction algorithm on the
XM2VTS database (with higher resolution) and some video clips from YouTube (with lower
resolution). The visual VAD was shown to offer low error rates.

The audio scene from broadcast soccer can be used for identifying highlights from the game. Audio cues derived from these sources provide valuable information about game events, as can the detection of key words used by the commentators.
In this paper we interpret the feasibility of incorporating both commentator word recognition and information about the additive background noise in an HMM structure. A limited set of audio cues, which have been extracted from
data collected from the 2006 FIFA World Cup, are used to create an extension to the Aurora-2 database. The new database is then tested with various PMC models and compared to the standard baseline, clean and multi-condition training methods. It is found that incorporating SNR and noise type information into the PMC process is beneficial to recognition performance.

Audio (spectral) and modulation (envelope) frequencies both carry information in a speech signal. While low modulation frequencies (2-20Hz) convey syllable information, higher modulation frequencies (80-400Hz) allow for assimilation of perceptual cues, e.g., the roughness of amplitude-modulated noise in voiced fricatives, considered here. Psychoacoustic 3-interval forced-choice experiments measured AM detection thresholds for modulated noise accompanied by a tone with matching fundamental frequency at 125Hz: (1) tone-to-noise ratio (TNR) and phase between tone and noise envelope were varied, with silence between intervals; (2) as (1) with continuous tone throughout each trial; (3) duration and noise spectral shape were varied. Results from (1) showed increased threshold (worse detection) for louder tones (40-50dB TNR). In (2), a similar effect was observed for the in-phase condition, but out-of-phase AM detection appeared immune to the tone. As expected, (3) showed increased thresholds for shorter tokens, although still detectable at 60ms, and no effect for spectral shape. The phase effect of (2) held for the short stimuli, with implications for fricative speech tokens (40ms-100ms). Further work will evaluate the strength of this surprisingly robust cue in speech.

The acoustic environment affects the properties of the audio signals recorded. Generally, given room impulse responses (RIRs), three sets of parameters have to be extracted in order to create an acoustic model of the environment: sources, sensors and reflector positions. In this paper, the cross-correlation based iterative sensor position estimation (CISPE) algorithm is presented, a new method to estimate a microphone configuration, together with source and reflector position estimators. A rough measurement of the microphone positions initializes the process; then a recursive algorithm is applied to improve the estimates, exploiting a delay-and-sum beamformer. Knowing where the microphones lie in the space, the dynamic programming projected phase slope algorithm (DYPSA) extracts the times of arrival (TOAs) of the direct sounds from the RIRs, and multiple signal classification (MUSIC) extracts the directions of arrival (DOAs). A triangulation technique is then applied to estimate the source positions. Finally, exploiting properties of 3D quadratic surfaces (namely, ellipsoids), reflecting planes are localized via a technique ported from image processing, by random sample consensus (RANSAC). Simulation tests were performed on measured RIR datasets acquired from three different rooms located at the University of Surrey, using either a uniform circular array (UCA) or uniform rectangular array (URA) of microphones. Results showed small improvements with CISPE pre-processing in almost every case.

Phonetic detail of voiced and unvoiced fricatives was examined using speech analysis tools. Outputs of eight f0 trackers were combined to give reliable voicing and f0 values. Log-energy and Mel frequency cepstral features were used to train a Gaussian classifier that objectively labeled speech frames for frication. Duration statistics were derived from the voicing and frication labels for distinguishing between unvoiced and voiced fricatives in British English and European Portuguese.

The spatial attribute envelopment has long been considered an important property of excellent concert hall acoustics. In the past, research in this area has produced the definition listener envelopment (LEV) and several equations designed to predict it. However with the recent development of multichannel audio systems capable of positioning sound sources all around the listener, it is apparent that the attribute is not so easily defined and that a more appropriate definition may be needed. This poster introduces a definition of envelopment more appropriate for multichannel audio and outlines a recent pilot experiment conducted by the authors.

Representing a complex acoustic scene with audio objects is desirable but challenging in object-based spatial audio production and reproduction, especially when concurrent sound signals are present in the scene. Source separation (SS) provides a potentially useful and enabling tool for audio object extraction. These extracted objects are often remixed to reconstruct a sound field in the reproduction stage. A suitable SS method is expected to produce audio objects that ultimately deliver high quality audio after remix. The performance of these SS algorithms therefore needs to be evaluated in this context. Existing metrics for SS performance evaluation, however, do not take into account the essential sound field reconstruction process. To address this problem, here we propose a new SS evaluation method which employs a remixing strategy similar to the panning law, and provides a framework to incorporate the conventional SS metrics. We have tested our proposed method on real-room recordings processed with four SS methods, including two state-of-the art blind source separation (BSS) methods and two classic beamforming algorithms. The evaluation results based on three conventional SS metrics are analysed.

Underdetermined reverberant speech separation is a challenging problem in source separation that has received considerable attention in both computational auditory scene analysis (CASA) and blind source separation (BSS). Recent studies suggest that, in general, the performance of frequency domain BSS methods suffers from the permutation problem across frequencies, which worsens in high reverberation, while CASA methods perform less effectively for closely spaced sources. This paper presents a method to address these limitations, based on the combination of monaural, binaural and BSS cues for the automatic classification of time-frequency (T-F) units of the speech mixture spectrogram. By modeling the interaural phase difference, the interaural level difference and frequency-bin mixing vectors, we integrate the coherence information for each source within a probabilistic framework. The Expectation-Maximization (EM) algorithm is then used iteratively to refine the soft assignment of T-F regions to sources and re-estimate their model parameters. It is observed that the reliability of the cues affects the accuracy of the estimates and varies with respect to cue type and frequency. As such, the contribution of each cue to the assignment decision is adjusted by weighting the log-likelihoods of the cues empirically, which significantly improves the performance. Results are reported for binaural speech mixtures in five rooms covering a range of reverberation times and direct-to-reverberant ratios. The proposed method compares favorably with state-of-the-art baseline algorithms by Mandel et al. and Sawada et al., in terms of signal-to-distortion ratio (SDR) of the separated source signals. The paper also investigates the effect of introducing spectral cues for integration within the same framework.
Analysis of the experimental outcomes will include a comparison of the contribution of individual cues under varying conditions and discussion of the implications for system optimization.

A physical model, built to investigate the aeroacoustic properties of voiced fricative speech, was used to study the amplitude modulation of the turbulence noise it generated. The amplitude and fundamental frequency of glottal vibration, relative positions of the constriction and obstacle, and the flow rate were varied. Measurements were made from pressure taps in the duct wall and the sound pressure at the open end. The high-pass filtered sound pressure was analyzed in terms of the magnitude and phase of the turbulence noise envelope. The magnitude and phase of the observed modulation was related to the upstream pressure. The effects of moving the obstacle with respect to the constriction are reported (representative of the teeth and the tongue in a sibilant fricative respectively). These results contribute to the development of a parametric model of the aeroacoustic interaction of voicing with turbulence noise generation in speech.

Audio systems and recordings are optimized for listening at the "sweet spot", but how well do they work elsewhere? An acoustic-perceptual model has been developed that simulates sound reproduction in a variety of formats, including mono, two-channel stereo, five-channel surround and wavefield synthesis. A virtual listener placed anywhere in the listening area is used to extract binaural signals, and hence interaural cues to the spatial attributes of the soundfield. Using subjectively-validated models of spatial sound perception, we can predict the way that human listeners would perceive these attributes, such as the direction (azimuth) and width of a phantom source. Results will be presented across the listening area for different source signals, sound scenes and reproduction systems, illustrating their spatial fidelity in perceptual terms. Future work investigates the effects of typical reproduction degradations.

The spatial quality of audio content delivery systems is becoming increasingly important as service providers attempt to deliver enhanced experiences of spatial immersion and naturalness in audio-visual applications. Examples are virtual reality, telepresence, home cinema, games and communications products. The QESTRAL project is developing an artificial listener that will compare the perceived quality of a spatial audio reproduction to a reference reproduction. The model is calibrated using data from listening tests, and utilises a range of metrics to predict the resulting spatial sound quality ratings. Potential application areas for the model are outlined, together with exemplary results obtained from some of its component parts.

The pitch-scaled harmonic filter (PSHF) is a technique for decomposing speech signals into their voiced and unvoiced constituents. In this paper, we evaluate its ability to reconstruct the time series of the two components accurately using a variety of synthetic, speech-like signals, and discuss its performance. These results determine the degree of confidence that can be expected for real speech signals: typically, 5 dB improvement in the harmonics-to-noise ratio (HNR) in the anharmonic component. A selection of the analysis opportunities that the decomposition offers is demonstrated on speech recordings, including dynamic HNR estimation and separate linear prediction analyses of the two components. These new capabilities provided by the PSHF can facilitate discovering previously hidden features and investigating interactions of unvoiced sources, such as friction, with voicing.

The two distinct sound sources comprising voiced frication, voicing and frication, interact. One effect is that the periodic source at the glottis modulates the amplitude of the frication source originating in the vocal tract above the constriction. Voicing strength and modulation depth for frication noise were measured for sustained English voiced fricatives using high-pass filtering, spectral analysis in the modulation (envelope) domain, and a variable pitch compensation procedure. Results show a positive relationship between strength of the glottal source and modulation depth at voicing strengths below 66 dB SPL, at which point the modulation index was approximately 0.5 and saturation occurred. The alveolar [z] was found to be more modulated than other fricatives.

Decomposition of speech signals into simultaneous streams of periodic and aperiodic information has been successfully applied to speech analysis, enhancement, modification and recently recognition. This paper examines the effect of different weightings of the two streams in a conventional HMM system in digit recognition tests on the Aurora 2.0 database. Comparison of the results from using matched weights during training showed a small improvement of approximately 10% relative to unmatched ones, under clean test conditions. Principal component analysis of the covariation amongst the periodic and aperiodic features indicated that only 45 (51) of the 78 coefficients were required to account for 99% of the variance, for clean (multi-condition) training, which yielded an 18.4% (10.3%) absolute increase in accuracy with respect to the baseline. These findings provide further evidence of the potential for harmonically-decomposed streams to improve performance and substantially to enhance recognition accuracy in noise.

For many audio applications, availability of recorded multi-channel room impulse responses (MC-RIRs) is fundamental. They enable development and testing of acoustic systems for reflective rooms. We present multiple MC-RIR datasets recorded in diverse rooms, using up to 60 loudspeaker positions and various uniform compact microphone arrays. These datasets complement existing RIR libraries and have dense spatial sampling of a listening position. To reveal the encapsulated spatial information, several state-of-the-art room visualization methods are presented. Results confirm the measurement fidelity and graphically depict the geometry of the recorded rooms. Further investigation of these recordings and visualization methods will facilitate object-based RIR encoding, integration of audio with other forms of spatial information, and meaningful extrapolation and manipulation of recorded compact microphone array RIRs.

Recent studies show that facial information contained in visual speech can be helpful for the performance enhancement of audio-only blind source separation (BSS) algorithms. Such information is exploited through the statistical characterization of the coherence between the audio and visual speech using, e.g., a Gaussian mixture model (GMM). In this paper, we present three contributions. With the synchronized features, we propose an adapted expectation maximization (AEM) algorithm to model the audio-visual coherence in the off-line training process. To improve the accuracy of this coherence model, we use a frame selection scheme to discard nonstationary features. Then with the coherence maximization technique, we develop a new sorting method to solve the permutation problem in the frequency domain. We test our algorithm on a multimodal speech database composed of different combinations of vowels and consonants. The experimental results show that our proposed algorithm outperforms traditional audio-only BSS, which confirms the benefit of using visual speech to assist in separation of the audio.

Boundary estimation from an acoustic room impulse response (RIR), exploiting known sound propagation behavior, yields useful information for various applications: e.g., source separation, simultaneous localization and mapping, and spatial audio. The baseline method, an algorithm proposed by Antonacci et al., uses reflection times of arrival (TOAs) to hypothesize reflector ellipses. Here, we modify the algorithm for 3-D environments and for enhanced noise robustness: DYPSA and MUSIC for epoch detection and direction of arrival (DOA) respectively are combined for source localization, and numerical search is adopted for reflector estimation. Both methods, and other variants, are tested on measured RIR data; the proposed method performs best, reducing the estimation error by 30%.
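
As a rough illustration of the reflector-ellipse idea in the baseline method, the following minimal sketch works in 2-D only and is not the authors' implementation. It assumes a known source and microphone position and a speed of sound of 343 m/s; the function name `reflector_ellipse` is hypothetical. A first-order reflection with a given TOA must lie on an ellipse whose foci are the source and the microphone and whose major axis equals the path length, and the reflecting wall is tangent to that ellipse.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, an assumed value


def reflector_ellipse(source, mic, toa, c=SPEED_OF_SOUND):
    """Return (centre, semi_major, semi_minor) of the 2-D ellipse of
    candidate reflection points for one reflection time of arrival."""
    focal_half = math.dist(source, mic) / 2.0   # half the distance between the foci
    semi_major = c * toa / 2.0                  # half the reflected path length
    if semi_major <= focal_half:
        raise ValueError("TOA must be later than the direct-path arrival")
    semi_minor = math.sqrt(semi_major ** 2 - focal_half ** 2)
    centre = tuple((s + m) / 2.0 for s, m in zip(source, mic))
    return centre, semi_major, semi_minor


# Example geometry: a wall along y = 1.5 m. The image source is then at (0, 3),
# so the reflected path from source to microphone is 5 m long.
source, mic = (0.0, 0.0), (4.0, 0.0)
toa = 5.0 / SPEED_OF_SOUND
centre, a, b = reflector_ellipse(source, mic, toa)
```

In this example the semi-minor axis comes out at 1.5 m, so the ellipse just touches the wall at its reflection point (2, 1.5). With several microphones, the wall would be estimated as the common tangent to all such ellipses.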

Object-based audio is gaining momentum as a means for future audio productions to be format-agnostic and interactive. Recent standardization developments make recommendations for object formats; however, the capture, production and reproduction of reverberation is an open issue. In this paper, we review approaches for recording, transmitting and rendering reverberation over a 3D spatial audio system. Techniques include channel-based approaches where room signals intended for a specific reproduction layout are transmitted, and synthetic reverberators where the room effect is constructed at the renderer. We consider how each approach translates into an object-based context considering the end-to-end production chain of capture, representation, editing, and rendering. We discuss some application examples to highlight the implications of the various approaches.

When a noise process is modulated by a deterministic signal, it is often useful to determine the signal's parameters. A method of estimating the modulation index m is presented for noise whose amplitude is modulated by a periodic signal, using the magnitude modulation spectrum (MMS). The method is developed for application to real discrete signals with time-varying parameters, and extended to a 3D time-frequency-modulation representation. In contrast to squared-signal approaches, MMS behaves linearly with the modulating function allowing separate estimation of m for each harmonic. Simulations evaluate performance on synthetic signals, compared with theory, favouring a first-order MMS estimator.
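
The sketch below shows one plausible reading of a first-order estimator of this kind; it is an assumption for illustration, not the estimator defined in the paper. For noise n(t) modulated as x(t) = (1 + m cos(2πf0t)) n(t) with m < 1, the magnitude |x(t)| has a mean proportional to E|n| and a component at f0 whose amplitude is m times that mean, so m can be approximated by twice the ratio of the f0 bin to the DC bin of the spectrum of |x|. The names `modulation_index` and `am_noise`, the sample rate and the modulation frequency are all hypothetical.

```python
import cmath
import math
import random


def modulation_index(x, fs, f0):
    """Estimate m as 2 * |M(f0)| / |M(0)|, where M is the spectrum of |x|."""
    n = len(x)
    magnitude = [abs(v) for v in x]
    dc = sum(magnitude) / n                      # M(0): mean of the magnitude signal
    step = -2j * math.pi * f0 / fs               # phase increment per sample at f0
    bin_f0 = sum(a * cmath.exp(step * i) for i, a in enumerate(magnitude)) / n
    return 2.0 * abs(bin_f0) / dc


def am_noise(m, fs, f0, seconds, seed=0):
    """White Gaussian noise whose amplitude is modulated by 1 + m*cos(2*pi*f0*t)."""
    rng = random.Random(seed)
    n = int(fs * seconds)
    return [(1.0 + m * math.cos(2.0 * math.pi * f0 * i / fs)) * rng.gauss(0.0, 1.0)
            for i in range(n)]


fs, f0 = 8000, 125                               # f0 falls on an exact DFT bin here
m_hat = modulation_index(am_noise(0.5, fs, f0, 4.0), fs, f0)
m_hat_unmodulated = modulation_index(am_noise(0.0, fs, f0, 4.0), fs, f0)
```

Because the magnitude is linear in the modulating function, the same ratio could be evaluated at 2f0, 3f0 and so on to examine each harmonic separately, which is the property the abstract contrasts with squared-signal approaches.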

In this paper we describe a parameterisation of lip movements which maintains the dynamic structure inherent in the task of producing speech sounds. A stereo capture system is used to reconstruct 3D models of a speaker producing sentences from the TIMIT corpus. This data is mapped into a space which maintains the relationships between samples and their temporal derivatives. By incorporating dynamic information within the parameterisation of lip movements we can model the cyclical structure, as well as the causal nature of speech movements as described by an underlying visual speech manifold. It is believed that such a structure will be appropriate to various areas of speech modeling, in particular the synthesis of speech lip movements.

In this paper, a novel probabilistic Bayesian tracking scheme is proposed and applied to bimodal measurements consisting of tracking results from the depth sensor and audio recordings collected using binaural microphones. We use random finite sets to cope with a varying number of tracking targets. A measurement-driven birth process is integrated to quickly localize any emerging person. A new bimodal fusion method that prioritizes the most confident modality is employed. The approach was tested on real room recordings and experimental results show that the proposed combination of audio and depth outperforms individual modalities, particularly when there are multiple people talking simultaneously and when occlusions are frequent.

Acoustic event detection for content analysis in most cases relies on large amounts of labeled data. However, manually annotating data is time-consuming, so few annotated resources are available so far. Unlike audio event detection, automatic audio tagging, a multi-label acoustic event classification task, relies only on weakly labeled data. This is highly desirable for some practical applications of audio analysis. In this paper we propose to use a fully deep neural network (DNN) framework to handle the multi-label classification task as a regression problem. Considering that only chunk-level rather than frame-level labels are available, all or almost all frames of the chunk were fed into the DNN to perform a multi-label regression for the expected tags. The fully DNN, which is regarded as an encoding function, can map the sequence of audio features to a multi-tag vector. A deep pyramid structure was also designed to extract more robust high-level features related to the target tags. Further improvements, such as dropout and background-noise-aware training, were adopted to enhance its generalization capability for new audio recordings in mismatched environments. Compared with the conventional Gaussian mixture model (GMM) and support vector machine (SVM) methods, the proposed fully DNN-based method could make good use of long-term temporal information with the whole chunk as the input. The results show that our approach obtained a 15% relative improvement compared with the official GMM-based method of the DCASE 2016 challenge.

This paper explores the recognition of expressed emotion from speech and facial gestures for the speaker-dependent case. Experiments were performed on an English audio-visual emotional database consisting of 480 utterances from 4 English male actors in 7 emotions. A total of 106 audio and 240 visual features were extracted and features were selected with Plus l-Take Away r algorithm based on Bhattacharyya distance criterion. Linear transformation methods, principal component analysis (PCA) and linear discriminant analysis (LDA), were applied to the selected features and Gaussian classifiers were used for classification. The performance was higher for LDA features compared to PCA features. The visual features performed better than the audio features and overall performance improved for the audio-visual features. In case of 7 emotion classes, an average recognition rate of 56% was achieved with the audio features, 95% with the visual features and 98% with the audio-visual features selected by Bhattacharyya distance and transformed by LDA. Grouping emotions into 4 classes, an average recognition rate of 69% was achieved with the audio features, 98% with the visual features and 98% with the audio-visual features fused at decision level. The results were comparable to the measured human recognition rate with this multimodal data set.
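
To make the selection criterion concrete, here is a minimal sketch of the Bhattacharyya distance between two univariate Gaussian class models, used to rank single features by how well they separate two classes. It covers only the distance itself, under the assumption of one feature and two classes at a time; it does not reproduce the Plus l-Take Away r search or the multi-class setup of the paper, and the feature names and statistics in the example are invented.

```python
import math


def bhattacharyya_gauss_1d(mu1, var1, mu2, var2):
    """Bhattacharyya distance between N(mu1, var1) and N(mu2, var2)."""
    mean_term = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    var_term = 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2)))
    return mean_term + var_term


# Hypothetical per-class statistics (mean, variance) for three features.
class_a = {"pitch_mean": (0.0, 1.0), "energy": (0.0, 1.0), "lip_width": (0.0, 1.0)}
class_b = {"pitch_mean": (0.5, 1.0), "energy": (2.0, 1.0), "lip_width": (0.0, 4.0)}

distances = {
    name: bhattacharyya_gauss_1d(*class_a[name], *class_b[name])
    for name in class_a
}
ranking = sorted(distances, key=distances.get, reverse=True)
```

A larger distance indicates less overlap between the two class distributions for that feature, so features at the front of `ranking` are the more discriminative candidates.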

The QESTRAL model is a perceptual model that aims to predict changes to spatial quality of service between a reference system and an impaired version of the reference system. To achieve this, the model required calibration using perceptual data from human listeners. This paper describes the development, implementation and outcomes of a series of listening experiments designed to investigate the spatial quality impairment of 40 processes. Assessments were made using a multi-stimulus test paradigm with a label-free scale, where only the scale polarity is indicated. The tests were performed at two listening positions, using experienced listeners. Results from these calibration experiments are presented. A preliminary study on the process of selecting stimuli is also discussed.

We describe a method for the synthesis of visual speech movements using a hybrid unit selection/model-based approach. Speech lip movements are captured using a 3D stereo face capture system and split up into phonetic units. A dynamic parameterisation of this data is constructed which maintains the relationship between lip shapes and velocities; within this parameterisation a model of how lips move is built and is used in the animation of visual speech movements from speech audio input. The mapping from audio parameters to lip movements is disambiguated by selecting only the most similar stored phonetic units to the target utterance during synthesis. By combining properties of model-based synthesis (e.g., HMMs, neural nets) with unit selection we improve the quality of our speech synthesis.

A compact, data-driven statistical model for identifying roles played by articulators in production of English phones using 1D and 2D articulatory data is presented. Articulators critical in production of each phone were identified and were used to predict the pdfs of dependent articulators based on the strength of articulatory correlations. The performance of the model is evaluated on the MOCHA database using proposed and exhaustive search techniques, and the results of synthesised trajectories are presented.

Since the mid 1990s, acoustics research has been undertaken relating to the sound zone problem (using loudspeakers to deliver a region of high sound pressure while simultaneously creating an area where the sound is suppressed) in order to facilitate independent listening within the same acoustic enclosure. The published solutions to the sound zone problem are derived from areas such as wave field synthesis and beamforming. However, the properties of such methods differ and performance tends to be compared against similar approaches. In this study, the suitability of energy focusing, energy cancelation, and synthesis approaches for sound zone reproduction is investigated. Anechoic simulations based on two zones surrounded by a circular array show each of the methods to have a characteristic performance, quantified in terms of acoustic contrast, array control effort and target sound field planarity. Regularization is shown to have a significant effect on the array effort and achieved acoustic contrast, particularly when mismatched conditions are considered between calculation of the source weights and their application to the system.
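
For readers unfamiliar with the evaluation measures, the following minimal sketch computes acoustic contrast and control effort for a given set of source weights under anechoic (free-field monopole) conditions. The definitions used here are common ones and are assumed rather than taken from the paper: contrast is the ratio of mean squared pressure in the bright zone to that in the dark zone, and effort is the total squared source weight relative to a single reference source. All function names and the example geometry are hypothetical.

```python
import cmath
import math


def green(src, mic, k):
    """Free-field monopole transfer function between two points."""
    r = math.dist(src, mic)
    return cmath.exp(-1j * k * r) / (4.0 * math.pi * r)


def pressures(sources, weights, mics, k):
    """Complex pressure at each microphone for the given source weights."""
    return [sum(q * green(s, m, k) for s, q in zip(sources, weights)) for m in mics]


def mean_energy(p):
    return sum(abs(v) ** 2 for v in p) / len(p)


def acoustic_contrast_db(sources, weights, bright, dark, k):
    """Ratio of mean squared pressure in the bright zone to the dark zone, in dB."""
    e_bright = mean_energy(pressures(sources, weights, bright, k))
    e_dark = mean_energy(pressures(sources, weights, dark, k))
    return 10.0 * math.log10(e_bright / e_dark)


def control_effort_db(weights, q_ref=1.0):
    """Total squared source weight relative to a reference source, in dB."""
    return 10.0 * math.log10(sum(abs(q) ** 2 for q in weights) / abs(q_ref) ** 2)


k = 2.0 * math.pi * 1000.0 / 343.0               # wavenumber at 1 kHz, c = 343 m/s assumed
sources, weights = [(0.0, 0.0)], [1.0 + 0.0j]
bright, dark = [(1.0, 0.0)], [(2.0, 0.0)]        # one control point per zone
contrast = acoustic_contrast_db(sources, weights, bright, dark, k)
effort = control_effort_db(weights)
```

With a single source, the contrast here is simply the 6 dB distance attenuation between 1 m and 2 m. Sound zone methods choose the weights of many sources so that the contrast is far larger, usually at the cost of higher effort.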

This paper describes a computational model for the prediction of perceived spatial quality for reproduced sound at arbitrary locations in the listening area. The model is specifically designed to evaluate distortions in the spatial domain such as changes in source location, width and envelopment. Maps of perceived spatial quality across the listening area are presented from our initial results.

It is of interest to create regions of increased and reduced sound pressure ('sound zones') in an enclosure such that different audio programs can be simultaneously delivered over loudspeakers, thus allowing listeners sharing a space to receive independent audio without physical barriers or headphones. Where previous comparisons of sound zoning techniques exist, they have been conducted under favorable acoustic conditions, utilizing simulations based on theoretical transfer functions or anechoic measurements. Outside of these highly specified and controlled environments, real-world factors including reflections, measurement errors, matrix conditioning and practical filter design degrade the realizable performance. This study compares the performance of sound zoning techniques when applied to create two sound zones in simulated and real acoustic environments. In order to compare multiple methods in a common framework without unduly hindering performance, an optimization procedure for each method is first used to select the best loudspeaker positions in terms of robustness, efficiency and the acoustic contrast deliverable to both zones. The characteristics of each control technique are then studied, noting the contrast and the impact of acoustic conditions on performance.

Studies on sound field control methods able to create independent listening zones in a single acoustic space have recently been undertaken due to the potential of such methods for various practical applications, such as individual audio streams in home entertainment. Existing solutions to the problem have been shown to be effective in creating high and low sound energy regions under anechoic conditions. Although some case studies in a reflective environment can also be found, the capabilities of sound zoning methods in rooms have not been fully explored. In this paper, the influence of low-order (early) reflections on the performance of key sound zone techniques is examined. Analytic considerations for small-scale systems reveal strong dependence of performance on parameters such as source positioning with respect to zone locations and room surfaces, as well as the parameters of the receiver configuration. These dependencies are further investigated through numerical simulation to determine system configurations which maximize the performance in terms of acoustic contrast and array control effort. Design rules for source and receiver positioning are suggested, for improved performance under a given set of constraints such as the number of available sources, zone locations, and the direction of the dominant reflection.

In existing audio-visual blind source separation (AV-BSS) algorithms, the AV coherence is usually established through statistical modelling, using e.g. Gaussian mixture models (GMMs). These methods often operate in a low-dimensional feature space, rendering an effective global representation of the data. The local information, which is important in capturing the temporal structure of the data, however, has not been explicitly exploited. In this paper, we propose a new method for capturing such local information, based on audio-visual dictionary learning (AVDL). We address several challenges associated with AVDL, including cross-modality differences in size, dimension and sampling rate, as well as the issues of scalability and computational complexity. Following a commonly employed bootstrap coding-learning process, we have developed a new AVDL algorithm which features a bimodality-balanced and scalable matching criterion, a size- and dimension-adaptive dictionary, a fast search index for efficient coding, and cross-modality diverse sparsity. We also show how the proposed AVDL can be incorporated into a BSS algorithm. As an example, we consider binaural mixtures, mimicking aspects of human binaural hearing, and derive a new noise-robust AV-BSS algorithm by combining the proposed AVDL algorithm with Mandel's BSS method, which is a state-of-the-art audio-domain method using time-frequency masking. We have systematically evaluated the proposed AVDL and AV-BSS algorithms, and show their advantages over the corresponding baseline methods, using both synthetic data and visual speech data from the multimodal LILiR Twotalk corpus.

A simple multiple-level HMM is presented in which speech dynamics are modelled as linear trajectories in an intermediate, formant-based representation and the mapping between the intermediate and acoustic data is achieved using one or more linear transformations. An upper-bound on the performance of such a system is established. Experimental results on the TIMIT corpus demonstrate that, if the dimension of the intermediate space is sufficiently high or the number of articulatory-to-acoustic mappings is sufficiently large, then this upper-bound can be achieved.

Most of the binaural source separation algorithms only consider the dissimilarities between the recorded mixtures such as interaural phase and level differences (IPD, ILD) to classify and assign the time-frequency (T-F) regions of the mixture spectrograms to each source. However, in this paper we show that the coherence between the left and right recordings can provide extra information to label the T-F units from the sources. This also reduces the effect of reverberation which contains random reflections from different directions showing low correlation between the sensors. Our algorithm assigns the T-F regions into original sources based on weighted combination of IPD, ILD, the observation vectors models and the estimated interaural coherence (IC) between the left and right recordings. The binaural room impulse responses measured in four rooms with various acoustic conditions have been used to evaluate the performance of the proposed method which shows an improvement of more than 1.4 dB in signal-to-distortion ratio (SDR) in room D with T60 = 0.89 s over the state-of-the-art algorithms.
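
The role of coherence as a cue can be illustrated with a small sketch. It uses one common definition of interaural coherence, the magnitude of the normalized cross-spectrum averaged over a block of time-frequency coefficients, which is assumed here and may differ from the estimator in the paper. The function name and the example coefficients are hypothetical.

```python
import cmath
import math


def interaural_coherence(left, right):
    """Magnitude of the normalized cross-spectrum over a block of
    complex time-frequency coefficients from the left and right ears."""
    cross = sum(l * r.conjugate() for l, r in zip(left, right))
    power_left = sum(abs(l) ** 2 for l in left)
    power_right = sum(abs(r) ** 2 for r in right)
    return abs(cross) / math.sqrt(power_left * power_right)


left = [1 + 0j, 1j, -1 + 0j, -1j]

# Direct sound: the right ear receives a scaled, phase-shifted copy of the left.
right_direct = [0.5 * cmath.exp(-0.7j) * v for v in left]

# Diffuse-like case: the phase relation between the ears changes from frame to frame.
right_diffuse = [1 + 0j, -1j, -1 + 0j, 1j]

ic_direct = interaural_coherence(left, right_direct)
ic_diffuse = interaural_coherence(left, right_diffuse)
```

A fixed level and phase difference between the ears, as produced by a single direct source, gives a coherence of one regardless of the ILD and IPD values, whereas inconsistent phase relations of the kind produced by reverberation pull the value towards zero. This is why coherence can down-weight reverberant T-F units.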

A statistical technique for identifying critical, dependent and redundant articulators in English phones was applied to 1D and 2D distributions of articulatograph coordinates. Results compared well with phonetic descriptions from the IPA chart with some interesting findings for fricatives and alveolar stops. An extension of the method is discussed.

A multiplanar Dynamic Magnetic Resonance Imaging (MRI) technique that extends our earlier work on single-plane Dynamic MRI is described. Scanned images acquired while an utterance is repeated are recombined to form pseudo-time-varying images of the vocal tract using a simultaneously recorded audio signal. There is no technical limit on the utterance length or number of slices that can be so imaged, though the number of repetitions required may be limited by the subject's stamina. An example of [pasi] imaged in three sagittal planes is shown; with a Signa GE 0.5T MR scanner, 360 tokens were reconstructed to form a sequence of 39 3-slice 16ms frames. From these, a 3-D volume was generated for each time frame, and tract surfaces outlined manually. Parameters derived from these include: palate-tongue distances for [a,s,i]; estimates of tongue volume and of the area function using only the midsagittal, and then all three slices. These demonstrate the accuracy and usefulness of the technique.

Sound zone systems aim to produce regions within a room where listeners may consume separate audio programs with minimal acoustical interference. Often, there is a trade-off between the acoustic contrast achieved between the zones, and the fidelity of the reproduced audio program (the target quality). An open question is whether reducing contrast (i.e. allowing greater interference) can improve target quality. The planarity control sound zoning method can be used to improve spatial reproduction, though at the expense of decreased contrast. Hence, this can be used to investigate the relationship between target quality (which is affected by the spatial presentation) and distraction (which is related to the perceived effect of interference). An experiment was conducted investigating target quality and distraction, and examining their relationship with overall quality within sound zones. Sound zones were reproduced using acoustic contrast control, planarity control and pressure matching applied to a circular loudspeaker array. Overall quality was related to target quality and distraction, each having a similar magnitude of effect; however, the result was dependent upon program combination. The highest mean overall quality was a compromise between distraction and target quality, with energy arriving from up to 15 degrees either side of the target direction.

Recent studies show that visual information contained in visual speech can be helpful for the performance enhancement of audio-only blind source separation (BSS) algorithms. Such information is exploited through the statistical characterisation of the coherence between the audio and visual speech using, e.g. a Gaussian mixture model (GMM). In this paper, we present two new contributions. An adapted expectation maximization (AEM) algorithm is proposed in the training process to model the audio-visual coherence upon the extracted features. The coherence is exploited to solve the permutation problem in the frequency domain using a new sorting scheme. We test our algorithm on the XM2VTS multimodal database. The experimental results show that our proposed algorithm outperforms traditional audio-only BSS.

Human communication is based on verbal and nonverbal information, e.g., facial expressions and intonation cue the speaker's emotional state. Important speech features for emotion recognition are prosody (pitch, energy and duration) and voice quality (spectral energy, formants, MFCCs, jitter/shimmer). For facial expressions, features related to forehead, eye region, cheek and lip are important. Both audio and visual modalities provide relevant cues. Thus, audio and visual features were extracted and combined to evaluate emotion recognition on a British English corpus. The database of 120 utterances was recorded from an actor with 60 markers painted on his face, reading sentences in seven emotions (N=7): anger, disgust, fear, happiness, neutral, sadness and surprise. Recordings consisted of 15 phonetically-balanced TIMIT sentences per emotion, and video of the face captured by a 3dMD system. A total of 106 utterance-level audio features (prosodic and spectral) and 240 visual features (2D marker coordinates) were extracted. Experiments were performed with audio, visual and audiovisual features. The top 40 features were selected by sequential forward backward search using Bhattacharyya distance criterion. PCA and LDA transformations, calculated on the training data, were applied. Gaussian classifiers were trained with PCA and LDA features. Data was jack-knifed with 5 sets for training and 1 set for testing. Results were averaged over 6 tests. The emotion recognition accuracy was higher for visual features than audio features, for both PCA and LDA. Audiovisual results were close to those with visual features. Higher performance was achieved with LDA compared to PCA. The best recognition rate, 98%, was achieved for 6 LDA features (N-1) with audiovisual and visual features, whereas audio LDA scored 53%. Maximum PCA results for audio, visual and audiovisual features were 41%, 97% and 88% respectively. Future work involves experiments with more subjects and investigating the correlation between vocal and facial expressions of emotion.

An objective prediction model for the sensation of sound envelopment in five-channel reproduction is important for evaluating spatial quality. Regression analysis was used to map between the listening test scores on a variety of audio sources and the objective measures extracted from the recordings themselves. By following an iterative process, a prediction model with five features was constructed. The validity of the model was tested on a second set of subjective scores and showed a correlation coefficient of 0.9. Among the five features, sound distribution and interaural cross-correlation contributed substantially to the sensation of envelopment. The model did not require access to the original audio. Scales used for listening tests were defined by audible anchors.

Audio from broadcast soccer can be used for identifying highlights from the game. We can assume that the basic construction of the auditory scene consists of two additive parallel audio streams, one relating to commentator speech and the other relating to audio captured from the ground level microphones. Audio cues derived from these sources provide valuable information about game events, as can the detection of key words used by the commentators, which are useful for identifying highlights. We investigate word recognition in a connected digit experiment providing additive noise that is present in broadcast soccer audio. A limited set of background soccer noises, extracted from the FIFA World Cup 2006 recordings, were used to create an extension to the Aurora-2 database. The extended data set was tested with various HMM and parallel model combination (PMC) configurations, and compared to the standard baseline, with clean and multi-condition training methods. It was found that incorporating SNR and noise type information into the PMC process was beneficial to recognition performance with a reduction in word error rate from 17.5% to 16.3% over the next best scheme when using the SNR information. Future work will look at non-stationary soccer noise types and multiple-state noise models.

The QESTRAL project aims to develop an artificial listener for comparing the perceived quality of a spatial audio reproduction against a reference reproduction. This paper presents implementation details for simulating the acoustics of the listening environment and the listener's auditory processing. Acoustical modeling is used to calculate binaural signals and simulated microphone signals at the listening position, from which a number of metrics corresponding to different perceived spatial aspects of the reproduced sound field are calculated. These metrics are designed to describe attributes associated with the location, width and envelopment of a spatial sound scene. Each provides a measure of the perceived spatial quality of the impaired reproduction compared to the reference reproduction. As validation, individual metrics from listening test signals are shown to match closely the subjective results obtained, and can be used to predict spatial quality for arbitrary signals.

Techniques such as multi-point optimization, wave field synthesis and ambisonics attempt to create spatial effects by synthesizing a sound field over a listening region. In this paper, we propose planarity panning, which uses superdirective microphone array beamforming to focus the sound from the specified direction, as an alternative approach. Simulations compare performance against existing strategies, considering the cases where the listener is central and non-central in relation to a 60 channel circular loudspeaker array. Planarity panning requires low control effort and provides high sound field planarity over a large frequency range, when the zone positions match the target regions specified for the filter calculations. Future work should implement and validate the perceptual properties of the method.

We present an analysis of linear feature extraction techniques to derive a compact and meaningful representation of the articulatory data. We used 14-channel EMA (ElectroMagnetic Articulograph) data from two speakers from the MOCHA database [A.A. Wrench. A new resource for production modelling in speech technology. In Proc. Inst. of Acoust., Stratford-upon-Avon, UK, 2001.]. As representations, we considered the registered articulator fleshpoint coordinates, transformed PCA (Principal Component Analysis) and LDA (Linear Discriminant Analysis) features. Various PCA schemes were considered, grouping coordinates according to correlations amongst the articulators. For each phone, critical dimensions were identified using the algorithm in [Veena D Singampalli and Philip JB Jackson. Statistical identification of critical, dependent and redundant articulators. In Proc. Interspeech, Antwerp, Belgium, pages 70-73, 2007.]: critical articulators with registered coordinates, and critical modes with PCA and LDA. The phone distributions in each representation were modelled as univariate Gaussians and the average number of critical dimensions was controlled using a threshold on the 1-D Kullback Leibler divergence (identification divergence). The 14-D KL divergence (evaluation divergence) was employed to measure goodness of fit of the models to estimated phone distributions. Phone recognition experiments were performed using coordinate, PCA and LDA features, for comparison. We found that, of all representations, the LDA space yielded the best fit between the model and phone pdfs. The full PCA representation (including all articulatory coordinates) gave the next best fit, closely followed by two other PCA representations that allowed for correlations across the tongue. 
At the threshold where average number of critical dimensions matched those obtained from IPA, the goodness of fit improved by 34% (22%/46% for male/female data) when LDA was used over the best PCA representation, and by 72% (77%/66%) over articulatory coordinates. For PCA and LDA, the compactness of the representation was investigated by discarding the least significant modes. No significant change in the recognition performance was found as the dimensionality was reduced from 14 to 8 (95% confidence t-test), although accuracy deteriorated as further modes were discarded. Evaluation divergence also reflected this pattern. Experiments on LDA features increased recognition accuracy by 2% on average over the bes

For subjective experimentation on 3D audio systems, suitable programme material is needed. A large-scale recording session was performed in which four ensembles were recorded with a range of existing microphone techniques (aimed at mono, stereo, 5.0, 9.0, 22.0, ambisonic, and headphone reproduction) and a novel 48-channel circular microphone array. Further material was produced by remixing and augmenting pre-existing multichannel content. To mix and monitor the programme items (which included classical, jazz, pop and experimental music, and excerpts from a sports broadcast and a film soundtrack), a flexible 3D audio reproduction environment was created. Solutions to the following challenges were found: level calibration for different reproduction formats; bass management; and adaptable signal routing from different software and file formats.

The QESTRAL project has developed an artificial listener that compares the perceived quality of a spatial audio reproduction to a reference reproduction. Test signals designed to identify distortions in both the foreground and background audio streams are created for both the reference and the impaired reproduction systems. Metrics are calculated from these test signals and are then combined using a regression model to give a measure of the overall perceived spatial quality of the impaired reproduction compared to the reference reproduction. The results of the model are shown to match closely the results obtained in listening tests. Consequently, the model can be used as an alternative to listening tests when evaluating the perceived spatial quality of a given reproduction system, thus saving time and expense.

Probabilistic models of binaural cues, such as the interaural phase difference (IPD) and the interaural level difference (ILD), can be used to obtain the audio mask in the time-frequency (TF) domain, for source separation of binaural mixtures. Those models are, however, often degraded by acoustic noise. In contrast, the video stream contains relevant information about the synchronous audio stream that is not affected by acoustic noise. In this paper, we present a novel method for modeling the audio-visual (AV) coherence based on dictionary learning. A visual mask is constructed from the video signal based on the learnt AV dictionary, and incorporated with the audio mask to obtain a noise-robust audio-visual mask, which is then applied to the binaural signal for source separation. We tested our algorithm on the XM2VTS database, and observed considerable performance improvement for noise corrupted signals.

This paper presents a new method for reverberant speech separation, based on the combination of binaural cues and blind source separation (BSS) for the automatic classification of the time-frequency (T-F) units of the speech mixture spectrogram. The main idea is to model interaural phase difference, interaural level difference and frequency bin-wise mixing vectors by Gaussian mixture models for each source, and then evaluate that model at each T-F point and assign the units with high probability to that source. The model parameters and the assigned regions are refined iteratively using the Expectation-Maximization (EM) algorithm. The proposed method also addresses the permutation problem of the frequency domain BSS by initializing the mixing vectors for each frequency channel. The EM algorithm starts with binaural cues and after a few iterations the estimated probabilistic mask is used to initialize and re-estimate the mixing vector model parameters. We performed experiments on speech mixtures, and showed an average of about 0.8 dB improvement in signal-to-distortion ratio (SDR) over the binaural-only baseline.
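
As a rough illustration of the mask-assignment step in this abstract, the sketch below computes, for a single T-F unit, the posterior probability that each source produced an observed cue value. It is a simplification under stated assumptions: one scalar cue (labelled here as ILD in dB) with one Gaussian per source, rather than the joint model of phase difference, level difference and mixing vectors described above. The function names and all parameter values are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior_mask(cue, priors, means, stds):
    """E-step for one T-F unit: probability that each source produced `cue`."""
    weighted = [p * gaussian_pdf(cue, m, s) for p, m, s in zip(priors, means, stds)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical two-source example: source 0 centred at -6 dB, source 1 at +6 dB.
mask = posterior_mask(cue=4.0, priors=[0.5, 0.5], means=[-6.0, 6.0], stds=[3.0, 3.0])
```

In a full EM loop these posteriors would be recomputed for every T-F unit after each update of the priors, means and standard deviations; only the per-unit E-step is shown here.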

We present a framework for speech-driven synthesis of real faces from a corpus of 3D video of a person speaking. Video-rate capture of dynamic 3D face shape and colour appearance provides the basis for a visual speech synthesis model. A displacement map representation combines face shape and colour into a 3D video. This representation is used to efficiently register and integrate shape and colour information captured from multiple views. To allow visual speech synthesis, viseme primitives are identified from the corpus using automatic speech recognition. A novel nonrigid alignment algorithm is introduced to estimate dense correspondence between 3D face shape and appearance for different visemes. The registered displacement map representation, together with a novel optical flow optimisation using both shape and colour, enables accurate and efficient nonrigid alignment. Face synthesis from speech is performed by concatenation of the corresponding viseme sequence using the nonrigid correspondence to reproduce both 3D face shape and colour appearance. Concatenative synthesis reproduces both viseme timing and co-articulation. Face capture and synthesis has been performed for a database of 51 people. Results demonstrate synthesis of 3D visual speech animation with a quality comparable to the captured video of a person.

Object-based audio is gaining momentum as a means for future audio content to be more immersive,
interactive, and accessible. Recent standardization developments make recommendations for object
formats; however, the capture, production and reproduction of reverberation remain an open issue. In this
paper, parametric approaches for capturing, representing, editing, and rendering reverberation over a
3D spatial audio system are reviewed. A framework is proposed for a Reverberant Spatial Audio Object
(RSAO), which synthesizes reverberation inside an audio object renderer. An implementation example
of an object scheme utilising the RSAO framework is provided, and supported with listening test
results, showing that: the approach correctly retains the sense of room size compared to a convolved
reference; editing RSAO parameters can alter the perceived room size and source distance; and,
format-agnostic rendering can be exploited to alter listener envelopment.

This paper describes the development of an unintrusive objective model, developed independently as a part of the QESTRAL project, for predicting the sensation of envelopment arising from commercially available 5-channel surround sound recordings. The model was calibrated using subjective scores obtained from listening tests that used a grading scale defined by audible anchors. For predicting subjective scores, a number of features based on Interaural Cross Correlation (IACC), Karhunen-Loeve Transform (KLT) and signal energy levels were extracted from recordings. The ridge regression technique was used to build the objective model and a calibrated model was validated using a listening test scores database obtained from a different group of listeners, stimuli and location. The initial results showed a high correlation between predicted and actual scores obtained from the listening tests.

The independent vector analysis (IVA) algorithm employs a multivariate source prior to retain the dependency between different frequency bins of each source and thereby avoid the permutation problem that is inherent to blind source separation (BSS). In this paper, a multivariate Student's t distribution is adopted as the source prior, which because of its heavy tail nature can better model the large amplitude information in the frequency bins. Therefore it can improve the separation performance and the convergence speed of the IVA and fast version of the IVA (FastIVA) algorithms as compared with the original IVA algorithm based on another multivariate super-Gaussian source prior. Separation performance with real binaural room impulse responses (BRIRs) is evaluated by detailed simulation studies when using the different source priors, and the experimental results confirm that the IVA and the FastIVA with the proposed multivariate Student's t source prior can consistently achieve improved and faster separation performance.

A decomposition algorithm that uses a pitch-scaled harmonic filter was evaluated using synthetic signals and applied to mixed-source speech, spoken by three subjects, to separate the voiced and unvoiced parts. Pulsing of the noise component was observed in voiced frication, which was analyzed by complex demodulation of the signal envelope. The timing of the pulsation, represented by the phase of the anharmonic modulation coefficient, showed a step change during a vowel-fricative transition corresponding to the change in location of the sound source within the vocal tract. Analysis of fricatives /β, v, ð, z, ʒ, ɣ, ʕ/ demonstrated a relationship between steady-state phase and place, and f0 glides confirmed that the main cause was a place-dependent delay.

This thesis is a study of the production of human speech sounds by acoustic modelling and signal analysis. It concentrates on sounds that are not produced by voicing (although that may be present), namely plosives, fricatives and aspiration, which all contain noise generated by flow turbulence. It combines the application of advanced speech analysis techniques with acoustic flow-duct modelling of the vocal tract, and draws on dynamic magnetic resonance image (dMRI) data of the pharyngeal and oral cavities, to relate the sounds to physical shapes.

Having superimposed vocal-tract outlines on three sagittal dMRI slices of an adult male subject, a simple description of the vocal tract suitable for acoustic modelling was derived through a sequence of transformations. The vocal-tract acoustics program VOAC, which relaxes many of the assumptions of conventional plane-wave models, incorporates the effects of net flow into a one-dimensional model (viz., flow separation, increase of entropy, and changes to resonances), as well as wall vibration and cylindrical wavefronts. It was used for synthesis by computing transfer functions from sound sources specified within the tract to the far field.

Being generated by a variety of aero-acoustic mechanisms, unvoiced sounds are somewhat varied in nature. Through analysis that was informed by acoustic modelling, resonance and anti-resonance frequencies of ensemble-averaged plosive spectra were examined for the same subject, and their trajectories observed during release. The anti-resonance frequencies were used to compute the place of occlusion.

In vowels and voiced fricatives, voicing obscures the aspiration and frication components. So, a method was devised to separate the voiced and unvoiced parts of a speech signal, the pitch-scaled harmonic filter (PSHF), which was tested extensively on synthetic signals. Based on a harmonic model of voicing, it outputs harmonic and anharmonic signals appropriate for subsequent analysis as time series or as power spectra. By applying the PSHF to sustained voiced fricatives, we found that, not only does voicing modulate the production of frication noise, but that the timing of pulsation cannot be explained by acoustic propagation alone.

In addition to classical investigation of voiceless speech sounds, VOAC and the PSHF demonstrated their practical value in helping further to characterise plosion, frication and aspiration noise. For the future, we discuss developing VOAC within an arti

Motion capture (mocap) is widely used in a large number of industrial applications. Our work offers a new way of representing the mocap facial dynamics in a high resolution 3D morphable model expression space. A data-driven approach to modelling of facial dynamics is presented. We propose a way to combine high quality static face scans with dynamic 3D mocap data which has lower spatial resolution in order to study the dynamics of facial expressions.

Most current perceptual models for audio quality have so far tended to concentrate on the audibility of distortions and noises that mainly affect the timbre of reproduced sound. The QESTRAL model, however, is specifically designed to take account of distortions in the spatial domain such as changes in source location, width and envelopment. It is not aimed only at codec quality evaluation but at a wider range of spatial distortions that can arise in audio processing and reproduction systems. The model has been calibrated against a large database of listening tests designed to evaluate typical audio processes, comparing spatially degraded multichannel audio material against a reference. Using a range of relevant metrics and a sophisticated multivariate regression model, results are obtained that closely match those obtained in listening tests.

In this paper we describe a method for the synthesis of visual speech movements using a hybrid unit selection/model-based approach. Speech lip movements are captured using a 3D stereo face capture system, and split up into phonetic units. A dynamic parameterisation of this data is constructed which maintains the relationship between lip shapes and velocities; within this parameterisation a model of how lips move is built and is used in the animation of visual speech movements from speech audio input. The mapping from audio parameters to lip movements is disambiguated by selecting only the most similar stored phonetic units to the target utterance during synthesis. By combining properties of model-based synthesis (e.g. HMMs, neural nets) with unit selection we improve the quality of our speech synthesis.

The aperiodic noise source in fricatives is characteristically amplitude modulated by voicing. Previous psychoacoustic studies have established that observed levels of AM in voiced fricatives are detectable, and its inclusion in synthesis has improved speech quality. Phonological voicing in fricatives can be cued by a number of factors: the voicing fundamental, duration of any devoicing, duration of frication, and formant transitions. However, the possible contribution of AM has not been investigated. In a cue trading experiment, subjects distinguished between the nonsense words 'ahser' and 'ahzer'. The voicing boundary was measured along a formant-transition duration continuum, as a function of AM depth, voicing amplitude and masking of the voicing component by low-frequency noise. The presence of AM increased voiced responses by approximately 30%. The ability of AM to cue voicing was strongest at greater modulation depths and when voicing was unavailable as a cue, as might occur in telecommunication systems or noisy environments. Further work would examine other fricatives and phonetic contexts, as well as interaction with other cues.

This work aims to improve the quality of visual speech synthesis by modelling its emotional characteristics. The emotion specific speech content is analysed based on the 3D video dataset of expressive speech. Preliminary results indicate a promising relation between the chosen features of visual speech and emotional content.

The topic of sound zone reproduction, whereby listeners sharing an acoustic space can receive personalized audio content, has been researched for a number of years. Recently, a number of sound zone systems have been realized, moving the concept towards becoming a practical reality. Current implementations of sound zone systems have relied upon conventional loudspeaker geometries such as linear and circular arrays. Line arrays may be compact, but do not necessarily give the system the opportunity to compensate for room reflections in real-world environments. Circular arrays give this opportunity, and also give greater flexibility for spatial audio reproduction, but typically require large numbers of loudspeakers in order to reproduce sound zones over an acceptable bandwidth. Therefore, one key area of research standing between the ideal capability and the performance of a physical system is that of establishing the number and location of the loudspeakers comprising the reproduction array. In this study, the topic of loudspeaker configurations was considered for two-zone reproduction, using a circular array of 60 loudspeakers as the candidate set for selection. A numerical search procedure was used to select a number of loudspeakers from the candidate set. The novel objective function driving the search comprised terms relating to the acoustic contrast between the zones, array effort, matrix condition number, and target zone planarity. The performance of the selected sets using acoustic contrast control was measured in an acoustically treated studio. Results demonstrate that the loudspeaker selection process has potential for maximising the contrast over frequency by increasing the minimum contrast over the frequency range 100–4000 Hz. The array effort and target planarity can also be optimised, depending on the formulation of the objective function. Future work should consider greater diversity of candidate locations.
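
For readers unfamiliar with the acoustic contrast term mentioned in this abstract, the sketch below shows how contrast is commonly evaluated: the ratio, in dB, of the spatially averaged squared pressure in the target (bright) zone to that in the quiet (dark) zone. It is a minimal illustration at a single frequency, not the objective function of the study; the function names and the pressure values are hypothetical.

```python
import math

def mean_square_pressure(pressures):
    """Spatially averaged squared magnitude of complex pressures."""
    return sum(abs(p) ** 2 for p in pressures) / len(pressures)

def acoustic_contrast_db(bright, dark):
    """Contrast between the bright and dark zones in dB."""
    return 10.0 * math.log10(mean_square_pressure(bright) / mean_square_pressure(dark))

# Hypothetical complex pressures sampled at two microphones per zone.
bright_zone = [1.0 + 0.0j, 0.0 + 1.0j]
dark_zone = [0.1 + 0.0j, 0.0 - 0.1j]
contrast = acoustic_contrast_db(bright_zone, dark_zone)
```

Repeating the calculation across frequency bins and taking the minimum would give a figure comparable in spirit to the minimum contrast over a band that the abstract refers to.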

We present a novel method for separating speech from audio mixtures using audio-visual coherence. It consists of two stages: in the off-line training process, we use the Gaussian mixture model to characterise statistically the audio-visual coherence with features obtained from the training set; at the separation stage, likelihood maximization is performed on the independent component analysis (ICA)-separated spectral components. To address the permutation and scaling indeterminacies of the frequency-domain blind source separation (BSS), a new sorting and rescaling scheme using the bimodal coherence is proposed. We tested our algorithm on the XM2VTS database, and the results show that our algorithm can address the permutation problem with high accuracy, and mitigate the scaling problem effectively.

Information from video has been used recently to address the issue of scaling
ambiguity in convolutive blind source separation (BSS) in the frequency domain,
based on statistical modeling of the audio-visual coherence with Gaussian
mixture models (GMMs) in the feature space. However, outliers in the feature
space may greatly degrade the system performance in both training and
separation stages. In this paper, a new feature selection scheme is proposed to
discard non-stationary features, which improves the robustness of the coherence
model and reduces its computational complexity. The scaling parameters obtained
by coherence maximization and non-linear interpolation from the selected
features are applied to the separated frequency components to mitigate the
scaling ambiguity. A multimodal database composed of different combinations of
vowels and consonants was used to test our algorithm. Experimental results show
the performance improvement with our proposed algorithm.

The two distinct sound sources comprising voiced frication, voicing and frication, interact. One effect is that the periodic source at the glottis modulates the amplitude of the frication source originating in the vocal tract above the constriction. Voicing strength and modulation depth for frication noise were measured for sustained English voiced fricatives using high-pass filtering, spectral analysis in the modulation (envelope) domain, and a variable pitch compensation procedure. Results show a positive relationship between strength of the glottal source and modulation depth at voicing strengths below 66 dB SPL, at which point the modulation index was approximately 0.5 and saturation occurred. The alveolar [z] was found to be more modulated than other fricatives.
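
The modulation index quoted in this abstract can be illustrated with a small sketch that reads it off the envelope spectrum as the amplitude of the component at the voicing frequency divided by the mean (DC) level. This assumes an idealised, already-extracted envelope spanning a whole number of pitch periods; the high-pass filtering and pitch compensation described above are not reproduced, and the function name and signal parameters are hypothetical.

```python
import cmath
import math

def modulation_index(envelope, fs, f0):
    """Ratio of the envelope component at f0 to its mean (DC) level."""
    n = len(envelope)
    dc = sum(envelope) / n
    # Single-bin DFT at f0; the factor 2/n converts it to a sinusoid amplitude.
    bin_f0 = sum(e * cmath.exp(-2j * math.pi * f0 * k / fs)
                 for k, e in enumerate(envelope))
    return abs(2.0 * bin_f0 / n) / dc

# Synthetic envelope: 10 periods of a 100 Hz modulation with depth 0.5.
fs, f0, depth = 8000, 100, 0.5
envelope = [1.0 + depth * math.cos(2 * math.pi * f0 * k / fs) for k in range(800)]
m = modulation_index(envelope, fs, f0)
```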

Natural-sounding reproduction of sound over headphones requires accurate estimation of an individual's Head-Related Impulse Responses (HRIRs), capturing details relating to the size and shape of the body, head and ears. A stereo-vision face capture system was used to obtain 3D geometry, which provided surface data for boundary element method (BEM) acoustical simulation. Audio recordings were filtered by the output HRIRs to generate samples for a comparative listening test alongside samples generated with dummy-head HRIRs. Preliminary assessment showed better localization judgements with the personalized HRIRs by the corresponding participant, whereas other listeners performed better with dummy-head HRIRs, which is consistent with expectations for personalized HRIRs. The use of visual measurements for enhancing users' auditory experience merits investigation with additional participants.

Planarity panning (PP) and planarity control (PC) have previously been shown to be efficient methods for focusing directional sound energy into listening zones. In this paper, we consider sound field control for two listeners. First, PP is extended to create spatial audio for two listeners consuming the same spatial audio content. Then, PC is used to create highly directional sound and cancel interfering audio. Simulation results compare PP and PC against pressure matching (PM) solutions. For multiple listeners listening to the same content, PP creates directional sound at lower effort than the PM counterpart. When listeners consume different audio, PC produces greater acoustic contrast than PM, with excellent directional control except for frequencies where grating lobes generate problematic interference patterns.

The ability to predict the acoustics of a room without acoustical measurements is a useful capability. The motivation here stems from spatial audio reproduction, where knowledge of the acoustics of a space could allow for more accurate reproduction of a captured environment, or for reproduction room compensation techniques to be applied. A cuboid-based room geometry estimation method using a spherical camera is proposed, assuming a room and objects inside can be represented as cuboids aligned to the main axes of the coordinate system. The estimated geometry is used to produce frequency-dependent acoustic predictions based on geometrical room modelling techniques. Results are compared to measurements through calculated reverberant spatial audio object parameters used for reverberation reproduction customized to the given loudspeaker set up.

Studies on perceived audio-visual spatial coherence in the literature have commonly employed continuous judgment scales. This method requires listeners to detect and to quantify their perception of a given feature and is a difficult task, particularly for untrained listeners. An alternative method is the quantification of a percept by conducting a simple forced choice test with subsequent modeling of the psychometric function. An experiment to validate this alternative method for the perception of azimuthal audio-visual spatial coherence was performed. Furthermore, information on participant training and localization ability was gathered. The results are consistent with previous research and show that the proposed methodology is suitable for this kind of test. The main differences between participants result from the presence or absence of musical training.

Estimating and parameterizing the early and late reflections of an enclosed space is an interesting topic in acoustics. With a suitable set of parameters, the current concept of a spatial audio object (SAO), which is typically limited to either direct (dry) sound or diffuse field components, could be extended to afford an editable spatial description of the room acoustics. In this paper we present an analysis/synthesis method for parameterizing a set of measured room impulse responses (RIRs). RIRs were recorded in a medium-sized auditorium, using a uniform circular array of microphones representing the perspective of a listener in the front row. During the analysis process, these RIRs were decomposed, in time, into three parts: the direct sound, the early reflections, and the late reflections. From the direct sound and early reflections, parameters were extracted for the length, amplitude, and direction of arrival (DOA) of the propagation paths by exploiting the dynamic programming projected phase-slope algorithm (DYPSA) and classical delay-and-sum beamformer (DSB). Their spectral envelope was calculated using linear predictive coding (LPC). Late reflections were modeled by frequency-dependent decays excited by band-limited Gaussian noise. The combination of these parameters for a given source position and the direct source signal represents the reverberant or 'wet' spatial audio object. RIRs synthesized for a specified rendering and reproduction arrangement were convolved with dry sources to form reverberant components of the sound scene. The resulting signals demonstrated potential for these techniques, e.g., in SAO reproduction over a 22.2 surround sound system.

Acoustic reflector localization is an important issue in audio signal processing, with direct applications in spatial audio, scene reconstruction, and source separation. Several methods have recently been proposed to estimate the 3D positions of acoustic reflectors given room impulse responses (RIRs). In this article, we categorize these methods as 'image-source reversion', which localizes the image source before finding the reflector position, and 'direct localization', which localizes the reflector without intermediate steps. We present five new contributions. First, an onset detector, called the clustered dynamic programming projected phase-slope algorithm, is proposed to automatically extract the time of arrival for early reflections within the RIRs of a compact microphone array. Second, we propose an image-source reversion method that uses the RIRs from a single loudspeaker. It is constructed by combining an image source locator (the image source direction and range (ISDAR) algorithm), and a reflector locator (using the loudspeaker-image bisection (LIB) algorithm). Third, two variants of it, exploiting multiple loudspeakers, are proposed. Fourth, we present a direct localization method, the ellipsoid tangent sample consensus (ETSAC), exploiting ellipsoid properties to localize the reflector. Finally, systematic experiments on simulated and measured RIRs are presented, comparing the proposed methods with the state-of-the-art. ETSAC generates errors lower than the alternative methods compared through our datasets. Nevertheless, the ISDAR-LIB combination performs well and has a run time 200 times faster than ETSAC.
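
The geometric idea behind image-source reversion can be sketched with the textbook mirror relation for a first-order reflection: a planar reflector lies on the perpendicular bisector of the segment joining the loudspeaker and its image source. The code below illustrates that relation only and is not claimed to be the LIB algorithm as published; the function name and the coordinates are hypothetical.

```python
import math

def reflector_from_image(source, image):
    """Return (point_on_plane, unit_normal) of the plane mirroring source to image."""
    midpoint = [(s + i) / 2.0 for s, i in zip(source, image)]
    diff = [i - s for s, i in zip(source, image)]
    length = math.sqrt(sum(d * d for d in diff))
    normal = [d / length for d in diff]
    return midpoint, normal

# Hypothetical layout: loudspeaker 1 m in front of a wall at x = 3 m,
# so its first-order image source sits 1 m behind the wall at x = 4 m.
point, normal = reflector_from_image([2.0, 1.0, 1.5], [4.0, 1.0, 1.5])
```

Here `point` lies on the wall and `normal` points from the loudspeaker towards the image, which is enough to define the reflecting plane.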

Multi-point approaches for sound field control generally sample the listening zone(s) with pressure
microphones, and use these measurements as an input for an optimisation cost function.
A number of techniques are based on this concept, for single-zone (e.g. least-squares pressure
matching (PM), brightness control, planarity panning) and multi-zone (e.g. PM, acoustic contrast
control, planarity control) reproduction. Accurate performance predictions are obtained when distinct
microphone positions are employed for setup versus evaluation. While, in simulation, one
can afford a dense sampling of virtual microphones, it is desirable in practice to have a microphone
array which can be positioned once in each zone to measure the setup transfer functions
between each loudspeaker and that zone. In this contribution, we present simulation results over
a fixed dense set of evaluation points comparing the performance of several multi-point optimisation
approaches for 2D reproduction with a 60 channel circular loudspeaker arrangement. Various
regular setup microphone arrays are used to calculate the sound zone filters: circular grid, circular,
dual-circular, and spherical arrays, each with different numbers of microphones. Furthermore, the
effect of a rigid spherical baffle is studied for the circular and spherical arrangements. The results
of this comparative study show how the directivity and effective frequency range of multi-point
optimisation techniques depend on the microphone array used to sample the zones. In general,
microphone arrays with dense spacing around the boundary give better angular discrimination,
leading to more accurate directional sound reproduction, while those distributed around the whole
zone enable more accurate prediction of the reproduced target sound pressure level.

Recent work into 3D audio reproduction has considered the definition of a set of parameters to encode
reverberation into an object-based audio scene. The reverberant spatial audio object (RSAO)
describes the reverberation in terms of a set of localised, delayed and filtered (early) reflections,
together with a late energy envelope modelling the diffuse late decay. The planarity metric, originally
developed to evaluate the directionality of reproduced sound fields, is used to analyse a set of
multichannel room impulse responses (RIRs) recorded at a microphone array. Planarity describes
the spatial compactness of incident sound energy, which tends to decrease as the reflection density
and diffuseness of the room response develop over time. Accordingly, planarity complements
intensity-based diffuseness estimators, which quantify the degree to which the sound field at a
discrete frequency within a particular time window is due to an impinging coherent plane wave.
In this paper, we use planarity as a tool to analyse the sound field in relation to the RSAO parameters.
Specifically, we use planarity to estimate two important properties of the sound field. First,
as high planarity identifies the most localised reflections along the RIR, we estimate the most
planar portions of the RIR, corresponding to the RSAO early reflection model and increasing the
likelihood of detecting prominent specular reflections. Second, as diffuse sound fields give a low
planarity score, we investigate planarity for data-based mixing time estimation. Results show
that planarity estimates on measured multichannel RIR datasets represent a useful tool for room
acoustics analysis and RSAO parameterisation.

Room Impulse Responses (RIRs) measured with microphone arrays capture spatial and nonspatial
information, e.g. the early reflections' directions and times of arrival, the size of the
room and its absorption properties. The Reverberant Spatial Audio Object (RSAO) was proposed
as a method to encode room acoustic parameters from measured array RIRs. As the RSAO is
object-based audio compatible, its parameters can be rendered to arbitrary reproduction systems
and edited to modify the reverberation characteristics, to improve the user experience. Various
microphone array designs have been proposed for sound field and room acoustic analysis, but a
comparative performance evaluation is not available. This study assesses the performance of five
regular microphone array geometries (linear, rectangular, circular, dual-circular and spherical) to
capture RSAO parameters for the direct sound and early reflections of RIRs. The image source
method is used to synthesise RIRs at the microphone positions as well as at the centre of the array.
From the array RIRs, the RSAO parameters are estimated and compared to the reference parameters
at the centre of the array. A performance comparison among the five arrays is established
as well as the effect of a rigid spherical baffle for the circular and spherical arrays. The effects
of measurement uncertainties, such as microphone misplacement and sensor noise errors, are also
studied. The results show that planar arrays achieve the most accurate horizontal localisation
whereas the spherical arrays perform best in elevation. Arrays with smaller apertures achieve a
higher number of detected reflections, which becomes more significant for the smaller room with
higher reflection density.

Recent work on a reverberant spatial audio object (RSAO) encoded spatial room impulse responses
(RIRs) as object-based metadata which can be synthesized in an object-based renderer. Encoding
reverberation into metadata presents new opportunities for end users to interact with and personalize
reverberant content. The RSAO models an RIR as a set of early reflections together with a late
reverberation filter. Previous work to encode the RSAO parameters was based on recordings made
with a dense array of omnidirectional microphones. This paper describes RSAO parameterization from
first-order Ambisonic (B-Format) RIRs, making the RSAO compatible with existing spatial reverb
libraries. The object-based implementation achieves reverberation time, early decay time, clarity and
interaural cross-correlation similar to direct Ambisonic rendering of 13 test RIRs.
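
As a rough illustration of how objective measures such as clarity and decay time can be read from an RIR, the sketch below uses Schroeder backward integration on a synthetic, noiseless exponential decay. The function names, the 50 ms split, the fitting range and the test signal are assumptions made for illustration, not the evaluation procedure of the paper.

```python
import numpy as np

def clarity_db(rir, fs, split_ms=50.0):
    """Early-to-late energy ratio in dB (C50 for split_ms=50), timed from the first sample."""
    n = int(round(fs * split_ms / 1000.0))
    energy = np.asarray(rir, dtype=float) ** 2
    return 10.0 * np.log10(energy[:n].sum() / energy[n:].sum())

def energy_decay_curve_db(rir):
    """Schroeder backward integration, normalised to 0 dB at time zero."""
    energy = np.asarray(rir, dtype=float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]
    return 10.0 * np.log10(edc / edc[0])

def decay_time_s(edc_db, fs, hi_db=-5.0, lo_db=-35.0):
    """Extrapolate a 60 dB decay time from a line fitted between hi_db and lo_db."""
    idx = np.where((edc_db <= hi_db) & (edc_db >= lo_db))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)
    return -60.0 / slope

# Synthetic envelope whose energy falls by 60 dB in rt60_true seconds (assumed values)
fs = 16000
rt60_true = 0.5
t = np.arange(fs) / fs
rir = 10.0 ** (-3.0 * t / rt60_true)

edc = energy_decay_curve_db(rir)
c50 = clarity_db(rir, fs)
rt60_est = decay_time_s(edc, fs)
```

Early decay time can be obtained from the same curve by fitting over the first 10 dB of decay instead.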

Environmental audio tagging aims to predict only the presence or absence of certain acoustic events in the acoustic scene of interest. In this paper we make contributions to audio tagging in two areas: acoustic modeling and feature learning. We propose to use a shrinking deep neural network (DNN) framework incorporating unsupervised feature learning to handle the multi-label classification task. For the acoustic modeling, a large set of contextual frames of the chunk is fed into the DNN to perform multi-label classification for the expected tags, considering that only chunk-level (or utterance-level) rather than frame-level labels are available. Dropout and background-noise-aware training are also adopted to improve the generalization capability of the DNNs. For the unsupervised feature learning, we propose to use a symmetric or asymmetric deep de-noising auto-encoder (syDAE or asyDAE) to generate new data-driven features from the logarithmic Mel-filter bank (MFB) features. The new features, which are smoothed against background noise and more compact with contextual information, can further improve the performance of the DNN baseline. Compared with the standard Gaussian Mixture Model (GMM) baseline of the DCASE 2016 audio tagging challenge, our proposed method obtains a significant equal error rate (EER) reduction from 0.21 to 0.13 on the development set. The proposed asyDAE system achieves a relative 6.7% EER reduction compared with the strong DNN baseline on the development set. Finally, the results also show that our approach obtains state-of-the-art performance with 0.15 EER on the evaluation set of the DCASE 2016 audio tagging task, while the EER of the first-prize system of this challenge is 0.17.
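
For readers unfamiliar with the metric, the following is a minimal sketch of how an equal error rate can be computed for a single tag by sweeping a decision threshold, together with the arithmetic behind a relative reduction. The function names and toy scores are hypothetical; the challenge's official scoring may differ in detail.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Sweep thresholds and return the error rate where false alarms and misses are closest."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = np.inf, 0.0
    for threshold in np.unique(scores):
        predicted = scores >= threshold
        false_alarm = np.mean(predicted[~labels])  # negatives tagged as present
        miss = np.mean(~predicted[labels])         # positives tagged as absent
        gap = abs(false_alarm - miss)
        if gap < best_gap:
            best_gap, eer = gap, 0.5 * (false_alarm + miss)
    return float(eer)

def relative_reduction(baseline, proposed):
    """Fractional improvement of the proposed value over the baseline."""
    return (baseline - proposed) / baseline

# Toy example: two positive and two negative chunks (made-up scores)
toy_eer = equal_error_rate([0.9, 0.6, 0.7, 0.2], [1, 1, 0, 0])

# Figures quoted in the abstract above
gmm_to_proposed = relative_reduction(0.21, 0.13)
first_prize_to_proposed = relative_reduction(0.17, 0.15)
```

On the quoted figures, the drop from 0.21 to 0.13 corresponds to a relative reduction of roughly 38%, and 0.17 to 0.15 to roughly 12%.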

Deep neural networks (DNN) have recently been
shown to give state-of-the-art performance in monaural speech
enhancement. However, in the DNN training process, the perceptual
difference between different components of the DNN
output is not fully exploited, where equal importance is often
assumed. To address this limitation, we have proposed a new
perceptually-weighted objective function within a feedforward
DNN framework, aiming to minimize the perceptual difference
between the enhanced speech and the target speech. A perceptual
weight is integrated into the proposed objective function, and
has been tested on two types of output features: spectra and
ideal ratio masks. Objective evaluations for both speech quality
and speech intelligibility have been performed. Integration of our
perceptual weight shows consistent improvement on several noise
levels and a variety of different noise types.
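
The idea of weighting output components unequally can be sketched as a weighted mean-squared error over frequency bins. The weights below are arbitrary placeholders, not the perceptual weight proposed in the paper, and the ratio-mask definition (square root of the speech-to-mixture power ratio) is one common convention assumed here.

```python
import numpy as np

def ideal_ratio_mask(speech_power, noise_power, eps=1e-12):
    """One common ideal ratio mask definition: sqrt(S / (S + N)) per time-frequency unit."""
    speech_power = np.asarray(speech_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    return np.sqrt(speech_power / (speech_power + noise_power + eps))

def weighted_mse(estimate, target, bin_weights):
    """Squared error averaged over bins with per-bin weights, then averaged over frames."""
    squared_error = (np.asarray(estimate, dtype=float) - np.asarray(target, dtype=float)) ** 2
    per_frame = np.average(squared_error, axis=1, weights=bin_weights)
    return float(per_frame.mean())

# Two frames, three frequency bins; the error sits only in the first bin (made-up data)
target = np.array([[1.0, 0.5, 0.2], [0.8, 0.4, 0.1]])
estimate = target + np.array([[0.1, 0.0, 0.0], [0.1, 0.0, 0.0]])

uniform = weighted_mse(estimate, target, [1.0, 1.0, 1.0])      # equal importance
emphasised = weighted_mse(estimate, target, [3.0, 1.0, 1.0])   # first bin weighted more
mask_equal_power = ideal_ratio_mask(1.0, 1.0)
```

With equal weights the loss reduces to the ordinary mean-squared error; raising the weight of a bin makes errors in that bin cost more during training.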

Whilst it is possible to create exciting, immersive listening experiences with current spatial audio
technology, the required systems are generally difficult to install in a standard living room. However,
in any living room there is likely to already be a range of loudspeakers (such as mobile phones,
tablets, laptops, and so on). "Media device orchestration" (MDO) is the concept of utilising all
available devices to augment the reproduction of a media experience. In this demonstration, MDO is
used to augment low channel count renderings of various programme material, delivering immersive
three-dimensional audio experiences.

Automatic and fast tagging of natural sounds in audio collections is a very challenging task due to wide acoustic variations, the large number of possible tags, and the incomplete and ambiguous tags provided by different labellers. To handle these problems, we use a co-regularization approach to learn a pair of classifiers on sound and text. The first classifier maps low-level audio features to a true tag list. The second classifier maps actively corrupted tags to the true tags, reducing incorrect mappings caused by low-level acoustic variations in the first classifier, and augmenting the tags with additional relevant tags. Training of the classifiers is implemented using marginal co-regularization, which draws the pair of classifiers into agreement through joint optimization. We evaluate this approach on two sound datasets, Freefield1010 and Task4 of DCASE2016. The results obtained show that marginal co-regularization outperforms the baseline GMM in both efficiency and effectiveness.

State-of-the-art binaural objective intelligibility measures (OIMs) require individual source signals for making intelligibility predictions, limiting their usability in real-time online operations. This limitation may be addressed by a blind source separation (BSS) process, which is able to extract the underlying sources from a mixture. In this study, a speech source is presented with either a stationary noise masker or a fluctuating noise masker whose azimuth varies in a horizontal plane, at two speech-to-noise ratios (SNRs). Three binaural OIMs are used to predict speech intelligibility from the signals separated by a BSS algorithm. The model predictions are compared with listeners' word identification rate in a perceptual listening experiment. The results suggest that with SNR compensation to the BSS-separated speech signal, the OIMs can maintain their predictive power for individual maskers compared to their performance measured from the direct signals. It also reveals that the errors in SNR between the estimated signals are not the only factors that decrease the predictive accuracy of the OIMs with the separated signals. Artefacts or distortions on the estimated signals caused by the BSS algorithm may also be concerns.

In this paper we propose a cuboid-based air-tight indoor
room geometry estimation method using a combination
of audio-visual sensors. Existing vision-based 3D reconstruction
methods are not applicable for scenes with transparent
or reflective objects such as windows and mirrors. In
this work we fuse multi-modal sensory information to overcome
the limitations of purely visual reconstruction for reconstruction
of complex scenes including transparent and
mirror surfaces. A full scene is captured by 360° cameras
and acoustic room impulse responses (RIRs) recorded by a
loudspeaker and compact microphone array. Depth information
of the scene is recovered by stereo matching from the
captured images and estimation of major acoustic reflector
locations from the sound. The coordinate systems for audiovisual
sensors are aligned into a unified reference frame and
plane elements are reconstructed from audio-visual data.
Finally cuboid proxies are fitted to the planes to generate a
complete room model. Experimental results show that the
proposed system generates complete representations of the
room structures regardless of transparent windows, featureless
walls and shiny surfaces.

From a physical point of view, sound is classically defined by wave functions. Like every other physical model based on waves, during its propagation, it interacts with the obstacles it encounters. These interactions result in reflections of the main signal that can be defined as either being supportive or interfering. In the signal processing research field, it is, therefore, important to identify these reflections, in order to either exploit or avoid them, respectively.

The main contribution of this thesis focuses on acoustic reflector localisation. Four novel methods are proposed: a method localising the image source before finding the reflector position; two variants of this method, which utilise information from multiple loudspeakers; and a method directly localising the reflector without any pre-processing. Finally, utilising both simulated and measured data, a comparative evaluation is conducted among different acoustic reflector localisation methods. The results show the last proposed method outperforming the state-of-the-art. The second contribution of this thesis is given by applying the acoustic reflector localisation solution to spatial audio, with the main objective of giving listeners the sensation of being in the recorded environment. A novel way of encoding and decoding the room acoustic information is proposed, by parametrising sounds, and defining them as reverberant spatial audio objects (RSAOs). A set of subjective assessments is performed. The results demonstrate both the high quality of the sound produced by the proposed parametrisation, and the reliability of manually modifying the acoustics of recorded environments. The third contribution is proposed in the field of speech source separation. A modified version of a state-of-the-art method is presented, where the direct sound and first reflection information is utilised to model binaural cues. Experiments were performed to separate speech sources in different environments. The results show the new method to outperform the state-of-the-art, where one interferer is present in the recordings.

The simulation and experimental results presented in this thesis represent a significant addition to the literature and will influence the future choices of acoustic reflector localisation systems, 3D rendering, and source separation techniques. Future work may focus on the fusion of acoustic and visual cues to enhance the acoustic scene analysis.

In object-based spatial audio systems, positions of the
audio objects (e.g. speakers/talkers or voices) presented in the
sound scene are required as important metadata attributes for
object acquisition and reproduction. Binaural microphones are
often used as a physical device to mimic human hearing and to
monitor and analyse the scene, including localisation and tracking
of multiple speakers. The binaural audio tracker, however, is
usually prone to the errors caused by room reverberation and
background noise. To address this limitation, we present a
multimodal tracking method by fusing the binaural audio with
depth information (from a depth sensor, e.g., Kinect). More
specifically, the PHD filtering framework is first applied to the
depth stream, and a novel clutter intensity model is proposed
to improve the robustness of the PHD filter when an object
is occluded either by other objects or due to the limited field
of view of the depth sensor. To compensate for mis-detections in
the depth stream, a novel gap filling technique is presented to
map audio azimuths obtained from the binaural audio tracker to
3D positions, using speaker-dependent spatial constraints learned
from the depth stream. With our proposed method, both the
errors in the binaural tracker and the mis-detections in the depth
tracker can be significantly reduced. Real-room recordings are
used to show the improved performance of the proposed method
in removing outliers and reducing mis-detections.

Object-based audio is an emerging representation
for audio content, where content is represented in a reproduction-format-agnostic
way and thus produced once for consumption on
many different kinds of devices. This affords new opportunities
for immersive, personalized, and interactive listening experiences.
This article introduces an end-to-end object-based spatial audio
pipeline, from sound recording to listening. A high-level
system architecture is proposed, which includes novel audio-visual
interfaces to support object-based capture and listener-tracked
rendering, and incorporates a proposed component for
objectification, i.e., recording content directly into an object-based
form. Text-based and extensible metadata enable communication
between the system components. An open architecture for object
rendering is also proposed.
The system's capabilities are evaluated in two parts. First,
listener-tracked reproduction of metadata automatically estimated
from two moving talkers is evaluated using an objective
binaural localization model. Second, object-based scene capture
with audio extracted using blind source separation (to remix
between two talkers) and beamforming (to remix a recording of
a jazz group), is evaluated with perceptually-motivated objective
and subjective experiments. These experiments demonstrate that
the novel components of the system add capabilities beyond
the state of the art. Finally, we discuss challenges and future
perspectives for object-based audio workflows.

The process of understanding acoustic properties of environments
is important for several applications, such as spatial
audio, augmented reality and source separation. In this paper,
multichannel room impulse responses are recorded and transformed
into their direction of arrival (DOA)-time domain, by
employing a superdirective beamformer. This domain can be
represented as a 2D image. Hence, a novel image processing
method is proposed to analyze the DOA-time domain, and
estimate the reflection times of arrival and DOAs. The main
acoustically reflective objects are then localized. Recent studies
in acoustic reflector localization usually assume the room
to be free from furniture. Here, by analyzing the scattered
reflections, an algorithm is also proposed to binary classify
reflectors into room boundaries and interior furniture. Experiments
were conducted in four rooms. The classification
algorithm showed high quality performance, also improving
the localization accuracy, for non-static listener scenarios.

In this paper, we propose a divide-and-conquer approach using
two generative adversarial networks (GANs) to explore
how a machine can draw colorful pictures (birds) using a small
amount of training data. In our work, we simulate the procedure
of an artist drawing a picture, where one begins with
drawing objects' contours and edges and then paints them
different colors. We adopt two GAN models to process basic
visual features including shape, texture and color. We use
the first GAN model to generate object shape, and then paint
the black and white image based on the knowledge learned
using the second GAN model. We run our experiments on
600 color images. The experimental results show that the use
of our approach can generate good quality synthetic images,
comparable to real ones.

Object-based audio has the potential to enable multimedia
content to be tailored to individual listeners and their reproduction
equipment. In general, object-based production assumes that the
objects (the assets comprising the scene) are free of noise and interference.
However, there are many applications in which signal separation
could be useful to an object-based audio workflow, e.g., extracting
individual objects from channel-based recordings or legacy content, or
recording a sound scene with a single microphone array. This paper describes
the application and evaluation of blind source separation (BSS)
for sound recording in a hybrid channel-based and object-based workflow,
in which BSS-estimated objects are mixed with the original stereo
recording. A subjective experiment was conducted using simultaneously
spoken speech recorded with omnidirectional microphones in a reverberant
room. Listeners mixed a BSS-extracted speech object into the
scene to make the quieter talker clearer, while retaining acceptable audio
quality, compared to the raw stereo recording. Objective evaluations
show that the relative short-term objective intelligibility and speech quality
scores increase using BSS. Further objective evaluations are used to
discuss the influence of the BSS method on the remixing scenario; the
scenario shown by human listeners to be useful in object-based audio is
shown to be a worse-case scenario.

The challenge of installing and setting up dedicated spatial audio systems
can make it difficult to deliver immersive listening experiences to the general
public. However, the proliferation of smart mobile devices and the rise of
the Internet of Things mean that there are increasing numbers of connected
devices capable of producing audio in the home. "Media device orchestration"
(MDO) is the concept of utilizing an ad hoc set of devices to deliver
or augment a media experience. In this paper, the concept is evaluated by
implementing MDO for augmented spatial audio reproduction using object-based
audio with semantic metadata. A thematic analysis of positive and
negative listener comments about the system revealed three main categories
of response: perceptual, technical, and content-dependent aspects. MDO
performed particularly well in terms of immersion/envelopment, but the
quality of listening experience was partly dependent on loudspeaker quality
and listener position. Suggestions for further development based on these
categories are given.

In this paper, we propose an iterative deep neural network
(DNN)-based binaural source separation scheme, for recovering
two concurrent speech signals in a room environment.
Besides the commonly-used spectral features, the DNN also
takes non-linearly wrapped binaural spatial features as input,
which are refined iteratively using parameters estimated from
the DNN output via a feedback loop. Different DNN structures
have been tested, including a classic multilayer perceptron
regression architecture as well as a new hybrid network
with both convolutional and densely-connected layers. Objective
evaluations in terms of PESQ and STOI showed consistent
improvement over baseline methods using traditional
binaural features, especially when the hybrid DNN architecture
was employed. In addition, our proposed scheme is robust
to mismatches between the training and testing data.

We present a novel pipeline to estimate reverberant
spatial audio object (RSAO) parameters given room
impulse responses (RIRs) recorded by ad-hoc microphone
arrangements. The proposed pipeline performs
three tasks: direct-to-reverberant-ratio (DRR) estimation;
microphone localization; RSAO parametrization.
RIRs recorded at Bridgewater Hall by microphones
arranged for a BBC Philharmonic Orchestra performance
were parametrized. Objective measures of
the rendered RSAO reverberation characteristics were
evaluated and compared with reverberation recorded
by a Soundfield microphone. Alongside informal listening
tests, the results confirmed that the rendered
RSAO gave a plausible reproduction of the hall, comparable
to the measured response. The objectification
of the reverb from in-situ RIR measurements unlocks
customization and personalization of the experience
for different audio systems, user preferences and playback
environments.

Binaural recording technology offers an inexpensive, portable solution for spatial audio capture. In this paper, a
full-sphere 2D localization method is proposed which utilizes the Model-Based Expectation-Maximization Source
Separation and Localization system (MESSL). The localization model is trained using a full-sphere head related
transfer function dataset and produces localization estimates by maximum-likelihood of frequency-dependent
interaural parameters. The model's robustness is assessed using matched and mismatched HRTF datasets between
test and training data, with environmental sounds and speech. Results show that the majority of sounds are estimated
correctly with the matched condition in low noise levels; for the mismatched condition, a "cone of confusion" arises,
albeit with effective estimation of lateral angles. Additionally, the results show a relationship between the spectral
content of the test data and the performance of the proposed method.

We propose a novel method for full-sphere binaural sound
source localization that is designed to be robust to real world recording
conditions. A mask is proposed that is designed to remove
diffuse noise and early room reflections. The method makes use of
the interaural phase difference (IPD) for lateral angle localization
and spectral cues for polar angle localization. The method is tested
using different HRTF datasets to generate the test data and training
data. The method is also tested in the presence of additive
noise and reverberation. The method outperforms state-of-the-art
binaural localization methods for most testing conditions.
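
The interaural phase difference used for lateral localization can be illustrated with a minimal sketch. It assumes an idealised pure delay between the two ear signals, with no head filtering, noise or masking; the function name, sign convention and delay value are hypothetical.

```python
import numpy as np

def interaural_phase_difference(left_spec, right_spec):
    """Phase of the left spectrum relative to the right, wrapped to (-pi, pi]."""
    return np.angle(left_spec * np.conj(right_spec))

# Idealised case: the right ear receives the same signal 0.3 ms later (assumed delay)
freqs = np.array([250.0, 500.0, 1000.0, 2000.0])   # analysis frequencies in Hz
delay = 3e-4                                        # seconds
left = np.ones_like(freqs, dtype=complex)
right = np.exp(-2j * np.pi * freqs * delay)

ipd = interaural_phase_difference(left, right)
delay_estimate = ipd / (2.0 * np.pi * freqs)        # valid only where the phase has not wrapped
```

In this example the first three frequencies recover the 0.3 ms delay, while at 2 kHz the phase wraps and the naive estimate is wrong. This ambiguity is one reason IPD is usually combined with other cues at higher frequencies.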

At the University of Surrey (Guildford, UK), we have brought together research groups in different disciplines, with a shared interest in audio, to work on a range of collaborative research projects. In the Centre for Vision, Speech and Signal Processing (CVSSP) we focus on technologies for machine perception of audio scenes; in the Institute of Sound Recording (IoSR) we focus on research into human perception of audio quality; the Digital World Research Centre (DWRC) focusses on the design of digital technologies; while the Centre for Digital Economy (CoDE) focusses on new business models enabled by digital technology. This interdisciplinary view, across different traditional academic departments and faculties, allows us to undertake projects which would be impossible for a single research group. In this poster we will present an overview of some of these interdisciplinary projects, including projects in spatial audio, sound scene and event analysis, and creative commons audio.

In applications such as virtual and augmented reality, a plausible
and coherent audio-visual reproduction can be achieved by deeply
understanding the reference scene acoustics. This requires knowledge
of the scene geometry and related materials. In this paper, we
present an audio-visual approach for acoustic scene understanding.
We propose a novel material recognition algorithm, that exploits
information carried by acoustic signals. The acoustic absorption
coefficients are selected as features. The training dataset was constructed
by combining information available in the literature, and
additional labeled data that we recorded in a small room having
short reverberation time (RT60). Classic machine learning methods
are used to validate the model, by employing data recorded in five
rooms, having different sizes and RT60s. The estimated materials
are utilized to label room boundaries, reconstructed by a vision-based
method. Results show 89% and 80% agreement between the
estimated and reference room volumes and materials, respectively.

Object-based audio can be used to customize, personalize, and optimize audio reproduction depending on the specific listening scenario. To investigate and exploit the benefits of object-based audio, a framework for intelligent metadata adaptation was developed. The framework uses detailed semantic metadata that describes the audio objects, the loudspeakers, and the room. It features an extensible software tool for real-time metadata adaptation that can incorporate knowledge derived from perceptual tests and/or feedback from perceptual meters to drive adaptation and facilitate optimal rendering. One use case for the system is demonstrated through a rule-set (derived from perceptual tests with experienced mix engineers) for automatic adaptation of object levels and positions when rendering 3D content to two- and five-channel systems.

In this paper, we compare different deep neural
networks (DNN) in extracting speech signals from competing
speakers in room environments, including the conventional fully-connected
multilayer perceptron (MLP) network, convolutional
neural network (CNN), recurrent neural network (RNN), and
the recently proposed capsule network (CapsNet). Each DNN
takes input of both spectral features and converted spatial
features that are robust to position mismatch, and outputs the
separation mask for target source estimation. In addition, a
psychoacoustically-motivated objective function is integrated in
each DNN, which explores perceptual importance of each TF
unit in the training process. Objective evaluations are performed
on the separated sounds using the converged models, in terms
of PESQ, SDR as well as STOI. Overall, all the implemented
DNNs have greatly improved the quality and speech intelligibility
of the embedded target source as compared to the original
recordings. In particular, bidirectional RNN, either along the
temporal direction or along the frequency bins, outperforms the
other DNN structures with consistent improvement.

Loudspeaker-based sound systems, capable of a convincing reproduction of different audio streams to listeners in the same acoustic enclosure, are a convenient alternative to headphones. Such systems aim to generate "sound zones" in which target sound programmes are to be reproduced with minimum interference from any alternative programmes. This can be achieved with appropriate filtering of the source (loudspeaker) signals, so that the target sound's energy is directed to the chosen zone while being attenuated elsewhere. The existing methods are unable to produce the required sound energy ratio (acoustic contrast) between the zones with a small number of sources when strong room reflections are present. Optimization of parameters is therefore required for systems with practical limitations to improve their performance in reflective acoustic environments. One important parameter is positioning of sources with respect to the zones and room boundaries.

The first contribution of this thesis is a comparison of the key sound zoning methods implemented on compact and distributed geometrical source arrangements. The study presents previously unpublished detailed evaluation and ranking of such arrangements for systems with a limited number of sources in a reflective acoustic environment similar to a domestic room.

Motivated by the requirement to investigate the relationship between source positioning and performance in detail, the central contribution of this thesis is a study on optimizing source arrangements when strong individual room reflections occur. Small sound zone systems are studied analytically and numerically to reveal relationships between the geometry of source arrays and performance in terms of acoustic contrast and array effort (related to system efficiency). Three novel source position optimization techniques are proposed to increase the contrast, and geometrical means of reducing the effort are determined. Contrary to previously published case studies, this work presents a systematic examination of the key problem of first order reflections and proposes general optimization techniques, thus forming an important contribution.

The remaining contribution considers evaluation and comparison of the proposed techniques with two alternative approaches to sound zone generation under reflective conditions: acoustic contrast control (ACC) combined with anechoic source optimization and sound power minimization (SPM). The study provides a ranking of the examined approaches which could serve as a guideline for method selection for rooms with strong individual reflections.
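
Acoustic contrast and a basic form of acoustic contrast control can be sketched as follows. This assumes free-field monopole sources with no room reflections, a single frequency, and a textbook regularised eigenvector formulation of ACC; the geometry, regularisation value and function names are hypothetical and do not reproduce the systems studied in the thesis.

```python
import numpy as np

def transfer_matrix(mics, sources, k):
    """Free-field monopole transfer functions from each source to each microphone."""
    r = np.linalg.norm(mics[:, None, :] - sources[None, :, :], axis=2)
    return np.exp(-1j * k * r) / (4.0 * np.pi * r)

def acoustic_contrast_db(q, G_bright, G_dark):
    """Ratio of mean squared pressure in the bright zone to that in the dark zone, in dB."""
    bright = np.mean(np.abs(G_bright @ q) ** 2)
    dark = np.mean(np.abs(G_dark @ q) ** 2)
    return 10.0 * np.log10(bright / dark)

def acc_weights(G_bright, G_dark, reg=1e-6):
    """Source weights from the principal eigenvector of (R_dark + reg*I)^-1 R_bright."""
    R_b = G_bright.conj().T @ G_bright
    R_d = G_dark.conj().T @ G_dark
    A = np.linalg.solve(R_d + reg * np.eye(R_d.shape[0]), R_b)
    values, vectors = np.linalg.eig(A)
    return vectors[:, np.argmax(values.real)]

# Hypothetical layout in metres: a four-source line array and two small zones
k = 2.0 * np.pi * 500.0 / 343.0                      # wavenumber at 500 Hz
sources = np.array([[-0.3, 0.0], [-0.1, 0.0], [0.1, 0.0], [0.3, 0.0]])
offsets = np.array([[0.0, 0.0], [0.1, 0.0], [-0.1, 0.0], [0.0, 0.1], [0.0, -0.1]])
bright_mics = np.array([-1.0, 2.0]) + offsets
dark_mics = np.array([1.0, 2.0]) + offsets

G_b = transfer_matrix(bright_mics, sources, k)
G_d = transfer_matrix(dark_mics, sources, k)

uniform_contrast = acoustic_contrast_db(np.ones(4, dtype=complex), G_b, G_d)
acc_contrast = acoustic_contrast_db(acc_weights(G_b, G_d), G_b, G_d)
```

Because the layout is mirror-symmetric, equal source weights give no contrast between the zones, whereas the ACC weights favour the bright zone. Adding reflections would change the transfer matrices and, as the abstract notes, can limit the contrast achievable with few sources.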

Virtual Reality (VR) systems have been intensely explored, with several research communities investigating the
different modalities involved. Regarding the audio modality, one of the main issues is the generation of sound that
is perceptually coherent with the visual reproduction. Here, we propose a pipeline for creating plausible interactive
reverb using visual information: first, we characterize real environment acoustics given a pair of spherical cameras;
then, we reproduce reverberant spatial sound, by using the estimated acoustics, within a VR scene. The evaluation
is made by extracting the room impulse responses (RIRs) of four virtually rendered rooms. Results show agreement,
in terms of objective metrics, between the synthesized acoustics and the ones calculated from RIRs recorded within
the respective real rooms.

Recent progress in Virtual Reality (VR) and Augmented Reality
(AR) allows us to experience various VR/AR applications in our
daily life. In order to maximise the user's sense of immersion in VR/AR
environments, a plausible spatial audio reproduction synchronised
with visual information is essential. In this paper, we propose a
simple and efficient system to estimate room acoustics for plausible
reproduction of spatial audio using 360° cameras for VR/AR applications.
A pair of 360° images is used for room geometry and acoustic
property estimation. A simplified 3D geometric model of the scene
is estimated by depth estimation from captured images and semantic
labelling using a convolutional neural network (CNN). The real
environment acoustics are characterised by frequency-dependent
acoustic predictions of the scene. Spatially synchronised audio is
reproduced based on the estimated geometric and acoustic properties
in the scene. The reconstructed scenes are rendered with synthesised
spatial audio as VR/AR content. The results of estimated room
geometry and simulated spatial audio are evaluated against the actual
measurements and audio calculated from ground-truth Room
Impulse Responses (RIRs) recorded in the rooms.

Humans are able to identify a large number of environmental
sounds and categorise them according to high-level semantic
categories, e.g. urban sounds or music. They are also capable
of generalising from past experience to new sounds when
applying these categories. In this paper we report on the creation
of a data set that is structured according to the top-level
of a taxonomy derived from human judgements and the design
of an associated machine learning challenge, in which
strong generalisation abilities are required to be successful.
We introduce a baseline classification system, a deep convolutional
network, which showed strong performance with an
average accuracy on the evaluation data of 80.8%. The result
is discussed in the light of two alternative explanations:
an unlikely accidental category bias in the sound recordings
or a more plausible true acoustic grounding of the high-level
categories.

In order to maximise the immersion in VR environments, a plausible spatial audio reproduction synchronised with visual information is essential. In this work, we propose a pipeline to create plausible interactive audio from a pair of 360 degree cameras.

Frequency-invariant beamformers are useful for spatial audio capture since their attenuation of sources outside
the look direction is consistent across frequency. In particular, the least-squares beamformer (LSB) approximates
arbitrary frequency-invariant beampatterns with generic microphone configurations. This paper investigates the
effects of array geometry, directivity order and regularization for robust hypercardioid synthesis up to 15th order
with the LSB, using three 2D 32-microphone array designs (rectangular grid, open circular, and circular with
cylindrical baffle). While the directivity increases with order, the frequency range is inversely proportional to the
order and is widest for the cylindrical array. Regularization results in broadening of the mainlobe and reduced
on-axis response at low frequencies. The PEASS toolkit was used to perceptually evaluate the beamformed speech
signals.
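
A regularised least-squares fit of array weights to a target beampattern can be sketched as below. This is a minimal single-frequency illustration assuming far-field plane waves, an open circular array of eight omnidirectional microphones, and a first-order target of the form 0.25 + 0.75 cos(theta); the array size, phase sign convention, regularisation values and function names are assumptions, not the configurations evaluated in the paper.

```python
import numpy as np

def steering_matrix(mic_xy, angles, freq, c=343.0):
    """Plane-wave response of each microphone (columns) for each direction (rows)."""
    k = 2.0 * np.pi * freq / c
    directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    return np.exp(1j * k * (directions @ mic_xy.T))

def least_squares_beamformer(A, target, reg):
    """Minimise ||A w - target||^2 + reg * ||w||^2 (Tikhonov-regularised least squares)."""
    gram = A.conj().T @ A + reg * np.eye(A.shape[1])
    return np.linalg.solve(gram, A.conj().T @ target)

# Hypothetical open circular array: 8 microphones, 0.1 m radius
num_mics, radius = 8, 0.1
mic_angles = 2.0 * np.pi * np.arange(num_mics) / num_mics
mic_xy = radius * np.stack([np.cos(mic_angles), np.sin(mic_angles)], axis=1)

angles = np.linspace(0.0, 2.0 * np.pi, 72, endpoint=False)
target = 0.25 + 0.75 * np.cos(angles)               # first-order target, look direction 0 rad
A = steering_matrix(mic_xy, angles, freq=500.0)

w_light = least_squares_beamformer(A, target, reg=1e-6)
w_heavy = least_squares_beamformer(A, target, reg=1e-1)
error_light = np.linalg.norm(A @ w_light - target)
error_heavy = np.linalg.norm(A @ w_heavy - target)
```

Heavier regularisation shrinks the weight norm, which improves robustness to sensor noise and placement errors, at the cost of a larger deviation from the target pattern.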