This thesis conducts a series of investigations into the estimation of speaker location cues from multi-party meeting speech recordings. As participants in meetings generally remain stationary, speaker location information is fundamentally useful for higher-level tasks such as steering a microphone array beamformer towards an active speaker, or segmenting meeting speech into each speaker's period of participation for 'browsing' of recordings and speech recognition.
Whilst existing speaker location cues are typically Time-Delay Estimates (TDE) computed from microphone array signals, this thesis proposes the use of level- and time/phase-based spatial cues, motivated by the spatial cues utilised in recently standardised Spatial Audio Coding (SAC) paradigms. In experiments implemented using leading and standardised SAC techniques, the proposed SAC spatial cues were compared with TDE; the combination of TDE with SAC level-based cues proved the most accurate for speech segmentation.
As meeting recordings predominantly contain speech content, front-end Linear Prediction (LP) analysis using theoretical and standardised speech coders is then investigated with single-channel and multichannel LP models. Whilst existing approaches estimate TDE from the Hilbert envelope of single-channel LP speech residuals, this thesis proposes the use of intra- and inter-channel multichannel prediction; spatial cues estimated from the Hilbert envelope of LP residuals proved the most robust against reverberation.
Further experiments investigating the effect of microphone array characteristics found the microphone directivity pattern to significantly influence spatial cue estimation: the omnidirectional and cardioid polar responses optimally suit time/phase-based and level-based cues, respectively. Switching microphone patterns or employing mixed-pattern arrays is, however, impractical. This thesis proposes the use of the Ambisonic B-format steerable 'virtual microphone' to enable the same physical microphones to be used simultaneously for optimal capture of both time/phase-based and level-based cues. Further, results indicate that steering the virtual microphone in real time towards an active speaker, localised using sound intensity techniques, also improves meeting speech capture.
Thus, the work in this thesis contributes to practical spatial analysis of meeting speech through investigations of microphone array characteristics, spatial recording techniques, and algorithms for spatial cue estimation and speech processing as utilised in internationally standardised speech and audio coders.
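The abstract does not include code, but the TDE cues it refers to are commonly computed with generalised cross-correlation with phase transform (GCC-PHAT). A minimal sketch, assuming a simple two-channel setup with synthetic data (the function name and parameters are illustrative, not the thesis implementation):

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Time-delay estimate between two microphone channels via
    generalised cross-correlation with phase transform (GCC-PHAT)."""
    n = len(x1) + len(x2)                     # zero-pad to avoid wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs   # delay in seconds

fs = 16000
sig = np.random.randn(fs)
delayed = np.roll(sig, 40)                    # 40-sample (2.5 ms) delay
print(gcc_phat(delayed, sig, fs))             # ≈ +40 / fs
```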

In a reverberant enclosure, acoustic speech signals are degraded by reflections from walls, ceilings, and objects. Restoring speech quality and intelligibility from reverberated speech has received increasing interest over the past few years. Although multichannel dereverberation methods provide some improvement in speech quality and intelligibility, single-channel dereverberation remains an open challenge. Two types of advanced single-channel dereverberation methods, namely acoustic-domain spectral subtraction and modulation-domain filtering, provide only small improvements in speech quality and intelligibility. In this thesis, we study single-channel dereverberation algorithms.
Firstly, an upper bound on the performance of time-frequency masking (TFM) for dereverberation is obtained using ideal time-frequency masking (ITFM). ITFM has access to both the clean and the reverberated speech signals in estimating the binary-mask matrix. It implements binary masking in the short-time Fourier transform (STFT) domain, preserving only those spectral components less corrupted by reverberation. The experimental results show that single-channel ITFM outperforms four existing multichannel dereverberation methods, suggesting that large improvements could be obtained using TFM for speech dereverberation.
Secondly, a novel modulation-domain spectral subtraction method is proposed for dereverberation. This method estimates the modulation-domain long reverberation spectral variance (LRSV) from the time-domain LRSV using a statistical room impulse response (RIR) model and implements spectral subtraction in the modulation domain. On the one hand, unlike acoustic-domain spectral subtraction, our method operates in the modulation domain, which has been shown to play an important role in speech perception. On the other hand, unlike modulation-domain filtering, which uses a time-invariant filter, our method takes the changes of the reverberated speech spectral variance over time into account and implements spectral subtraction adaptively. Objective and informal subjective tests show that the proposed method outperforms two existing state-of-the-art single-channel dereverberation algorithms.
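As a rough illustration of the ITFM oracle described above, a sketch assuming paired clean/reverberated signals and an assumed 6 dB local-SNR threshold (not the thesis implementation):

```python
import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask(clean, reverb, fs, thresh_db=6.0):
    """Oracle TFM: keep only STFT cells where the clean component
    dominates the reverberant distortion by `thresh_db`."""
    _, _, C = stft(clean, fs, nperseg=512)
    _, _, R = stft(reverb, fs, nperseg=512)
    noise = R - C                               # residual reverberant energy
    local_snr = 20 * np.log10(np.abs(C) / (np.abs(noise) + 1e-12) + 1e-12)
    mask = (local_snr > thresh_db).astype(float)
    _, enhanced = istft(mask * R, fs, nperseg=512)
    return enhanced
```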

The desire for self-improvement is critical to human performance and learning outcomes. Paradoxically, however, being subjected to increased performance pressure can also result in "choking under pressure". No studies have experimentally examined the extent to which motivation impacts native speech processing. This dissertation manipulated performance pressure in listeners and systematically examined its impact across three speech-processing experiments. Sixty adult native English listeners and 45 non-native listeners with poorer English proficiency completed three speech-processing experiments twice: once to establish a baseline, and again to measure changes in performance. In these experiments using native English speech, listeners detected (illusory) sound changes, categorized phonemes under lexical interference, and recognized words in noise. After baseline testing, half of the participants in each language group were instructed to work, with a fictitious partner, towards a performance-contingent monetary reward; the other half, as controls, simply performed the tasks a second time. This study demonstrated a negative impact of performance pressure on native listeners in all experiments. Relative to the controls, the motivation group was more susceptible to illusions, failed to ignore lexical interference despite prior exposure, and recognized fewer words in cognitively demanding listening situations. Unexpectedly, relative to native listeners, non-native listeners perceived it as less important to perform well, and those in the high performance-pressure group requested a significantly greater amount of money for improvement. These language-group differences in task-related attitudes might be a confounding factor that moderates the effect of motivation. By illustrating a complex interaction among motivation, listener status, and performance-induced demands, this dissertation highlights the importance of motivation in speech science.
Advisors/Committee Members: Chandrasekaran, Bharath (advisor), Champlin, Craig A. (committee member), Henry, Maya L. (committee member), Peña, Elizabeth D. (committee member), Griffin, Zenzi M. (committee member).

This study investigates the potential use of Reconstructed Phase Space (RPS) based parameters for speech signal processing by utilizing nonlinear dynamical systems theory. In this approach, features are extracted from the time domain. The study of nonlinear dynamical systems shows that the RPS is able to capture nonlinear information about the underlying system that cannot be captured by frequency-domain analysis. A multimedia-based system is converted into a low-cost data acquisition system by adding a presampling antialiasing analog filter prior to the A/D converter of the sound card. A speech database of short vowels in Malayalam is created, and nonlinear invariant parameters of the vowel sounds are calculated; with these parameters one can quantify the chaotic behaviour of the speech signal. A Reconstructed Phase Space is generated for speech sounds by the method of time-delay embedding. From the reconstructed space, a unique parameter called the Reconstructed Phase Space Distribution Parameter (RPSDP) is extracted. These parameters are found to be similar for the same vowel and to differ from vowel to vowel, and they are further used in recognition experiments. A new method for pitch estimation using the Reconstructed Phase Space in two dimensions is presented; the proposed method does not suffer from the limitations of other short-term pitch estimation techniques. The problems of choosing the optimal time delay and the minimum embedding dimension for the reconstruction of phase space using the method of delays are addressed in this thesis, and a simple procedure that quantifies expansion from the identity line of the embedding space is developed for choosing a proper time delay.
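A minimal sketch of the delay-embedding step underlying the RPS (the delay and dimension values here are illustrative; the thesis derives suitable values from the data):

```python
import numpy as np

def delay_embed(x, dim=3, tau=8):
    """Reconstruct a phase space from a scalar time series by the method
    of delays: rows are [x[n], x[n+tau], ..., x[n+(dim-1)*tau]]."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

# Example: embed a synthetic vowel-like signal sampled at 16 kHz.
t = np.arange(2000) / 16000.0
x = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
rps = delay_embed(x, dim=2, tau=20)   # 2-D RPS, e.g. for pitch estimation
```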

Prajith, P. (2008). Investigations on the applications of dynamical instabilities and deterministic chaos for speech signal processing. (Thesis). University of Calicut. Retrieved from http://shodhganga.inflibnet.ac.in/handle/10603/3960

ENGLISH ABSTRACT: The importance of Language Identification for African languages is seeing a dramatic increase due to the development of telecommunication infrastructure and, as a result, an increase in volumes of data and speech traffic in public networks. By automatically processing the raw speech data, the vital assistance given to people in distress can be sped up by referring their calls to a person knowledgeable in that language.
To this effect, a speech corpus was developed and various algorithms were implemented and tested on raw telephone speech data. These algorithms entailed data preparation, signal processing, and statistical analysis aimed at discriminating between languages. Gaussian Mixture Models (GMMs) were chosen as the statistical model for this research due to their ability to represent an entire language with a single stochastic model that does not require phonetic transcription.
Language Identification for African languages using GMMs is feasible, although a few challenges, such as proper classification and an accurate study of the relationships between languages, need to be overcome. Other methods that make use of phonetically transcribed data need to be explored and tested with the new corpus for the research to be more rigorous.
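A toy illustration of GMM-based language identification in the spirit described (scikit-learn sketch; the feature matrices and language labels are placeholders, not the thesis corpus):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Train one GMM per language on acoustic feature frames (e.g. MFCCs),
# then label an utterance with the language whose model scores highest.
def train_models(features_by_lang, n_components=16):
    return {lang: GaussianMixture(n_components, covariance_type='diag',
                                  max_iter=200).fit(frames)
            for lang, frames in features_by_lang.items()}

def identify(models, utterance_frames):
    # score_samples gives per-frame log-likelihoods; sum over the utterance
    return max(models, key=lambda lang:
               models[lang].score_samples(utterance_frames).sum())

rng = np.random.default_rng(0)
data = {'isiZulu': rng.standard_normal((500, 13)),
        'Sesotho': rng.standard_normal((500, 13)) + 0.5}
models = train_models(data)
print(identify(models, rng.standard_normal((200, 13))))
```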

Degree: School of Computing; National Centre for Language Technology (NCLT); Research Institute for Networks and Communications Engineering (RINCE), Dublin City University, 2010

Many previous investigations have indicated that speech data has inherent low-dimensional structure and that it may be possible to efficiently represent speech using only a small number of parameters. This view is motivated by the fact that articulatory movement is limited by physiological constraints, so the speech production apparatus has only limited degrees of freedom; moreover, the set of sounds used in human spoken communication is only a small subset of all producible sounds. A number of dimensionality reduction methods capable of discovering such underlying structure have previously been applied to speech. However, if speech lies on a manifold nonlinearly embedded in high-dimensional space, as has been proposed in the past, classic linear dimensionality reduction methods would be unable to discover this embedding. In this dissertation, a number of manifold learning (also referred to as nonlinear dimensionality reduction) methods are applied to speech to explore the possibility of underlying nonlinear manifold structure.
This dissertation describes a number of existing manifold learning methods and details their application to high-dimensional feature representations of speech data. Representations derived from the conventional magnitude spectrum and the less widely used phase spectrum are investigated. The manifold learning methods used in this study are locally linear embedding, Isomap, and Laplacian eigenmaps. The classic linear method, principal component analysis (PCA), is also applied to facilitate the comparison of linear and nonlinear methods. The resulting low-dimensional representations are analysed through visualisation, phone recognition, and speaker recognition experiments. The recognition experiments are used as a means of evaluating how much meaningful discriminatory information is contained in the low-dimensional representations produced by each method. These experiments also serve to display the potential value of these methods in speech processing applications.
The manifold learning methods are shown to be capable of producing meaningful low-dimensional representations of speech data, suggesting that speech has low-dimensional manifold structure. In general, these methods are found to outperform PCA in low dimensions, indicating that speech may lie on a manifold nonlinearly embedded in high-dimensional space. Phone classification experiments show that Isomap can offer improvements over standard features and PCA-transformed features. Investigation of magnitude and phase spectrum representations found both to have similar low-dimensional structure and confirmed that the phase spectrum contains useful information for phone discrimination. Results indicate that combining magnitude and phase spectrum information yields improvements in phone classification tasks. A method to combine magnitude and phase spectrum features for increased phone classification accuracy without large increases in feature dimensionality is also described.
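A brief sketch of the kind of linear-versus-nonlinear comparison described, using scikit-learn (the feature matrix is a random placeholder standing in for spectral frames):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding

X = np.random.randn(500, 129)     # stand-in for |STFT| frames of speech

embeddings = {
    'PCA':    PCA(n_components=3).fit_transform(X),
    'Isomap': Isomap(n_neighbors=12, n_components=3).fit_transform(X),
    'LLE':    LocallyLinearEmbedding(n_neighbors=12,
                                     n_components=3).fit_transform(X),
}
# Downstream, each 3-D embedding would feed a phone/speaker classifier.
```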
Advisors/Committee Members: McKenna, John, IRCSET.

The complexity of finding the relevant features for the classification of spoken letters is due to the phonetic similarities between letters and their high dimensionality. Spoken letter classification in the machine learning literature has often led to very convoluted algorithms to achieve successful classification. The success in this work can be found in the high classification rate as well as the relatively small amount of computation required between signal retrieval and feature selection. The relevant features spring from an analysis of the sequential properties of the vectors produced by a Fourier transform. The study mainly focuses on the classification of the fricative letters f and s, the nasal letters m and n, and the E-set (b, c, d, e, g, p, t, v, z), which are highly confusable, especially when transmitted over modern VoIP digital devices. Another feature of this research is that the dataset produced did not include noise-reducing signal processing, which is shown to produce equivalent and sometimes better results: all pops and static noises that appeared were kept as part of the sound files. This is in contrast to other research whose datasets were recorded with high-grade equipment and noise reduction algorithms. To classify the audio files, the random forest algorithm was used. This algorithm was successful because the features produced were largely separable in relatively few dimensions. Classification accuracies were in the 92-97% range, depending on the dataset.
Keywords: Audio Analysis, Feature Extraction, Feature Reduction, Spoken Word Recognition
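A compact sketch of the described pipeline, FFT-derived sequential features into a random forest (scikit-learn; the windowing choices and synthetic data are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def fft_features(signal, n_fft=512, hop=256, n_keep=40):
    """Stack low-frequency FFT magnitudes of consecutive frames so the
    classifier can see sequential spectral structure."""
    frames = [np.abs(np.fft.rfft(signal[i:i + n_fft]))[:n_keep]
              for i in range(0, len(signal) - n_fft, hop)]
    return np.concatenate(frames[:8])         # fixed-length feature vector

# One synthetic recording per label; y holds the spoken-letter labels.
letters = list('bcdegptvz')                   # the E-set
y = np.repeat(letters, 20)
X = np.vstack([fft_features(np.random.randn(8000)) for _ in y])
clf = RandomForestClassifier(n_estimators=300)
print(cross_val_score(clf, X, y, cv=5).mean())
```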
Advisors/Committee Members: Shanmugathasan Suthaharan (advisor).

Speech recognition systems have improved in robustness in recent years with respect to both speaker and acoustical variability. Nevertheless, it is still a challenge to deploy speech recognition systems in real-world applications that are exposed to diverse and significant levels of noise. Robustness and recognition accuracy are the essential criteria in determining the extent to which a speech recognition system can be deployed in real-world applications. This work involves the development of techniques and extensions to extract robust features from speech and achieve substantial performance in speech recognition.
In this work, the robustness issue is approached through front-end processing, in particular robust feature extraction. The author proposes a unified framework for robust features and presents a comprehensive evaluation of robustness in speech features. The framework addresses three distinct approaches: robust feature extraction, temporal information inclusion, and normalization strategies. The author discusses the issue of robust feature selection primarily in the spectral and cepstral context. Several enhancements and extensions are explored for the purpose of robustness, including a computationally efficient approach proposed for moment normalization. In addition, a simple back-end approach is incorporated to improve recognition performance in reverberant environments.
Speech features in this work are evaluated in three distinct environments that occur in real-world scenarios. The thesis also discusses the effect of noise on speech features and their parameters; the author establishes that statistical properties play an important role in mismatches. The significance of the research is strengthened by the evaluation of robust approaches in more than one scenario and by comparison with the performance of state-of-the-art features. The contributions and limitations of each robust feature in all three environments are highlighted. The novelty of the work lies in the diverse hostile environments in which speech features are evaluated for robustness. The author obtained recognition accuracy of more than 98.5% for channel distortion; recognition accuracy greater than 90.0% was also maintained for a reverberation time of 0.4 s and for additive babble noise at an SNR of 10 dB. The thesis delivers comprehensive research on robust speech features for speech recognition in hostile environments, supported by significant experimental results, together with several observations, recommendations, and relevant issues associated with robust speech features.
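The thesis proposes its own computationally efficient moment normalization; as background, a standard baseline from the same family is cepstral mean and variance normalisation (CMVN), sketched here (not the author's specific method):

```python
import numpy as np

def cmvn(cepstra):
    """Cepstral mean and variance normalisation: per-utterance, per-
    coefficient normalisation of the first two statistical moments,
    which suppresses channel (convolutive) distortion."""
    mu = cepstra.mean(axis=0)
    sigma = cepstra.std(axis=0) + 1e-12
    return (cepstra - mu) / sigma

# cepstra: (n_frames, n_coeffs) MFCC matrix for one utterance
normalized = cmvn(np.random.randn(300, 13))
```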

Toh, A. M. (2007). Robust speech features for speech recognition in hostile environments. (Doctoral Dissertation). University of Western Australia. Retrieved from http://repository.uwa.edu.au:80/R/?func=dbin-jump-full&object_id=10216&local_base=GEN01-INS01

This dissertation presents the development of sensorimotor primitives as a means of constructing a language-agnostic model of speech communication. Insights from major theories in speech science and linguistics are used to develop a conceptual framework for sensorimotor primitives in the context of control and information theory. Within this conceptual framework, sensorimotor primitives are defined as a system transformation that simplifies the interface to some high-dimensional and/or nonlinear system. In the context of feedback control, sensorimotor primitives take the form of a feedback transformation. In the context of communication, sensorimotor primitives are represented as a channel encoder and decoder pair. Using a high-fidelity simulation of articulatory speech synthesis, these realizations of sensorimotor primitives are applied, respectively, to feedback control of the articulators and to communication via the acoustic speech signal. Experimental results demonstrate the construction of a model of speech communication that is capable of transmitting and receiving information and of imitating simple utterances.
Advisors/Committee Members: Levinson, Stephen E (advisor), Levinson, Stephen E (Committee Chair), Hasegawa-Johnson, Mark (committee member), Shosted, Ryan K (committee member), Varshney, Lav R (committee member).

This dissertation examines native speaker perception and processing of the variability inherent to non-native speech, specifically Mandarin-accented English. In order to accomplish this, subjective ratings were first collected as a measure of perceived foreign accentedness. The influence of linguistic variables (both acoustic and lexical) was investigated with regard to the perception of gradient foreign accentedness. The results of the ratings study indicate that the perception of gradient accentedness is influenced by measures of acoustic distance (i.e., the magnitude of the difference between a given production and an average native production) as well as by properties of the lexical items themselves (e.g., neighborhood density and phonotactic probability). These ratings were then utilized to investigate how the gradient nature of accentedness influences lexical processing across behavioral, (visual world) eye-tracking, and electrophysiological methods. Additionally, it was possible to examine how self-reported listener experience with Chinese-accented speakers influences the processing of gradient accentedness. These studies provide converging and complementary evidence that processing of these tokens varies non-linearly along the accentedness continuum and by level of listener experience. The electrophysiological results indicate that the pattern of processing, including the allocation of perceptual and attentional resources, changes as a result of increased exposure to Chinese-accented English. The effect of experience is also seen in both reaction time and visual world eye-tracking data. The results of a cross-modal priming study indicate that degree of foreign accent modulates the strength with which lexical representations are primed and that listener experience with the accent in question mitigates this effect. Visual world eye-tracking presents similar results, showing that the time course of word recognition slows as accentedness increases, though the ability to decode the signal is enhanced for listeners with greater experience. Taken together, the results come to bear on our understanding of how gradient foreign-accented speech maps onto linguistic representations and how those representations may change and adapt over time to accommodate the variability inherent to foreign-accented speech.

In a paper published by Greenberg in 1998, it was reported that in conversational speech the phone deletion rate may be as high as 12%. On the other hand, Jurafsky reported in 2001 that phone deletions cannot be modeled well by traditional triphone training. These findings motivate us to model phone deletions explicitly in current ASR systems. In this thesis, phone deletions are modeled by adding skip arcs to the word models. In order to cope with the limitations of using whole-word models, context-dependent fragmented word models (CD-FWMs) are proposed. Our proposed method is evaluated on both a read speech (Wall Street Journal) and a conversational speech (SVitchboard) task. In the read speech evaluation, we obtained a word error rate reduction of about 11%. Although the improvement on conversational speech is modest, reasons are given and relevant analyses are carried out.
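A toy illustration of the skip-arc idea: a left-to-right HMM word model whose transition matrix gains arcs that bypass one state, allowing a phone to be deleted (all probabilities here are made up):

```python
import numpy as np

def left_to_right_transitions(n_states, p_stay=0.6, p_next=0.3, p_skip=0.1):
    """Transition matrix for a left-to-right HMM word model with skip
    arcs that allow one phone state to be deleted."""
    A = np.zeros((n_states, n_states))
    for s in range(n_states):
        A[s, s] = p_stay
        if s + 1 < n_states:
            A[s, s + 1] = p_next
        if s + 2 < n_states:
            A[s, s + 2] = p_skip              # the skip arc: delete state s+1
    A /= A.sum(axis=1, keepdims=True)         # renormalise edge rows
    return A

print(left_to_right_transitions(5).round(2))
```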

In triphone-based acoustic modeling, it is difficult to robustly model infrequent triphones due to their lack of training samples. Naive maximum-likelihood (ML) estimation of infrequent triphone models produces poor triphone models and eventually affects the overall performance of an automatic speech recognition (ASR) system. Among the different techniques proposed to solve the infrequent triphone problem, the most widely used method in current ASR systems is state tying, because of its effectiveness in reducing model size and achieving good recognition results. However, state tying inevitably introduces quantization errors, since triphones tied to the same state are not distinguishable in that state. This thesis addresses the problem by the use of distinct acoustic modeling, where every modeling unit has a unique model and a distinct acoustic score. The main contribution of this thesis is the formulation of the estimation of triphone models as an adaptation problem through our proposed distinct acoustic modeling framework, named eigentriphone modeling. The rationale behind eigentriphone modeling is that a basis is derived from the frequent triphones and each triphone is then modeled as a point in the space spanned by that basis. The eigenvectors in the basis represent the most important context-dependent characteristics among the triphones, so the infrequent triphones can be robustly modeled with few training samples. Furthermore, the proposed framework is very flexible and can be applied to other modeling units. Since grapheme-based modeling is useful in automatic speech recognition of under-resourced languages, we further apply our distinct acoustic modeling framework to estimate context-dependent grapheme models; we call this new method eigentrigrapheme modeling. Experimental evaluation of eigentriphone modeling was carried out on the Wall Street Journal word recognition task and the TIMIT phoneme recognition task. Experimental evaluation of eigentrigrapheme modeling was carried out on four official South African under-resourced languages. It is shown that distinct acoustic modeling using the proposed eigentriphone framework consistently performs better than conventional tied-state HMMs.
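A rough sketch of the eigen-basis idea (illustrative only: random supervectors stand in for triphone model parameters, and a simple projection replaces the weighted adaptation training used in the thesis):

```python
import numpy as np

# Supervectors (stacked Gaussian means) of well-trained, frequent triphones.
rng = np.random.default_rng(0)
frequent = rng.standard_normal((200, 39 * 3))     # 200 triphones, 3 states

# Derive an eigentriphone basis from the frequent models via PCA.
mean = frequent.mean(axis=0)
U, S, Vt = np.linalg.svd(frequent - mean, full_matrices=False)
basis = Vt[:10]                                    # top-10 eigentriphones

# An infrequent triphone is constrained to the span of the basis, so only
# 10 weights (not the full supervector) are estimated from its few samples.
rough_estimate = rng.standard_normal(39 * 3)       # noisy ML estimate
weights = basis @ (rough_estimate - mean)
robust_model = mean + weights @ basis
```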

Most technological problems associated with man-machine conversations have been solved and are well documented in both the contemporary technical and lay literature. The major remaining technological problem is the real-time conversion of a human's vocal utterances into a "written" phoneme sequence representing the information content of speech. The work presented here demonstrates a viable solution to the problem of real-time machine determination of the information content of human speech. The solution presented differs extensively from previous attempts to solve this problem, which used relatively crude template matching techniques.
The techniques reported here are not limited to a fixed vocabulary or a master's voice, as previous attempts have been. They are self-adapting to each individual's speech patterns regardless of dialect, nationality, or language. This system is capable of determining the information content present in human utterances with a greater degree of accuracy than a human listener can perceive speech, disregarding contextual information. Also, the system's perception of the human voice is not limited by noise masking to the extent exhibited by the human ear.
Operationally, the real-time, hybrid computer system first estimates the model parameters of the human speech generation mechanism from phonetic features extracted from the speech waveform. The system is then able to deterministically relate the estimates of the human's speech generation model parameters to the phonemes (the information content) present in the human's vocal utterance. The speech recognition process is a multi-level process directly analogous to the multi-level speech recognition method used by a human listener.
Advisors/Committee Members: Stone, S. A. (advisor).

This thesis describes the design and testing of time-varying inductors and capacitors for use in an electrical analogue of the human vocal tract. The inductors and capacitors were varied in accordance with an external control signal by varying the value of a resistor in a circuit which used operational amplifiers to simulate a variable impedance. The inductor actually tested is not a true inductor, since its voltage e and current i are related by the equation [formula omitted], where L(t) is an externally controlled time function. A device for which [formula omitted] will probably be adequate for use in a vocal tract analogue. A true inductor, for which [formula omitted], can be realized by making a change in the circuit tested. For the inductor tested, the maximum allowable input voltage and current are ±2 volts and ±2 mA, respectively. For the capacitor, the allowable ranges are ±4 volts and ±20 mA. The inductance and capacitance can be varied over a range of 250:1 with good linearity with respect to the external control voltage and audio frequency. The inductor's Q exceeds 50 and the capacitor's Q exceeds 200 for all frequencies between 200 Hz and 5 kHz.
A system for routing control signals from a digital computer to the vocal tract analogue has been devised. Each component in the analogue is to be serviced by the computer at discrete time intervals. Between computer service times, the value of each component is interpolated by the up-down counter and digital comparator interpolating system described in the thesis.
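The repository record omits the formulas above; for reference, the standard relations for a time-varying inductance, which match the distinction the abstract draws, are (a hedged reconstruction, not necessarily the thesis's exact equations):

```latex
% Simulated (non-true) time-varying inductor:
e(t) = L(t)\,\frac{di}{dt}
% True time-varying inductor (flux $\lambda = L(t)\,i$):
e(t) = \frac{d}{dt}\bigl[L(t)\,i(t)\bigr]
     = L(t)\,\frac{di}{dt} + i(t)\,\frac{dL}{dt}
```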

Wickwire, K. F. (1967). Design and testing of time-varying inductors and capacitors for an electrical speech synthesizer. (Thesis). University of British Columbia. Retrieved from http://hdl.handle.net/2429/36155

The ultimate goal of automatic speech recognition (ASR) research is to allow a computer to recognize speech in real time, with full accuracy, independent of vocabulary size, noise, speaker characteristics, or accent. Today, systems are trained to learn an individual speaker's voice and larger vocabularies statistically, but accuracy is not ideal. A small gap between actual speech and its acoustic representation in the statistical mapping causes Hidden Markov Model (HMM) methods to fail to match the acoustic speech signals and consequently leads to classification errors. These errors in the low-level recognition stage of ASR inevitably produce errors at the higher levels. It therefore seems that additional research ideas need to be incorporated within current speech recognition systems. This study seeks a new perspective on speech recognition. It incorporates a new approach for speech recognition, supports it with wider previous research, validates it with a lexicon of 533 words, and integrates it with a current speech recognition method to overcome the existing limitations. The study focuses on applying image processing to speech spectrogram images (SSIs). We thus develop a new writing system, which we call the Speech-Image Recogniser Code (SIR-CODE). The SIR-CODE refers to the transposition of the speech signal to an artificial domain (the SSI) that allows the classification of the speech signal into segments. The SIR-CODE allows the matching of all speech features (formants, power spectrum, duration, cues of articulation places, etc.) in one process. This was made possible by adding a Realization Layer (RL) on top of the traditional speech recognition layer (based on HMMs) to check all sequential phones of a word in a single-step matching process. The study shows that the method gives better recognition results than HMMs alone, leading to accurate and reliable ASR in noisy environments. Therefore, the addition of the RL for SSI matching is a highly promising solution to compensate for the failure of HMMs in low-level recognition. In addition, the same concept of employing SSIs can be used for whole sentences to reduce classification errors in HMM-based high-level recognition. The SIR-CODE bridges the gap between the theory and practice of phoneme recognition by matching SSI patterns at the word level. Thus, it can be adapted for dynamic time warping on the SIR-CODE segments, which can help to achieve ASR based on SSI matching alone.
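A minimal sketch of producing the kind of spectrogram image such an approach starts from (scipy; all parameters are illustrative, and this is not the SIR-CODE itself):

```python
import numpy as np
from scipy.signal import spectrogram

def speech_spectrogram_image(signal, fs=16000):
    """Log-magnitude spectrogram scaled to an 8-bit image, a simple
    stand-in for the SSIs that the SIR-CODE approach classifies."""
    f, t, S = spectrogram(signal, fs, nperseg=400, noverlap=240)
    log_S = 10 * np.log10(S + 1e-12)
    img = (log_S - log_S.min()) / (log_S.max() - log_S.min())
    return (img * 255).astype(np.uint8)       # rows: frequency, cols: time

ssi = speech_spectrogram_image(np.random.randn(16000))
```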

Speech intelligibility represents how comprehensible speech is; in some applications it is more important than speech quality. Single-channel speech intelligibility enhancement is much more difficult than multi-channel intelligibility enhancement. It has recently been reported that training-based single-channel speech intelligibility enhancement algorithms perform better than Signal-to-Noise Ratio (SNR) based algorithms. In this thesis, a training-based Deep Neural Network (DNN) is used to improve single-channel speech intelligibility. To increase the performance of the DNN, the Multi-Resolution Cochleagram (MRCG) feature set is used as the input to the DNN. MATLAB objective test results show that the MRCG-DNN approach is more robust than a Gaussian Mixture Model (GMM) approach. The MRCG-DNN also works better than other DNN training algorithms. Various conditions, such as different speakers, different noise conditions, and reverberation, were tested in the thesis.
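A toy sketch of the training-based mask-estimation setup (scikit-learn's MLP stands in for the thesis's DNN, and random arrays stand in for MRCG features and oracle mask targets):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# X: MRCG feature vectors per time-frequency unit; y: oracle mask values
# computed from paired clean/noisy training speech.
X_train = np.random.rand(5000, 256)
y_train = (np.random.rand(5000) > 0.5).astype(float)

dnn = MLPRegressor(hidden_layer_sizes=(512, 512), max_iter=50)
dnn.fit(X_train, y_train)

# At test time the predicted mask gates the noisy time-frequency units.
mask = np.clip(dnn.predict(np.random.rand(100, 256)), 0.0, 1.0)
```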

This thesis explores the spatiotemporal network dynamics underlying natural speech comprehension, as measured by electro- and magnetoencephalography (E/MEG). I focus on the transient effects of incrementality and constraints in speech on access to lexical semantics. Through three E/MEG experiments I address two core issues in the systems neuroscience of language: 1) What are the network dynamics underpinning the cognitive computations that take place when we map sounds to rich semantic representations? 2) How do prior semantic and syntactic contextual constraints facilitate this mapping?
Experiment 1 investigated the cognitive processes and relevant networks that come online prior to a word's recognition point (e.g. "f" for butterfly) as we access meaning through speech in isolation. The results revealed that 300 ms before the word is recognised, the speech input incrementally activated matching phonological and semantic representations, resulting in transient competition. This competition recruited LIFG and modality-specific regions (LSMG and LSTG for the phonological domain; LAG and MTG for the semantic domain). Immediately after the word's recognition point, the semantic representation of the target concept was boosted and rapidly accessed, recruiting bilateral MTG and AG.
Experiment 2 explored the cortical networks underpinning contextual semantic processing in speech. Participants listened to two-word spoken phrases in which the semantic constraint provided by the modifier was manipulated. To separate cognitive networks that are modulated by semantic constraint from task-positive networks, I performed a temporal independent component analysis. Among the 14 networks extracted, only the activity of bilateral AG was modulated by semantic constraint, from 400 to 300 ms before the noun's recognition point.
Experiment 3 addressed the influence of sentential syntactic constraint on the anticipation and activation of upcoming syntactic frames in speech. Participants listened to sentences with local syntactic ambiguities. The analysis of the connectivity dynamics in the left frontotemporal syntax network showed that processing of sentences containing the less anticipated syntactic structure exhibited early increased feedforward information flow at 0-100 ms, followed by increased recurrent connectivity between LIFG and LpMTG at 200-500 ms from verb onset.
Altogether, the three experiments reveal novel insights into transient cognitive networks recruited incrementally over time, both with and without context, as the speech unfolds, and into how the activation of these networks is modulated by contextual syntactic and semantic constraints. Further, I provide neural evidence that contextual constraints serve to facilitate speech comprehension, and show how the speech networks recover from failed anticipations.

Kocagoncu, E. (2017). Dynamic speech networks in the brain: Dual contribution of incrementality and constraints in access to semantics. (Thesis). University of Cambridge. Retrieved from https://www.repository.cam.ac.uk/handle/1810/270309

Speakers of a language need complex linguistic representations for speaking, often at the level of non-literal, idiomatic expressions like "black sheep". Typically, datasets of these so-called multiword expressions come from hand-crafted ontologies or lexicons, because identifying such expressions in an unsupervised manner is still an unsolved problem in natural language processing. In this thesis I demonstrate that prosodic features, which are helpful in parsing syntax and interpreting meaning, can also be used to identify multiword expressions. To do this, I extracted noun phrases from the Buckeye corpus, which contains spontaneous spoken language, and matched these noun phrases to page titles in Wikipedia, a massive, freely available encyclopedic ontology of entities and phenomena. Incorporating prosodic features into a model that distinguishes between multiword expressions that are found in Wikipedia titles and those that are not yields increases in classifier performance, suggesting that prosodic cues can help with the automatic extraction of multiword expressions from spontaneous speech and can help models, and potentially listeners, decide whether something is "a thing" or not.
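As a rough sketch of the classification setup described (illustrative prosodic feature names and random data; not the thesis's actual features or model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# One row per noun phrase: illustrative prosodic features (duration,
# mean F0, F0 range); label 1 if the phrase matched a Wikipedia title.
X = rng.random((400, 3))
y = rng.integers(0, 2, size=400)

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X[:5])[:, 1])   # P(phrase is "a thing")
```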
Advisors/Committee Members: Fleck, Margaret (advisor).

Jacobs, C. L. (2016). Knowing a thing is "a thing": The use of acoustic features in multiword expression extraction. (Thesis). University of Illinois – Urbana-Champaign. Retrieved from http://hdl.handle.net/2142/92965

Understanding and describing human emotional state is important for many applications, such as interactive human-computer interface design and clinical diagnosis tools. Speech-based emotion prediction is generally viewed as a regression problem, where speech waveforms are labelled in terms of affective attributes such as arousal and valence, with numerical values indicating the short-term emotion intensity. Current research on continuous emotion prediction has primarily focused on improving the back end, developing novel features, or improving feature selection techniques. However, emotion expression and perception are in general heterogeneous across individuals, depending on a wide range of factors such as cultural background and speaker gender. The impact of these sources of variation on continuous emotion prediction systems has not yet been fully explored and is the focus of this thesis. Speaker variability, i.e., differences in emotion expression among speakers, has been shown to be one of the most confounding factors in categorical emotion recognition systems, but there is limited literature analysing its effect on continuous emotion prediction systems. In this thesis, a probabilistic framework is proposed to quantify speaker variability in continuous emotion systems in both the feature and the model domains. Furthermore, three compensation techniques for speaker variability are developed, and in-depth analyses in both the feature and model spaces are carried out. Another confounding factor is inter-rater variability, i.e., differences in emotion perception among raters, which is ignored in current approaches that take the average rating across multiple raters as the 'true' representation of the emotion state. However, differences in perception among raters suggest that prediction certainty varies with time. A novel approach for the prediction of emotion uncertainty is proposed and implemented by including the inter-rater variability as a representation of the uncertainty information in a probabilistic model. In addition, Kalman filters are incorporated into this framework to take into account the temporal dependencies of the emotion uncertainty, as well as to provide the flexibility to relax the Gaussianity assumption on the emotion distribution that reflects the uncertainty. The proposed frameworks and methods have been extensively evaluated on multiple state-of-the-art databases, and the results have demonstrated the potential of the proposed solutions.
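A minimal sketch of the Kalman-filtering idea applied to a continuous rating track, carrying an explicit variance as uncertainty (a 1-D random-walk model with made-up noise parameters; not the thesis's full framework):

```python
import numpy as np

def kalman_1d(observations, q=0.01, r=0.1):
    """Scalar Kalman filter: track a slowly varying emotion attribute
    (e.g. arousal) and carry an explicit variance as its uncertainty."""
    x, p = 0.0, 1.0                       # state estimate and variance
    means, variances = [], []
    for z in observations:
        p += q                            # predict: random-walk dynamics
        k = p / (p + r)                   # Kalman gain
        x += k * (z - x)                  # update with the noisy rating
        p *= (1 - k)
        means.append(x); variances.append(p)
    return np.array(means), np.array(variances)

ratings = np.cumsum(np.random.randn(100) * 0.05)   # synthetic rater track
mu, var = kalman_1d(ratings)
```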
Advisors/Committee Members: Sethu, Vidhyasaharan, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW, Ambikairaja, Eliathamby, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW.

Dang, T. (2018). Speech based Continuous Emotion Prediction: An investigation of Speaker Variability and Emotion Uncertainty. (Doctoral Dissertation). University of New South Wales. Retrieved from http://handle.unsw.edu.au/1959.4/60161 ; https://unsworks.unsw.edu.au/fapi/datastream/unsworks:51221/SOURCE02?view=true
