I really hope someone can help me with this question.

I'm trying to implement a Hidden Markov Model (based on this paper: Here).

I understand the process, but I do not understand what M would represent in the data I am trying to train the HMM with.

I am given this example:

"N = the number of hidden states
M = the number of distinct observation symbols
T = the number of observations
So, for the English text example, if you let N = 2, M = 27 (26 letters
plus word-space), and T = 50,000 (number of input letters to use), you
should see that the 2 hidden states correspond to consonants and
vowels."

This example works for English text, and I understand it. BUT I am attempting to train the HMM with the MFCC coefficients of a file (stop.mfc), which contains 4k+ values. My interpretation would be: T = 4000 (the size of the observation sequence) and N = 2 ("Stop" and "Go"). So what would M represent in my example, which is differentiating between someone saying "Stop" or "Go"? Would M be the number of training samples I have?

M represents an "alphabet" of vocal symbols that you are going to identify using characteristics of speech, including the MFCC coefficient values along with various energy measurements. See: google.com/…
– user2718, Feb 19 '13 at 18:31

That is a good place to start, and maybe it would suffice as a first-order approximation of symbols. If you review the referenced document, a robust set of symbols needs more information than the 13 MFCC coefficients alone.
– user2718, Feb 19 '13 at 18:56

@BruceZenone Thank you :)! I'll try training with 13; if that isn't enough, I will attempt to use more. But thank you!
– Phorce, Feb 19 '13 at 19:14

My previous answer is in error. It isn't the number of MFCC values per block that determines the number of vocal symbols; it is the range of possible MFCC values that matters. You have to map the MFCC values, along with other information, to a finite set of "symbols". Identifying how to do such a mapping is quite complex.
– user2718, Feb 19 '13 at 20:01

1 Answer

Firstly, having two words to identify does not mean you need $N = 2$ states. Your goal is not to train a single model with two states, one per word, but to train two models, one for each word to recognize, and each of these models will have as many states as necessary. In fact, each state in your HMM should correspond to a distinct "stage" in the pronunciation of a word, and will very likely correspond to a phoneme. Your vocabulary size (here two: "stop" and "go") is external to this. "Stop" has 4 phonemes and "go" has 2 phonemes, so you train a 4-state left-to-right model on the "stop" data and, independently of this, a 2-state left-to-right model on the "go" data. To recognize a word given its MFCCs, you evaluate which of these two models has the highest likelihood given the data. If you had to recognize words within a lexicon of 10 words, you would similarly train 10 HMMs, one for each word, each of these models having a number of states suited to the length/complexity of the word to recognize.
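As a sketch of the recognition step described above: each candidate word gets its own Gaussian-emission HMM, and the recognizer picks the model with the highest forward log-likelihood. All parameters below are invented for illustration (and 2-dimensional toy "features" stand in for 13-dimensional MFCCs); in practice the parameters come from training on the per-word data.

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    # Log-density of a diagonal-covariance Gaussian (one feature frame, one state).
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def log_likelihood(X, pi, A, means, variances):
    # Forward algorithm in the log domain for a Gaussian-emission HMM.
    # X: (T, D) frames; pi: (N,) initial probs; A: (N, N) transitions;
    # means, variances: (N, D) per-state Gaussian parameters.
    N = len(pi)
    with np.errstate(divide="ignore"):            # log(0) -> -inf is fine here
        logA, logpi = np.log(A), np.log(pi)
    alpha = logpi + np.array([gaussian_logpdf(X[0], means[j], variances[j]) for j in range(N)])
    for x in X[1:]:
        logb = np.array([gaussian_logpdf(x, means[j], variances[j]) for j in range(N)])
        alpha = logb + np.array([np.logaddexp.reduce(alpha + logA[:, j]) for j in range(N)])
    return np.logaddexp.reduce(alpha)

# Three observed frames; the "go" model's states sit near them, the "stop"
# model's states are placed far away, so "go" should win.
X = np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 0.9]])

models = {
    "go":   dict(pi=np.array([1.0, 0.0]),
                 A=np.array([[0.5, 0.5], [0.0, 1.0]]),   # 2-state left-to-right
                 means=np.array([[0.0, 0.0], [1.0, 1.0]]),
                 variances=np.ones((2, 2))),
    "stop": dict(pi=np.array([1.0, 0.0]),
                 A=np.array([[0.5, 0.5], [0.0, 1.0]]),
                 means=np.array([[10.0, 10.0], [12.0, 12.0]]),
                 variances=np.ones((2, 2))),
}

best = max(models, key=lambda w: log_likelihood(X, **models[w]))   # -> "go"
```

(A real "stop" model would have 4 states, one per phoneme; 2 states are used here just to keep the toy parameters short.)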

You need to step back and ask yourself "why HMMs in the first place?". We need HMMs for speech recognition because words are made of a sequence of distinct elements (phonemes). If we want to describe/recognize the word "stop", we need to learn a description which is expressive enough to capture that "first it sounds like ssss for a short while, then tttt for a short while, then oooo for a longer amount of time, then pppp for a short moment". HMMs are a good match for expressing that: states are phonemes; the transition matrix (which will here be diagonal + upper diagonal) indicates that we move through the word from first phoneme to last phoneme, staying a variable amount of time in each phoneme; and the distribution associated with each state indicates how each phoneme translates into your acoustic features.
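The "diagonal + upper diagonal" structure can be made concrete; the probabilities below are invented for illustration:

```python
import numpy as np

# A 4-state left-to-right transition matrix for "stop" (states ~ phonemes s, t, o, p).
# Each state either self-loops (we stay in the current phoneme for another frame)
# or advances to the next phoneme; we can never move backwards.
A = np.array([
    [0.6, 0.4, 0.0, 0.0],   # s: stay in s, or move on to t
    [0.0, 0.6, 0.4, 0.0],   # t: stay in t, or move on to o
    [0.0, 0.0, 0.6, 0.4],   # o: stay in o, or move on to p
    [0.0, 0.0, 0.0, 1.0],   # p: final state, absorbs until the word ends
])

# Each row is a probability distribution, and only the diagonal and the
# first upper diagonal are non-zero.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.count_nonzero(np.tril(A, -1)) == 0
```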

It also seems that you are mixing up discrete HMMs (in which the observations are drawn from a discrete distribution associated with each state) with continuous HMMs (in which the observations are scalars or vectors characterized by a continuous distribution such as a Gaussian). So the parameter $M$, the number of distinct observation symbols, is irrelevant in your case, since your observations are 13-dimensional vectors, an uncountable set! ($M$ would be... the cardinality of the continuum.)
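The distinction fits in a few lines (numbers made up for illustration): in a discrete HMM, each state's emission model is a row of $M$ probabilities over a finite alphabet; in a continuous HMM, it is a density evaluated at a real-valued vector, and $M$ disappears.

```python
import numpy as np

# Discrete HMM: emission model is a lookup table, one row of M probabilities per state.
M = 3
B = np.array([[0.7, 0.2, 0.1],    # state 0 over symbols {0, 1, 2}
              [0.1, 0.3, 0.6]])   # state 1
p = B[0, 2]                       # P(observe symbol 2 | state 0) = 0.1

# Continuous HMM: emission model is a density over R^13, e.g. a
# diagonal-covariance Gaussian evaluated at one MFCC frame.
def diag_gauss_logpdf(x, mean, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

frame = np.zeros(13)              # one 13-dimensional MFCC vector
logp = diag_gauss_logpdf(frame, mean=np.zeros(13), var=np.ones(13))
```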

I am afraid the introductory material you have picked is not directly relevant to speech recognition, though it is useful for applications in which HMMs are used to recover hidden structure from discrete observations (and there are many of them, for example parsing/tagging in NLP). Try to master this material without thinking too much about your speech recognition problem, and then move on to material about continuous HMMs with multivariate normal distributions, and finally to continuous HMMs with mixtures of multivariate normal distributions (since this is what is likely to work best for speech).

Thank you for your long reply, I will definitely look into this! In the paper I gave, the author gives an example (example 3) showing how this material can be used to solve a task related to speech (identifying whether someone is saying "Yes" or "No"). My thought process is this: if I can "estimate the HMM", which given the observations will produce a vector of probabilities, I can then compare the given signal (after being run through the HMM) to the training file and then, using a "scoring algorithm", find the best possible match for the sequence. Does this look accurate?
– Phorce, Feb 19 '13 at 19:40

I don't necessarily want to go into more complex versions of Hidden Markov Models (I can learn them at a later date). Because of the time constraints that I have, it might be impossible to understand and implement continuous HMMs, but it is definitely something I will consider after my research is completed.
– Phorce, Feb 19 '13 at 19:42

What the author has written in the paper, that training is "Problem 3" and recognition is "Problem 1" (this applies to speech but also handwriting, gesture recognition...), is true; but everything else in the document is about discrete HMMs, in which the data you process takes the form of discrete symbols (because this is how they are input, or because the data has been quantized). Isn't it obvious to you that the data at the input of your recognition/training process is not discrete symbols from an alphabet, but vectors of reals? This is why the recipes in this document cannot be used as is.
– pichenettes, Feb 19 '13 at 20:03


I am not trying to sell you a "more complicated" version of HMMs. What you need to process are vectors of numbers (13-dimensional MFCC vectors), hence you need a tool to process vectors of numbers, and this tool is continuous HMMs. Not discrete HMMs.
– pichenettes, Feb 19 '13 at 20:04


I suggest you continue studying this paper and writing code, but solve a different problem, one more suitable for discrete HMMs (maybe ask a question here for ideas of good "toy problems" for learning discrete HMMs). If you don't have a solid understanding of discrete HMMs, you'll struggle when studying continuous HMMs anyway...
– pichenettes, Feb 19 '13 at 20:21