Confused about Linear Predictive Coding (LPC)

Hi, I'm taking a multimedia systems course and I'm preparing for my exam on Tuesday. I'm trying to get my head around LPC compression at a general level, but I'm having trouble with what's going on in the linear predictive filter part. This is my understanding so far:

LPC works by digitising the analog signal and splitting it into segments. For each segment we determine the key features of the signal and try to encode these as accurately as possible. The key features are the pitch of the signal (i.e. the fundamental frequency), the loudness of the signal, and whether the sound is voiced or unvoiced. Parameters called vocal tract excitation parameters are also determined, which are used in the vocal tract model to better model the state of the vocal tract that generated the sound. This data is passed over the network and decoded at the receiver. The pitch of the signal is used as input to either a voiced or unvoiced synthesiser, and the loudness data is used to boost the amplitude of the resulting signal. Finally the vocal tract model filters this sound by applying the LPC coefficients which were sent over the network.
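To check my mental model, here is roughly how I picture the decoder working, written as a Python sketch (the function name and parameters are my own invention, not from my notes):

```python
import random

# Rough sketch of an LPC decoder frame, as I understand the pipeline.
# coeffs are the p LPC coefficients sent over the network; state holds
# the last p output samples.
def decode_frame(voiced, pitch_period, gain, coeffs, n_samples, state):
    out = []
    for n in range(n_samples):
        # Excitation: periodic pulses for voiced sounds, noise for unvoiced.
        if voiced:
            e = 1.0 if n % pitch_period == 0 else 0.0
        else:
            e = random.uniform(-1.0, 1.0)
        # Vocal tract model: all-pole filter, i.e. the new sample is a
        # linear combination of the previous p samples plus the
        # (gain-scaled) excitation term.
        s = gain * e + sum(a * x for a, x in zip(coeffs, reversed(state)))
        out.append(s)
        state = state[1:] + [s]  # slide the window of past samples
    return out, state
```

This is only my own way of organising the pieces described above; real codecs differ in how the excitation and gain are applied per frame.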

In my notes it says that the vocal tract model uses a linear predictive filter and that the nth sample is a linear combination of the previous p samples plus an error term, which comes from the synthesiser.

Does this mean that we keep a running average of the last p samples at both the encoder and decoder, so that at the encoder we only transmit data that corresponds to the difference between this average and the actual signal?

Why is it a linear combination of these previous samples? My understanding is that we extract the loudness, frequency and voiced/unvoiced nature of the sound and then generate the vocal tract excitation parameters by choosing them so that the difference between the actual signal and the predicted signal is as small as possible. Surely an AVERAGE of these previous samples would be a better indication of the next sample?

If there are any holes in my understanding, it would be great if you could point them out. Thanks in advance!

The model says that the new sample is strictly a linear combination, which means that we end up storing the coefficients themselves in order to generate all of the data.

It also says it analyzes the intensity and frequency of a particular filtered signal. I'm not sure, but this sounds a lot like what is done in, say, a projection onto a Fourier series basis.

The idea is that if you are given a signal and project it onto a Fourier basis, you get the individual intensities for each frequency, and then you can do whatever you want with that information. You can reconstruct the signal using a linear combination of the waveforms with the associated frequencies.
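To make that concrete, here is a minimal sketch of that projection/reconstruction round trip using a plain DFT (this illustrates the general analysis/synthesis idea, not LPC specifically):

```python
import math, cmath

# Project an 8-sample sine wave onto a Fourier basis, then rebuild it
# as a linear combination of the basis waveforms.
N = 8
x = [math.sin(2 * math.pi * n / N) for n in range(N)]

# Projection: inner product of the signal with each complex exponential
# gives the intensity at that frequency.
X = [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
     for k in range(N)]

# Reconstruction: weighted sum of the basis waveforms recovers the signal.
xr = [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
      for n in range(N)]
```

The reconstruction matches the original to within floating-point rounding, and you could drop or quantize small coefficients in `X` before reconstructing, which is the "keep what you need, throw out the rest" step described below.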

I don't think they are doing this for the actual signal; instead they use that idea on a filtered signal, and then use a model that converts this information into parameters that capture what is important in speech production, throwing away the rest.

This is a common pattern in compression: you typically filter your signal or input by putting it through a 'black box' that splits your data into components, each independent of the others. Then, depending on the nature of the black box, you keep what you need, store it in the best way, and throw out the stuff that you don't need or can at least do without. When you restore everything, you use a 'reverse black box' and eventually generate the output you expect.

This is exactly the kind of thing that happens in video, audio, and image compression all the time, when you watch your DVDs and listen to your MP3s.

It says on the wiki page how to predict things based on minimizing an error term, and how to solve for the coefficient vector a given a matrix R and a vector r.

The expectation is taken so as to minimize the error, which leads to that whole equation (again, on the wiki page itself).

In terms of data transmission, it would make sense to send only a 'delta' value that lets you reconstruct the next 'sample' given the linear predictive data, since the whole idea is to compress the data not only for storage but for transmission as well. In memory, you would have your model as explained in the wiki, if it is using the exact same scheme (which I think it is). So you would solve the linear system, get the new data, and keep going for each new sample.
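As a sketch of what solving that system looks like, here is a tiny order-2 least-squares fit in plain Python (a real codec would use a higher prediction order and something like the Levinson-Durbin recursion; the variable names are mine):

```python
import math

# Fit the two LPC coefficients a1, a2 that best predict each sample
# from the two before it, by minimizing
#   sum_n (s[n] - a1*s[n-1] - a2*s[n-2])^2
# over a period-8 sine wave.
N = 64
s = [math.sin(2 * math.pi * n / 8) for n in range(N)]

# Build the 2x2 normal equations G a = b from inner products of
# the lagged signals.
g11 = sum(s[n - 1] * s[n - 1] for n in range(2, N))
g12 = sum(s[n - 1] * s[n - 2] for n in range(2, N))
g22 = sum(s[n - 2] * s[n - 2] for n in range(2, N))
b1 = sum(s[n] * s[n - 1] for n in range(2, N))
b2 = sum(s[n] * s[n - 2] for n in range(2, N))

# Solve the 2x2 system by Cramer's rule.
det = g11 * g22 - g12 * g12
a1 = (b1 * g22 - g12 * b2) / det
a2 = (g11 * b2 - b1 * g12) / det
# For a pure sinusoid the fit is exact: a1 = 2*cos(2*pi/8), a2 = -1.
```

This recovers the same "magic numbers" (about 1.414 and -1) used in the worked example further down the thread.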

Once you have generated a waveform corresponding to the speech, you could multiply specific waveforms by specific amplitudes, but from the sounds of it I think what actually happens is that a normalized waveform is generated and then the whole thing is scaled (multiplied) by an amplitude constant corresponding to how loud the signal/voice was.

I'd encourage you to double-check if you need to, but based on those wiki pages, and also on how things like Fourier series analysis are used, that is my educated guess at what is going on.

Do you need to understand this mathematically or not? If you do, you will need to know the transform used (probably something Fourier-series related), how the coefficients are transformed with respect to the vocal model, and also the quantization scheme used.

I think you are misinterpreting "the running average" as meaning literally the average of the previous samples. Actually you multiply the previous values by different numbers so that the pattern of the waveform continues.

As a simple example of how it works (with no explanation of where the "magic numbers" came from) suppose the LPC multiplies the two previous samples by 1.414 and -1, and you start with a signal of 0 and 0.707.
Then you get (to within rounding) the sequence 0, 0.707, 1, 0.707, 0, -0.707, -1, -0.707, and then back to 0 and 0.707 again: each new sample is 1.414 times the previous sample minus the one before that.

From the starting values you are generating a repeating "sine wave" with a period of 8 samples.
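The example above is easy to run yourself; this little loop just applies the two multipliers repeatedly:

```python
# Two-tap linear predictor from the example: each new sample is
# 1.414 times the previous sample minus the one before that.
a1, a2 = 1.414, -1.0
s = [0.0, 0.707]  # starting values
for _ in range(8):
    s.append(a1 * s[-1] + a2 * s[-2])
# s now traces out (approximately) one full period of a sine wave:
# 0, 0.707, 1, 0.707, 0, -0.707, -1, -0.707, 0, 0.707
```

(1.414 is approximately 2*cos(2*pi/8), which is why the period comes out as 8 samples.)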

In practice you work out the "best" set of multipliers so that the "predicted" waveform is a good approximation to the actual one. If you use more than 2 multipliers, you can generate a wave that is a combination of several frequencies. You can also make the amplitude grow or decay with time.