In this course you will learn about audio signal processing methodologies that are specific for music and of use in real applications. We focus on the spectral processing techniques of relevance for the description and transformation of sounds, developing the basic theoretical and practical knowledge with which to analyze, synthesize, transform and describe audio signals in the context of music applications.
The course is based on open software and content. The demonstrations and programming exercises are done using Python under Ubuntu, and the references and materials for the course come from open online repositories. We are also distributing with open licenses the software and materials developed for the course.

TB

Good lectures with a focus on practical applications. A good introduction to how signal processing can be used for musical analysis and, more specifically, to how to use the Essentia library.

FR

Dec 04, 2016

Rating: 5 out of 5 stars

Very well explained and organized course material. The classes are also very detailed and special emphasis is put on illustrating every concept with example plots.

From the lesson

Short-time Fourier transform

STFT equation; analysis window; FFT size and hop size; time-frequency compromise; inverse STFT. Demonstration of tools to compute the spectrogram of a sound and on how to analyze a sound using them. Implementation of the windowing of sounds using Python and presentation of the STFT functions from the sms-tools package, explaining how to use them.

Instructors

Xavier Serra

Full Professor

Prof Julius O Smith, III

Professor of Music and (by courtesy) Electrical Engineering

Transcript

Welcome back to the course on audio signal processing for music applications. Until now we have seen how to compute the spectrum of a signal, its DFT, but real sounds cannot be represented by a single spectrum. Sounds change in time and we need to capture this time variation. The short-time Fourier transform is our solution. Again, this lecture is divided into two parts, and this is the first one. We will first present and explain the short-time Fourier transform equation and then discuss what we call the analysis window. This is the short-time Fourier transform equation, basically a modified version of the DFT with a few but important differences. For example, the input to the equation, the input signal, is not just x of n but the multiplication of w, which is our analysis window, by a fragment of x of n. So here x has an argument that has n, our time index, but also a frame number and a hop size. l is the frame number, which acts as our time index, so we will be iterating over l. We will be stepping through time this way, and capital H is our hop size: how much we are going to hop from one time instant to the next. So, basically, x is going to be changing in time according to l and H, and at every time instant, at every l, it is going to be multiplied by this analysis window, w of n. The rest is the DFT. The only thing that changes is the input signal, and therefore the output is not a single spectrum but a sequence of spectra. This is X sub l, where the variable l is the frame number. That means that the output of the short-time Fourier transform is going to be a sequence of spectra, each one of the same size and having magnitudes and phases, but each one different because the input will be a different fragment of the sound, stepping through the sound in a progressive manner. We also want to emphasize the idea of zero-phase windowing that we have already talked about.
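The equation described above can be sketched in a few lines of Python. This is only a minimal illustration, not the sms-tools implementation: the function name `stft_sketch`, the test sinusoid and the parameter values are hypothetical, and as a simplification each frame starts at sample l*H instead of being centered on it. The windowed fragment is placed in the FFT buffer with its center at index 0, which is the zero-phase layout mentioned in the lecture.

```python
import numpy as np

def stft_sketch(x, w, N, H):
    """Return X[l, k]: one N-point spectrum per frame l (hypothetical helper)."""
    M = w.size
    if N < M:
        raise ValueError("FFT size N must be at least the window size M")
    hM1 = (M + 1) // 2                        # window center plus the samples after it
    hM2 = M // 2                              # samples before the window center
    n_frames = max(0, 1 + (x.size - M) // H)  # keep only frames that fit inside x
    X = np.zeros((n_frames, N), dtype=complex)
    for l in range(n_frames):
        xw = x[l * H : l * H + M] * w         # window times the fragment starting at l*H
        buf = np.zeros(N)
        buf[:hM1] = xw[hM2:]                  # zero-phase layout: window center at index 0
        buf[N - hM2:] = xw[:hM2]              # samples before the center wrap to the end
        X[l] = np.fft.fft(buf)                # the rest is the DFT
    return X

# Assumed test signal: a sinusoid sitting exactly on bin 10 of a 128-point DFT.
n = np.arange(1000)
x = np.cos(2 * np.pi * 10 * n / 128)
X = stft_sketch(x, np.hanning(65), N=128, H=16)
print(X.shape)                                # (number of frames, N)
peak_bins = np.abs(X[:, :65]).argmax(axis=1)  # strongest positive-frequency bin per frame
print(peak_bins[:5])
```

Each row of `X` is one spectrum of the sequence, so stepping over the first axis corresponds to iterating over the frame number l.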
From now on, we generally specify the time index to go from minus N over 2 to N over 2 minus 1. So it is always centered around zero and we do not have any phase changes. We don't have any kind of shifting in time, and therefore in the spectrum. The windowing is a way to step through the sound, as I mentioned. Here we can see a depiction of that, and if we use the analogy of image and video, we could relate a spectrum to a photograph, a static image, and the short-time Fourier transform to video, a time-varying image. In this picture we see the whole sound at the bottom and how we are basically stepping through it by windowing the sound with this analysis window, and therefore being able to get the whole sound as a sum of sound fragments. To better understand the effect of windowing a sound, let's look at an example of what happens when we window a real sinusoid and then compute its spectrum. We start from a real sinusoid, which we have already seen: a cosine with a frequency index, k sub zero, and an amplitude, A sub zero, which can be expressed as the sum of two complex sinusoids, one with a positive frequency and another with a negative frequency. Then we substitute this signal into the short-time Fourier transform equation and window it. We can go through the different steps: we first put x of n in the equation, then we substitute it by the sum of these two complex exponentials. Because of the linearity of the DFT, we can split this into two equal equations, each one having a complex exponential as the input signal, and the amplitudes can be moved outside. What we get is the sum of two DFTs of the window, each with a frequency-shifting operation. So here we see that the result is the spectrum of the window.
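Written out, the steps just described look roughly as follows. This is a sketch under a few assumptions that are not stated on the transcript itself: only a single frame is considered (frame number l = 0, so the hop plays no role), the window w[n] is real, and W[k] denotes the DFT of the window computed over the same centered time index.

```latex
\begin{align*}
x[n] &= A_0 \cos\!\left(2\pi k_0 n / N\right)
      = \frac{A_0}{2}\, e^{j 2\pi k_0 n / N} + \frac{A_0}{2}\, e^{-j 2\pi k_0 n / N} \\
X[k] &= \sum_{n=-N/2}^{N/2-1} w[n]\, x[n]\, e^{-j 2\pi k n / N} \\
     &= \sum_{n=-N/2}^{N/2-1} w[n] \left( \frac{A_0}{2}\, e^{j 2\pi k_0 n / N}
        + \frac{A_0}{2}\, e^{-j 2\pi k_0 n / N} \right) e^{-j 2\pi k n / N} \\
     &= \frac{A_0}{2} \sum_{n=-N/2}^{N/2-1} w[n]\, e^{-j 2\pi (k - k_0) n / N}
      + \frac{A_0}{2} \sum_{n=-N/2}^{N/2-1} w[n]\, e^{-j 2\pi (k + k_0) n / N} \\
     &= \frac{A_0}{2}\, W[k - k_0] + \frac{A_0}{2}\, W[k + k_0],
\qquad \text{where } W[k] = \sum_{n=-N/2}^{N/2-1} w[n]\, e^{-j 2\pi k n / N}.
\end{align*}
```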
It is, of course, frequency shifted by the frequency of the input signal and multiplied by half of the amplitude of the input signal, plus the other copy of the window transform at the other complex exponential frequency. One is at the negative frequency and the other at the positive frequency. So this is the result for this cosine: basically the transform of the window, shifted to the frequency of the input signal and multiplied by half the amplitude of the input signal. When we plot this, we can understand the windowing process a little bit better. On the top we have the window, and underneath is the windowed sinusoid that we have as our input signal. The transform of the window, which in this case is a Hanning window, is also shown at the top: a magnitude spectrum centered around zero and symmetric, with a given phase. Now if we take the DFT of the windowed sinusoid, what we see is basically the same shape as the window transform but at the frequency of the sinusoid: at the two frequencies of the sinusoid, the positive and the negative, and with the phase of the sinusoid too. So we have the two values for the two phases, with the anti-symmetry that this analysis results in. From this discussion we can realize the importance of the analysis window in the spectrum of a sinusoid, and thus of any sound. It's clear that we have to spend some time explaining windows. An analysis window is generally a real function that is symmetric around the origin. This is the simplest window, the rectangular window. Its time domain is nothing too particular, but its magnitude spectrum is much more interesting. In the time domain it just has a value of one for the duration of the window, in this case 64 samples, and the magnitude spectrum has a shape that we call a sinc, because the transform is a sinc function.
This spectrum could be described in many different ways, but we focus on two main aspects. One is what we call the main lobe, the peak at the center, and we will mainly be talking about the width of the main lobe. The other is the side lobes, which are the small lobes next to it, and there we basically focus on the level of the highest of these side lobes, so we will be talking about the highest side-lobe level. There are many windows used in audio signal processing, and this is the list of windows available in the scipy module of Python. We can go through them and see quite a variety of windows. Some of them we are not going to pay much attention to, but we will be talking, for example, about the Blackman window, the Hanning window, the Hamming window, the triangular window, etc. Some others are not used so much in audio. Each window can be distinguished from the others by measuring the main-lobe width and the side-lobe level, and each window offers a different compromise with respect to these two values. So let's show some of them. The first one is the rectangular window, and the equation shows how it is computed. Its spectrum is what we call a sinc function: the sine of pi k, where k is the frequency index, divided by another sine function. If we look at the plots, what is shown is the magnitude of the spectrum, the absolute value of W of k, so basically it is a sine function with a kind of weighting function applied to it, which results in this characteristic shape that we call the sinc function. Talking about how to describe it, we mentioned the width of the main lobe, and this is two bins, where two bins means two samples. Here we have to be careful, because this is measured when the DFT is the same size as the window.
So, if we take a DFT of the same size as the window, let's say ten samples, then the main lobe is going to be two bins wide. But generally, since we do zero padding, the number of bins is higher. This is because of the zero padding, which we normally do in order to better visualize the shapes. In fact, this shape has been generated with a lot of zero padding so that we get this smooth visualization, but strictly speaking the number of bins that we refer to is two. And the highest side-lobe level is minus 13.3 decibels, that is, the distance between the center peak and the first side lobe. Maybe the most popular window is the Hanning window, which is a raised cosine. The equation is 0.5 plus 0.5 times the cosine, so this raises the cosine, and it is just one cycle of that cosine. If we compute its spectrum, it can also be expressed as a sum of sinc functions. In fact, all these windows can be expressed in the time domain as sums of cosines and in the frequency domain as sums of sinc functions. In this case it is the sum, in the frequency domain, of three sinc functions. Again, the two values that characterize this shape are the width of the main lobe, which is four bins, so twice as much as for the rectangular window, and the side-lobe level, which is minus 31.5 decibels, so it is lower. So now the main lobe is wider and the side-lobe level is lower. The Hamming window is very similar to the Hanning, but with a small and significant difference: it is a raised cosine with a step at the sides. By having these small steps at the sides, we get a spectrum that maintains the same main-lobe width, so that's good, it doesn't get wider, but in exchange we get a much lower side-lobe level, minus 42.7 decibels, and this is, as we are going to see, an important thing. Ideally we would like to have the lowest side-lobe level and the narrowest possible main lobe, so this is a good window.
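The two measures discussed so far can be estimated numerically. The following is a rough sketch, not code from the course: the helper names `cosine_window` and `lobe_measures`, the window length of 64 and the zero-padding factor are assumptions chosen for illustration. The windows are built as sums of cosines over a time index centered around zero, and the main-lobe width is expressed in bins of a DFT of the same size as the window, even though the spectrum is computed with zero padding.

```python
import numpy as np

def cosine_window(coeffs, M):
    """w[n] = sum_i coeffs[i] * cos(2*pi*i*n/M) for n = -M/2 .. M/2 - 1 (M even)."""
    n = np.arange(-M // 2, M // 2)
    return sum(a * np.cos(2 * np.pi * i * n / M) for i, a in enumerate(coeffs))

def lobe_measures(w, pad=64):
    """Estimate main-lobe width (in window-size bins) and highest side-lobe level (dB)."""
    M = w.size
    mag = np.abs(np.fft.fft(w, M * pad))[: M * pad // 2]  # zero-padded magnitude spectrum
    i = 0
    while i + 1 < mag.size and mag[i + 1] < mag[i]:       # walk down to the first null
        i += 1
    width_bins = 2 * i / pad                              # both sides, undoing the padding
    side_db = 20 * np.log10(mag[i:].max() / mag[0])       # highest lobe beyond the null
    return width_bins, side_db

M = 64
windows = {
    "rectangular": cosine_window([1.0], M),
    "hanning": cosine_window([0.5, 0.5], M),
    "hamming": cosine_window([0.54, 0.46], M),
}
results = {name: lobe_measures(w) for name, w in windows.items()}
for name, (width, side) in results.items():
    print(f"{name:12s} main lobe ~{width:.2f} bins, highest side lobe ~{side:.1f} dB")
```

The printed values should land close to the figures quoted in the lecture; small differences are expected because the window is short and the spectrum is sampled on a finite grid.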
Of course, nothing comes for free, so the side lobes do not decrease as abruptly as they move away from the main lobe. The Blackman window is the sum of two sinusoids, and with that we accomplish a significant improvement in terms of the side-lobe level. We see the magnitude spectrum, in which the main lobe is wider, six bins, but the side-lobe level is lower, minus 58 decibels. That's good, because this starts to be quite a useful side-lobe level for many audio applications, and we'll come back to that. Finally, the last window I want to talk about is the Blackman-Harris window, which is a very special one, because you can basically say it has no side lobes. It is a sum of several cosines, in this case four cosines, with different coefficients in the sum. In the frequency domain, in the magnitude spectrum, the main lobe again gets wider, in this case eight bins, but the side-lobe level is minus 92 dB. If we think about it in terms of signal-to-noise ratio, which is a very important factor in digital signals, minus 92 decibels is basically at the noise floor of the 16-bit signals that we deal with. So that means that these side lobes, if we consider them as artifacts or noise, are not heard. For the other windows we could say that the side lobes are artifacts that can be heard. Anyway, again, we will come back to that. Now, to finish, let me just compare some of these windows being applied to the same sound. We start with a fragment of a sound of a certain length and we apply three different windows: the first one is the rectangular, the next one the Hamming, and finally the Blackman. Clearly, they give very distinct spectra. By looking at these, we can see that maybe the best one for this particular analysis is the Blackman. We see a smoother spectrum, we see that the peaks are much more clearly distinct, and in fact these peaks correspond to the harmonics of the sound.
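A comparison in the same spirit can be tried on a synthetic signal. The sketch below does not use the sound from the lecture: as a stand-in it assumes a fragment with one strong partial and one partial 60 dB weaker, both placed between bin centers, and the helper `prominence_db` and the bin ranges are hypothetical choices. It measures how far the weak partial rises above the leakage that the strong partial spreads between them.

```python
import numpy as np

M = 512
n = np.arange(M)
strong = 1.0 * np.cos(2 * np.pi * 50.5 * n / M)     # strong partial between bins 50 and 51
weak = 0.001 * np.cos(2 * np.pi * 120.5 * n / M)    # partial 60 dB weaker, near bin 120
x = strong + weak                                   # assumed stand-in for a sound fragment

windows = {
    "rectangular": np.ones(M),
    "hamming": np.hamming(M),
    "blackman": np.blackman(M),
}

def prominence_db(x, w):
    """Level of the weak partial relative to the leakage floor between the two partials."""
    mag_db = 20 * np.log10(np.abs(np.fft.rfft(x * w)) + 1e-12)
    peak = mag_db[119:123].max()          # bins around the weak partial
    floor = np.median(mag_db[80:100])     # bins between the partials, outside both main lobes
    return peak - floor

prom = {name: prominence_db(x, w) for name, w in windows.items()}
for name, p in prom.items():
    print(f"{name:12s} weak partial is {p:6.1f} dB relative to the leakage floor")
```

With the rectangular window the weak partial should be about level with the side lobes of the strong one, while with the Blackman window it should stand out clearly, which is the kind of difference the spectra in the lecture show.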
Okay, so this is all, and there are a lot of references for the topics I covered, especially about windows. In Wikipedia you can find quite a bit of information about the short-time Fourier transform and about windows. Julius, on his website and in his online book, discusses this quite a bit, so that's a very good reference. And these are the credits and references. So, this is all for the first part of the lecture on the short-time Fourier transform. We have explained the basic equation of the short-time Fourier transform, and we have focused on the analysis window. In the second part, we will continue with this topic. So, I will see you in the next class.