In the previous article, we used a filter to suppress noise components in a voice recording. This approach is rather ineffective. One problem is that you can’t suppress noise frequencies that are close to signal frequencies, especially if your system cannot realistically implement a very-high-order filter. Another concern is that the filtering action is not selective: it does not distinguish between signal and noise within a given frequency band, and consequently the characteristics of the voice are altered in undesirable ways.

In this article we’ll look at a different technique for audio noise reduction. It’s called spectrum subtraction, and it consists of the following steps:

Make a recording that contains only background noise (or extract a noise-only portion of the original recording).

Generate a noise spectrum by computing the FFT and saving the magnitude data.

(Additional steps, namely averaging and prewhitening, can be performed to make the noise spectrum a more effective means of reducing noise.)

Generate a reduced-noise magnitude spectrum of the audio file by subtracting the noise spectrum from the magnitude of the FFT of the original recording.

Insert phase information into the reduced-noise spectrum by duplicating the phase information from the FFT of the original recording.

Create the reduced-noise time-domain waveform via the inverse FFT.

If you’re a bit confused, don’t worry—these steps will be discussed in greater detail as we work our way through the Scilab commands.

Also, I’m going to warn you ahead of time that spectral subtraction didn’t work very well for me. It preserved the quality of the voice better than filtering did, and FFT plots confirm that it can remove narrowband noise components in a way that filtering cannot. However, the reduction in noise volume was, for some reason, almost unnoticeable. It’s possible that a more sophisticated implementation would produce better results, and perhaps I’ll write another article on spectrum subtraction if I find a way to make it more effective.

Generate the Noise Spectrum

I duplicated the audio setup used to create the original audio file, then I captured about 60 seconds of “silence.” This silence becomes hiss in the recording. As you will see later, you want the noise recording to be the same length as the voice recording. My voice recording is about 10 seconds long, so why did I record 60 seconds of noise? Well, this was my attempt to incorporate averaging into the noise spectrum:

So I split the 60-second recording into six 10-second segments. I then computed the magnitude spectrum for each one and averaged them all to create my final noise spectrum. Using an averaged noise spectrum did create a difference in the final recording, though I wouldn’t say that the overall quality was better.

The following plots show the general shape of my noise spectrum. I’m not going to convert the horizontal axis to frequency because we’re not really thinking in terms of real-life frequencies at this point. We’re treating the spectrum more like a normal array of data points.

Perform the Subtraction

This step is straightforward once you understand the following two things: First, we’re working only with FFT magnitudes here, not the complex numbers that are the initial result of the FFT computation. Second, we can’t have negative magnitude values in our noise spectrum. This means that you have to check for negative subtraction results and change all negative values to zero.

Restore the Phase

We now have a reduced-noise magnitude spectrum of the original recording. Unfortunately, we can’t convert the audio back to the time domain without phase. Where will we find phase information for this magnitude spectrum that we just artificially created by performing subtraction in the frequency domain? There’s only one place: the original noisy signal. So what we’re really doing here is preserving the Fourier phase of the original signal and modifying its Fourier magnitude. In many situations, preserving the original phase is not problematic because the noise is associated primarily with amplitude rather than phase.

The FFT of the original signal consists of complex numbers, and the final noise-reduced FFT must contain complex numbers so that it can be successfully returned to the time domain via the ifft() command. Thus, to insert phase information into the new spectrum, we have to first extract the phase from the real and imaginary parts of the original spectrum and then use the magnitude values in the new spectrum to generate a new array of complex numbers.

The following plots provide a great demonstration of the capabilities of spectrum subtraction. The first shows the 60 Hz noise spike (at value 600 on the horizontal axis) that is present in the original signal. In the second, the noise spike is gone and the nearby spectral components are not noticeably affected.

Return to the Time Domain

We now have a modified spectrum that is ready for conversion to a normal audio waveform. This is accomplished via the inverse FFT:

Audio_NoiseSubtracted = ifft(FFT_NoiseSubtracted);

Here are the original and (theoretically) improved audio files, so that you can form your own opinion of the success of spectrum subtraction:

Your browser does not support the audio element.

original recording

Your browser does not support the audio element.

recording after spectrum subtraction

Conclusion

As mentioned above, spectrum subtraction did not greatly reduce the volume of the background noise. When I listen carefully I think that I do notice an improvement of some kind, but it’s a subtle change and may be related primarily to the quality of the noise rather than the quantity. However, spectrum subtraction also introduced a disagreeable artifact that is especially noticeable at the beginning of the recording (before I start speaking). My voice sounds the same in both recordings, though, which is a major improvement over the filtering approach. If anyone has insights regarding how to make spectrum subtraction more effective, let us know in the comments.

Every implementation of spectral subtraction I’ve seen (except yours) uses short-time-DFT. Have you thought about breaking the audio into 20ms blocks and performing the spectral subtraction on that block? If your results sound OK, then try the next step of adding a non-rectangular window function, use overlapping blocks, and overlap-add for the final result.

Thanks for the comment. I didn’t think that a short-time DFT was needed in this case because the audio sample was short and the noise was uniform throughout the recording. Do you think that an STDFT would help even under these circumstances?

It seems to make a difference. Check out the linked mp3 file, in which the left channel is your original audio, and the right channel has had the STDFT spectral subtraction applied. Note that I didn’t use overlapping blocks or anything to reduce the ‘musical’ distortions. You’ll notice that the hiss is reduced, but some artifacts are added. https://nofile.io/f/XwkwaesnWMd/spectralSubtractionNoOverlap.mp3