Adding Sidetone to Skype

Description

Ever use a headset with Skype – and find yourself frustrated that it was too quiet? This article shows how to add the sound of your own voice to the headset, so you won't feel exhausted from shouting to be heard.

Skype: I find your lack of feedback disturbing.

This article will discuss how to build a software tool that makes it easier to talk into Skype with a headset.

Recently my family began experimenting with Skype, but we found that talking over Skype with headphones can be exhausting. After doing a little research, I learned that the problem was that we only heard the other party. You'd think that this would be a good thing,
but we have a social brain that works hard to gauge our behavior and adjust; without hearing ourselves, it has nothing to work with.

The answer is to feed a little bit of the microphone signal back into the earpiece, so that our brain knows how loud we're talking. This is called
sidetone in the telecommunications industry. Without it, we talk louder and louder until we're sure that we'll be heard.

How to use the tool

First, let's take a look at how to use the application. Once it is running, there will be a microphone icon in the lower right hand corner of the screen:

Figure 1: Icon in the system notification tray

Clicking on it opens the application's control window:

Figure 2: The application's controls

Let's look at the side tone controls:

The “Side Tones” check box enables or disables the sidetone. If you have headphones, you'll want this on; if you have loudspeakers, you'll want it off.

The slider controls the volume of the microphone in the headset. You can change it while talking. The volume will vary with microphone and headset.

You can do a “sound check” to see if the feedback is working by clicking on the Sound Check button, and adjusting the volume. I found that the volume setting that works best in a conversation is much, much lower than what works in a sound check.

Next, let's look at the AGC (“Automatic Gain Control”) section. When making a call, the software can automatically adjust how loud you sound to the other party:

When “Low Pass” is checked, the headphone feedback runs at 44100 samples/sec. The microphone sound is filtered to keep only the frequencies below 8 kHz, converted to 16000 samples/sec, and sent to Skype. When it is not checked, the headphone feedback runs at 16000
samples/sec, and the sound is sent to Skype without the low-pass filter.

When "Skype AutoGain" is checked, it signals to Skype that Skype can use its own algorithm. When clear, Skype is told not to apply any adjustments.

When "AutoGain" is checked, the custom Automatic Gain Control algorithm is used.

The Automatic Gain Control has three sliders:

The “Cutoff” slider controls the distinction between background noise and conversation. Sound below this level is cut off and silence is sent to Skype. The right setting will vary with the microphone – more sensitive (expensive) microphones will pick up more background noise
and work better with a higher setting.

The “Normal” slider controls the volume when you are talking normally. The Gain Control tries to raise the volume to this level.

The “Loud” slider controls the volume when you talk exceptionally loud. This rarely happens, but when you
do talk louder than the Normal level, the Gain Control tries to adjust the volume to this level.

How the Software Works

I originally started this project by creating an effect for the
Skype Voice Changer. My plan was to open a WaveStream, begin playing it on the headphones and copy the microphone stream to it.

I quickly found that this was not the way to add feedback. The underlying “WaveOut” system had huge latency: everything I heard in the headphones was at least a second behind what I was saying. This made it even harder to talk than before, and
I had to abandon the approach.

While researching the problem I found a DirectSound code sample that I could modify into doing what I wanted. (This initial prototype didn't coordinate with Skype – it attached to the microphone and copied the sounds to the output, at a lower volume. But
it was always on.) The sound in the headphones is still slightly behind the microphone, but the delay is barely perceptible, and we'll muffle it a bit more to make it less distinguishable.

From here on I shall describe the major – or technically interesting – components of the program. We'll look at:

DirectSound and Circular Buffers

Sample Window and Sizing the Buffers

Using WaitHandles to Synchronize with DirectSound

IIR Filters

Automatic Gain Control

Estimating loudness

Note: I won't be describing how to connect to Skype. Mark Heath's description is very good.

DirectSound and Circular Buffers

The DirectSound code is in AudioLoop.cs. The module sets DirectSound to capture sound from the default recording device at 16 bits/sample at either 16000 or 44100 samples per second. The capture and playback buffers are configured in “looping” mode to act
as circular buffers. The capture buffer eventually overwrites old samples, and we'll lose them if we don't read them fast enough; if we don't update the playback buffer, it will repeat the same sound over and over.

The StartMicrophone() procedure sets up the capture and playback buffers. Then it creates a thread to do the work. The thread is at a high priority so that if the OS has a choice between (say) email or processing sound, it does the sound.

The StopMicrophone() procedure stops the worker thread and cleans up the resources.

The software processes a fixed number of samples at a time, called the “sample window.” The buffers are several times the size of sample window, so that the system can keep capturing and playing while the software is processing them. The sound processing
loop is the heart of the application:
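The original listing is not reproduced here, so below is a simplified sketch of the loop's structure with the DirectSound reads and writes stubbed out. The class and method names, window size, and the number of preserved samples shown here are illustrative, not the actual code:

```csharp
using System;

// A simplified sketch of the processing loop in AudioLoop.cs. The real
// loop reads each window from the DirectSound capture buffer and writes
// the processed samples to the playback buffer; those calls are stubbed
// out here so the structure is visible.
static class AudioLoopSketch
{
    public const int WindowSize = 160;  // ~10 ms of sound at 16000 samples/sec
    public const int Keep = 10;         // samples preserved for the filters

    // The preserved tail of the previous window, followed by the current one.
    public static readonly short[] InBuffer = new short[Keep + WindowSize];

    public static short[] ProcessOneWindow(short[] captured)
    {
        // Copy the freshly captured window in after the preserved samples.
        Array.Copy(captured, 0, InBuffer, Keep, WindowSize);

        // ... the filtering and gain control would run over InBuffer here ...
        var outSamples = new short[WindowSize];
        Array.Copy(InBuffer, Keep, outSamples, 0, WindowSize);

        // The "for" loop at the bottom: keep the last 10 incoming samples
        // at the start of the buffer so the IIR filters see a continuous
        // signal across window boundaries.
        for (int i = 0; i < Keep; i++)
            InBuffer[i] = InBuffer[WindowSize + i];

        return outSamples;
    }
}
```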

Note the “for” loop at the bottom of the code. This preserves the last 10 incoming samples at the start of the buffer. This is needed to make the sound processing smooth, and will be discussed a bit later.

Sample Window and Sizing the Buffers

How big should the sample window be? This is a bit of a trade-off between responsiveness and design complexity.

I chose a window big enough to hold 10 milliseconds of sound. Since the ear is sensitive to delays of even 30 milliseconds, I cut the window down so that the delay wouldn't be perceptible. (When I tried a 50 millisecond window, my voice came out of the headphone
sounding like an echo... and I found myself talking slower and slower.) The sample window could be made smaller, but I am sure there is a point where the OS won't schedule the audio loop to wake and run frequently enough. And as the sample window gets smaller,
the processing may drop in quality, because it doesn't have enough samples to work with.

The capture buffer is 8 times the size of the sample window. This ratio is arbitrary, but I wanted the buffer to be about an order of magnitude larger. My rationale is that if the processing falls behind, the sound – for the Skype call – won't be dropped.
I feel that it is more important to preserve sound quality for the other party than to preserve the quality of feedback.

The playback buffer is four times the sample window. I wanted it small, so that if the processing fell behind, the repeated sound would seem to be a continuation of the current sound.

When writing the sound to the playback buffer, we have to track where in the buffer to put the samples. I tried using GetCurrentPosition() to find where to write next into the playback buffer; this created terrible sound. Instead, the software uses a
local variable to track where to write next.
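Sketched under assumed names, that bookkeeping is just an offset that wraps around the circular buffer:

```csharp
// A sketch of the write-cursor bookkeeping: instead of asking DirectSound
// via GetCurrentPosition(), keep a local offset and wrap it at the end of
// the circular playback buffer. The names here are illustrative.
static class PlaybackCursor
{
    public static int Advance(int writePos, int windowBytes, int bufferBytes)
    {
        // After writing one sample window, move forward and wrap around.
        return (writePos + windowBytes) % bufferBytes;
    }
}
```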

DirectSound and Notifications

How do we keep in sync with the sound capture – how do we know when a sample buffer is ready?

The application gives DirectSound a table of buffer indices and WaitHandles. When the capture buffer's write index reaches one of those indices, DirectSound signals the corresponding WaitHandle. The worker thread performs a WaitOne() on each of the WaitHandles,
one at a time. As a convenience, we use a specific kind of WaitHandle called AutoResetEvent, which sets itself back to the “wait” state once WaitOne() returns.
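A minimal sketch of this scheme, with a second thread standing in for DirectSound's signaling (the names and the stand-in producer are mine, not the article's):

```csharp
using System;
using System.Threading;

// One AutoResetEvent per capture window; the worker waits on each handle
// in turn, and each AutoResetEvent re-arms itself after WaitOne() returns.
static class NotifySketch
{
    public static int Run()
    {
        const int Windows = 8;
        var ready = new AutoResetEvent[Windows];
        for (int i = 0; i < Windows; i++)
            ready[i] = new AutoResetEvent(false);

        // Stand-in for DirectSound: signal each window's event in order.
        var producer = new Thread(() =>
        {
            for (int i = 0; i < Windows; i++)
                ready[i].Set();
        });
        producer.Start();

        int processed = 0;
        for (int i = 0; i < Windows; i++)
        {
            ready[i].WaitOne();   // blocks until window i is ready
            processed++;          // ... process sample window i here ...
        }
        producer.Join();
        return processed;
    }
}
```

Because each window has its own event, a signal is never lost even if the worker is momentarily behind: a Set() on an event no one is waiting on simply leaves it signaled, and the later WaitOne() returns immediately.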

If the thread has gotten behind, the WaitOne will return immediately, the loop processes the sample, and begins to catch up with the work.

We must use a separate AutoResetEvent for each of the 8 capture windows. An AutoResetEvent doesn't tell us whether it was signaled multiple times. If only one AutoResetEvent were used, the thread wouldn't know that two (or more) sample windows were ready. Instead,
it would process just one, falling further behind and adding latency. This would happen randomly over time, and be hard to test consistently.

IIR: Infinite Impulse Response Filters

This project came together so quickly, so easily – once I found the right approach – that I couldn't resist getting fancy. I added a low-pass filter to muffle the feedback a little. And I added automatic gain control, as an experimental option.

For both of these I used a filtering algorithm called “IIR” (this stands for Infinite Impulse Response – a confusing mouthful, so let's just call it IIR). An IIR filter is like a special-purpose virtual machine. Low-pass filters, high-pass filters, combinations
of those filters, and even equalizers can be specified, and well-known design techniques (much like a compiler) convert those specifications into an IIR implementation.

(You could, instead, “compile” the filters to be the resistor values to use in a hardware circuit.
That's programming in solder!)

The machine code for these IIR virtual machines is just two lists of coefficients, called A and B. The software emulator
looks like the following bit of code:
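The original listing isn't reproduced here; the textbook direct-form IIR loop below is a reasonable stand-in, with the A and B arrays playing the role of the “machine code”:

```csharp
// A stand-in for the emulator: the textbook direct-form IIR loop. The
// filter is entirely described by the two coefficient lists, A and B.
static class Iir
{
    public static double[] Filter(double[] a, double[] b, double[] input)
    {
        var output = new double[input.Length];
        for (int n = 0; n < input.Length; n++)
        {
            double acc = 0.0;
            for (int j = 0; j < b.Length && j <= n; j++)
                acc += b[j] * input[n - j];      // feed-forward (B) taps
            for (int j = 1; j < a.Length && j <= n; j++)
                acc -= a[j] * output[n - j];     // feedback (A) taps
            output[n] = acc / a[0];
        }
        return output;
    }
}
```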

IIRs are easy to implement, and take less CPU power than other methods. But sometimes they sound poor; if they sound too bad, you'll want to use a different technique. I found that the low-pass filters in this project work well for some microphones, and
add a slight crackle with others.

Example Low Pass Filter

For the low pass filter to create the muffling, I used a Butterworth filter, using the code below. It takes a buffer of signed, 16-bit samples, and then converts the 16-bit values into a byte array suitable for the sound buffer.
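The original listing isn't shown here; the sketch below is a reconstruction from the text's own description (rolling I_0..I_2 and O_1..O_2 variables, one combined coefficient array, and the inputs grouped because B0 equals B2 in a second-order Butterworth). The class name, the buffer layout parameters, and the coefficient values used in testing are illustrative, not the article's:

```csharp
using System;

// A reconstruction of the low-pass filter described in the text.
class LowPassSketch
{
    readonly double[] C;     // { B0, B1, A1, A2 }; B2 is omitted since B2 == B0
    double O_1, O_2;         // previous outputs, preserved across calls

    public LowPassSketch(double b0, double b1, double a1, double a2)
    {
        C = new[] { b0, b1, a1, a2 };
    }

    // inBuffer holds 'keep' preserved samples followed by the current
    // window; returns little-endian bytes ready for the sound buffer.
    public byte[] Filter(short[] inBuffer, int keep, int window)
    {
        var outBytes = new byte[window * 2];
        // I_1 and I_2 come from the samples the audio loop preserved at
        // the start of the buffer.
        double I_1 = inBuffer[keep - 1], I_2 = inBuffer[keep - 2];
        for (int n = 0; n < window; n++)
        {
            double I_0 = inBuffer[keep + n];
            // (I_0 + I_2) shares the single B0 coefficient, since B2 == B0.
            double O_0 = C[0] * (I_0 + I_2) + C[1] * I_1
                       - C[2] * O_1 - C[3] * O_2;
            I_2 = I_1; I_1 = I_0;
            O_2 = O_1; O_1 = O_0;
            var s = (short)Math.Max(short.MinValue, Math.Min(short.MaxValue, O_0));
            outBytes[2 * n] = (byte)(s & 0xFF);
            outBytes[2 * n + 1] = (byte)((s >> 8) & 0xFF);
        }
        return outBytes;
    }
}
```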

The filter code is a bit different than the example code in the previous section. Most of the differences are for speed.

This code doesn't use a buffer for the old values; instead, it uses separate variables for the elements of the buffer. It uses I_0, I_1, and I_2 instead of Sample[Idx], Sample[Idx-1], and Sample[Idx-2], and O_1 and O_2 instead of Out[Idx-1] and Out[Idx-2].

The A and B coefficients are put into a single array. The code also adds two of the sample input values together, and one coefficient is missing; this is because the B0 and B2 coefficients are always the same for this kind of filter.

There is one difference that is not for speed: the tricks done to keep the filter smooth, needed because the sample window is so small. They preserve the state of the filter. If we didn't preserve it, the filter would be starting and stopping so
frequently that it would add distracting clicks to the output sound. The filter's performance would also be weakened, because the sample window isn't big enough to hold sounds lower than (about) 200 Hz. By preserving these values, the filter isn't starting and stopping,
and doesn't really know about the sample window. All of the IIR filters in this program use similar techniques.

O_1, O_2 are explicitly preserved across calls by being stored in class variables

I_1 and I_2 are preserved by the audio loop (remember the note about preserving 10 samples at the bottom of the loop?). The audio loop keeps the last 10 samples at the start of InBuffer; when this procedure is called, it retrieves the last two samples from there.

Automatic Gain Control

I decided next to tackle a problem where my wife's voice did not carry well on calls. This happens a lot to her with cell phones – and answering machines. I was pretty sure that the problem was poor automatic gain control (AGC). The typical amplifier in
a headset (and in Skype) estimates how loud our voice is, then increases – or decreases – the volume to a reasonable level. It was deciding that my wife's voice was background noise, and cutting her off.

I chose to write an alternate gain control that amplified the sound and passed it to Skype. That way we'd have four to choose from: the one built into the microphone, the sound card's, mine, and Skype's. (To be fair, these automatic gain controls work well
in most cases.)

The main portion of the gain control is implemented in the file GainControl.cs. The control algorithm is:

Calculate (or estimate) how loud we are currently talking (using the Analyze() procedure)

If the loudness is very low, no one is talking… so set the output to zero. (Without this step, the volume of noise and hum would be cranked up)

Otherwise, compute the guess-gain by dividing how loud the sound of our voice should be by how loud it currently is

Compute the gain (called MaxGain) at which the sound will start clipping. If the guess-gain is louder than this, reduce it to MaxGain.

The software adjusts the gain for a gentle transition – especially when we go from absolute quiet to the start of talking. It does this by tracking the gain used for the previous sample window (called PrevGain) as well as the current one.

Multiply all the samples by this gain value.

If the sample rate is greater than 16000 samples/sec

Apply a low-pass filter at 8 kHz (again in IIR form). This helps prevent artifacts from downsampling.

Resample the sound to 16000

The portion of code that calculates the gain looks like (CutOff_dB, LowGain_dB, and TgtGain_dB are the three slider values):
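The original snippet isn't reproduced here; the logic below is my reading of the description. The slider-to-variable mapping (LowGain_dB as the “Normal” level, TgtGain_dB as the “Loud” level) is an assumption, and the MaxGain clipping step is omitted for brevity:

```csharp
using System;

// A reconstruction of the gain decision. estDb is the loudness estimate
// from Analyze(); the variable mapping is assumed, not confirmed.
static class GainSketch
{
    public static double ComputeGain(double estDb, double CutOff_dB,
                                     double LowGain_dB, double TgtGain_dB)
    {
        if (estDb < CutOff_dB)
            return 0.0;                          // background noise: send silence

        // The +4 dB of hysteresis: a momentarily raised voice is treated
        // as normal speech rather than jumping to the "Loud" target.
        double targetDb = (estDb <= LowGain_dB + 4) ? LowGain_dB : TgtGain_dB;

        // Guess-gain: how loud it should be, divided by how loud it is
        // (a subtraction in dB, converted back to a linear factor).
        return Math.Pow(10.0, (targetDb - estDb) / 20.0);
    }
}
```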

If you look at the code, you'll see that we don't compare directly with LowGain_dB; rather, we compare the estimated volume with LowGain_dB + 4. This gives a little “hysteresis” – if we raise our voice momentarily, the software won't suddenly jump to the highest
possible volume. Instead, it lowers the volume a little bit.

When the software changes the sample rate, it basically needs to know how many input samples to skip for each output sample. At the start of a call, the software computes this value, calling it InInc:
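The original computation isn't shown here; my reconstruction of it, keeping the article's InInc name, is simply the ratio of the two rates:

```csharp
// InInc: how far to advance through the input for each 16000 samples/sec
// output sample. The name is from the article; the formula is inferred.
static class Resample
{
    public static double ComputeInInc(int inputSampleRate)
    {
        return inputSampleRate / 16000.0;
    }
}
```

At 44100 samples/sec this gives 2.75625, i.e. the resampler consumes a little under three input samples per output sample.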

Estimating Loudness

How loud “it should be” is controlled by a slider on the screen. The software estimates how loud the sound is using
an algorithm devised by David Robinson that takes into account how the sound is perceived
by a person. This way we can increase the gain on hard-to-hear sounds, and reduce the gain on sounds that a person is very sensitive to.

The loudness estimator, implemented in the Analysis procedure in GainAnalysis.cs, uses the following algorithm:

Use a combination of two (IIR) filters to shape the sound to reflect how our ears hear it.

Compute the Mean-Square value of the filtered sound (called MS)

Track the last 750 ms of these values.

Make a sorted copy of these.

Retrieve the first non-zero value at least 95% of the way into the buffer. This is so that we don't take the loudest sample and assume that is how the person is talking.

If MS (the value computed in step 2) is much quieter, use that value instead.

Convert this value into dB scale by performing a logarithm on it.
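Steps 3 through 6 can be sketched as follows. The 0.40 factor and the 12800.0 threshold are taken from the article's own snippets; the class name and the exact history length are illustrative:

```csharp
using System;

// Pick a robust loudness from the recent mean-square history rather than
// the single loudest window.
static class LoudnessSketch
{
    public static double EstimateDb(double[] msHistory, double currentMs)
    {
        // Sorted copy of the recent (last ~750 ms of) mean-square values.
        var sorted = (double[])msHistory.Clone();
        Array.Sort(sorted);

        // First non-zero value at least 95% of the way into the buffer,
        // so one loud sample doesn't define the speaking level.
        double X = 0.0;
        for (int i = (int)(sorted.Length * 0.95); i < sorted.Length; i++)
            if (sorted[i] > 0.0) { X = sorted[i]; break; }

        // If the current window is much quieter, the user stopped
        // talking; use it instead so noise between words isn't amplified.
        if (currentMs < X * 0.40 && currentMs < 12800.0)
            X = currentMs;

        // Convert to a dB-like scale; epsilon avoids log10(0).
        return 10.0 * Math.Log10(X * 0.5 + double.Epsilon);
    }
}
```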

The code to “normalize” the sound into how a person hears it is below. Along the way it computes the square of the samples (used in step 2). Like the earlier IIR filters, these preserve their variables across calls. The first IIR is a yulewalk filter,
but it preserves its old intermediate values in an array. Like the trick in AudioLoop, where we copy the last 10 samples into the start of the current buffer, the analysis procedure copies the last 10 intermediate values into the start of the YuleTmp array.
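The listing itself is not reproduced here. As a structural stand-in, reduced to first order so the preserved state is easy to see (the real code is a 10th-order yulewalk filter whose coefficients come from Robinson's equal-loudness tables; the values below are placeholders), the state-preservation idea looks like:

```csharp
// A structural stand-in for the normalization filter. xPrev/yPrev play
// the role that the YuleTmp array plays in GainAnalysis.cs: filter state
// that survives across sample windows.
class YuleSketch
{
    const double B0 = 0.5, B1 = 0.25, A1 = -0.2;  // placeholder coefficients
    double xPrev, yPrev;   // state preserved across calls

    public double[] Filter(double[] window)
    {
        var output = new double[window.Length];
        for (int n = 0; n < window.Length; n++)
        {
            double y = B0 * window[n] + B1 * xPrev - A1 * yPrev;
            xPrev = window[n];
            yPrev = y;
            output[n] = y;
        }
        return output;
    }
}
```

Because the state is preserved, filtering a signal in several small windows produces exactly the same output as filtering it in one pass; the filter never "restarts" at a window boundary.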

The output of the yulewalk filter is fed into a 150 Hz high-pass filter. It is essentially the same as the low-pass filter described earlier.

Next, the code overrides this value if the current sample window is very, very quiet – that is, if the user stopped talking. (If we don't do this, we'll amplify the background noise between words.)

C#

if (MS < X * 0.40 && MS < 12800.0)
    X = MS;

Finally, convert the result into decibels (or a reasonable approximation of a decibel)

C#

return 10.0 * Math.Log10(X * 0.5 + double.Epsilon);

Note: The logarithm function takes a positive, non-zero floating point number. However, the value we pass to it can be zero, and if we pass a zero, the logarithm function returns a bad value. The simplest thing to do would be to check whether the value
we are passing is zero and skip the call. However, I learned a long time ago to just add “epsilon” to whatever we pass; avoiding the branch this way can really improve performance in number-crunching code.

Conclusion

This concludes how to add a little bit of feedback and fancy amplification to your Skype phone calls.

If you want to try this out, the download link for the source code is at the top of the article!

If you'd like to experiment further, here are ideas of what can be done:

DirectSound has echo cancellation and noise suppression. These seem desirable, especially if you wanted to try making your own speakerphone. I was not able to get them to work, and I would love to learn how.

I'm sure it is possible to trim even more latency off of the side-tone playback, and I would be interested in learning better techniques to do so.

Another idea would be to create the ideal equalizer from Robinson's Equal Loudness model, and use that to filter and amplify the sound.

It might be useful for the other party to control the settings, so that they decide when your voice has the right volume.

About The Author

Randall Maas writes firmware for medical devices, and consults in embedded software. Before that he did a lot of other things… like everyone else in the software industry. You can contact him at
randym@acm.org.

The Discussion

Ralph LaChance

Randall,

Thanks for preparing this work; looks very promising - I really dislike the absence of sidetones in Skype and because I instinctively end up talking extra loud on my headset (to hear myself) it bothers folks around me.

However, I'm having trouble with the code - specifically, the microphone initialization is failing - and sst aborts. Much of the time there actually is no mic on my system - it is present only when I actually plug in the headset - that sst would fail under those conditions I understand. -- But even when my headset/mic are plugged in sst fails to init the microphone object.

Presumably this is due to the transient nature of my usage - do you have any idea why the initialization would fail even when a headset/mic IS plugged in?
