Introduction

This code records and plays raw audio in PCM format, on Windows 8. It's written in VB but the same techniques apply to C#.

It uses IAudioClient, IAudioRenderClient and IAudioCaptureClient, which are part of the Windows Audio Session API (WASAPI).

Windows has several different API-families for recording and playing back audio:

MediaElement - All the UI frameworks have a MediaElement control - XAML for Windows 8, WPF, Silverlight, WinForms. They let you play audio and video. But they don't have an interface for recording, and they don't let you send your own raw audio data. (The best you can do is build a WAV file and tell them to play it.)

waveOut and waveIn - These legacy APIs used to be the simplest and easiest libraries to use from .NET for recording and playing raw audio. Unfortunately they're not allowed in Windows app-store apps.

WASAPI (Windows Audio Session API) - Introduced in Vista, this C++/COM API is the new recommended way to record and play audio.

XAudio2 - This is a low-level C++/COM API for audio, aimed at game developers, and allowed in Windows app-store apps. It is the successor to DirectSound.

NAudio - As a .NET developer, if you want audio, your first port of call should normally be http://naudio.codeplex.com/. NAudio is a high-quality library, available through NuGet and licensed under Ms-PL, for adding audio to VB and C# apps. It supports waveOut/waveIn, WASAPI, and DirectSound.

SharpDX - This is a managed wrapper for DirectX, at http://www.sharpdx.org/. Amongst other things, it wraps XAudio2. It has also been used in successful submissions to the Windows app store, so it's another candidate.

I wrote this article because I didn't know about SharpDX, and because (as of September 2012) NAudio doesn't work in Windows 8 app-store apps. That's because it includes calls to disallowed legacy APIs. Also, Windows 8 app-store apps require a different "entry-point" into WASAPI from the traditional entry-point that NAudio uses.

Using the code

COM interop. WASAPI is thoroughly COM based, so it needs P/Invoke interop wrappers to make it work under .NET. Moreover, in an audio application, you deal with large quantities of data, and it's important to release resources in a timely fashion. VB and C# use the IDisposable mechanism for this, and rely on .NET garbage collection for everything else. C++/COM uses IUnknown.Release and reference-counting instead. It's difficult to bridge the gap between the two.

You can skip over this section if you just want to read about WASAPI. But I don't advise it, since you'll need these skills in order to use the WASAPI correctly.

Here's an explanation of the above code. We are concerned with two things: the COM object, which maintains a reference count, and the .NET Runtime Callable Wrapper (RCW) for it, which has its own internal reference count. There's a one-to-one relationship between an RCW and an IUnknown IntPtr. The first time the IUnknown IntPtr enters managed code (e.g., through allocating a new COM object, or through Marshal.GetObjectForIUnknown), the RCW is created, its internal reference count is set to 1, and it calls IUnknown.AddRef just once. On subsequent times that the same IUnknown IntPtr enters managed code, the RCW's internal reference count is incremented, but it doesn't call IUnknown.AddRef. The RCW's internal reference count gets decremented through natural garbage collection; you can also decrement it manually with Marshal.ReleaseComObject, or force it straight to 0 with Marshal.FinalReleaseComObject. When the RCW's internal reference count drops to 0, it calls IUnknown.Release just once. Any further method call on it will fail with an InvalidComObjectException and the message: "COM object that has been separated from its underlying RCW cannot be used".

After line 10: We have a COM object with ref count "1", and "e" is a reference to its RCW, which has an internal reference count "1".

After line 20: The same COM object still has ref count "1", and "i" is a reference to the same RCW as before, which still has an internal reference count "1".

After line 30: The RCW's reference count dropped to "0", and IUnknown.Release was called on the COM object, which now has a ref count "0". Any further method call or cast on "i" and "e" will fail.
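The bookkeeping described above can be sketched as a toy model. This is plain Python, not .NET and not the article's VB code: the class names ComObject and Rcw and all their methods are hypothetical stand-ins, and the model only mirrors the counting rules stated in the text (for example, what real .NET does on a repeated release call is not modelled).

```python
class ComObject:
    """Hypothetical stand-in for a native COM object and its IUnknown ref count."""

    def __init__(self):
        self.ref_count = 0

    def add_ref(self):
        self.ref_count += 1

    def release(self):
        self.ref_count -= 1


class Rcw:
    """Toy model of a Runtime Callable Wrapper with its own internal count."""

    def __init__(self, com):
        self._com = com
        self.internal_count = 1
        com.add_ref()  # AddRef is called exactly once, when the RCW is created

    def enter_managed_code_again(self):
        # The same IUnknown pointer arrives again: only the internal count grows.
        self._check_alive()
        self.internal_count += 1

    def release(self):
        """Models Marshal.ReleaseComObject: decrement by one."""
        self._check_alive()
        self.internal_count -= 1
        if self.internal_count == 0:
            self._com.release()  # Release is called exactly once, at zero
        return self.internal_count

    def final_release(self):
        """Models Marshal.FinalReleaseComObject: force the count to zero."""
        self._check_alive()
        self.internal_count = 0
        self._com.release()

    def call_method(self):
        self._check_alive()
        return "ok"

    def _check_alive(self):
        if self.internal_count == 0:
            raise RuntimeError(
                "COM object that has been separated from its "
                "underlying RCW cannot be used")


def three_line_scenario():
    """Replays the 'line 10 / 20 / 30' walk-through from the text."""
    com = ComObject()
    e = Rcw(com)        # line 10: COM ref count 1, RCW internal count 1
    i = e               # line 20: a cast yields the same RCW, counts unchanged
    i.final_release()   # line 30: RCW count 0, COM ref count 0
    return com, e, i
```

After three_line_scenario() returns, calling call_method() on either "e" or "i" raises, because both names refer to the same released wrapper.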

Suggested practice: Make a wrapper class which implements IDisposable and which has a private field for the COM object (or more precisely, for its RCW). Never let your clients have direct access to the COM object. It's fine for local variables in your wrapper class to reference the same RCW if needed. In IDisposable.Dispose, call Marshal.FinalReleaseComObject, set the field to Nothing, and make sure that your wrapper never accesses methods or fields of that field again.
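The shape of that wrapper can be sketched in a language-neutral way. The following is a Python analogy, not .NET code: FakeAudioClient is a hypothetical stand-in for the RCW, final_release() stands in for Marshal.FinalReleaseComObject, and the context-manager methods play the role that IDisposable and a Using block play in VB.

```python
class FakeAudioClient:
    """Hypothetical stand-in for an RCW around a COM audio client."""

    def __init__(self):
        self.released = False

    def start(self):
        if self.released:
            raise RuntimeError("underlying object already released")
        return "started"

    def final_release(self):
        # Stand-in for Marshal.FinalReleaseComObject.
        self.released = True


class AudioClientWrapper:
    """Owns the COM object privately and releases it exactly once."""

    def __init__(self, com_object):
        self._client = com_object  # private field; never handed to callers

    def start(self):
        if self._client is None:
            raise ValueError("wrapper has been disposed")
        return self._client.start()

    def dispose(self):
        # Safe to call more than once: the field is cleared after release.
        if self._client is not None:
            self._client.final_release()
            self._client = None

    # Context-manager support, analogous to a VB "Using" block.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.dispose()
        return False  # do not swallow exceptions
```

A caller would write "with AudioClientWrapper(FakeAudioClient()) as w: w.start()"; once the block exits, any further call on the wrapper fails fast in the wrapper itself instead of touching the released object.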

Enumerating audio devices, and getting IAudioClient

.NET 4.5: This is the .NET 4.5 code for enumerating audio devices, and getting the IAudioClient for one of them.

Microphone permission. In Windows 8 we need to get an IAudioClient for a chosen recording/playback device. This isn't straightforward. If you're using the APIs in an app-store app, and you want to initialize an IAudioClient for a recording device, then the first time your application tries this, an alert will pop up asking the user for permission for the app to record audio. Windows 8 does this for privacy reasons, so that apps don't surreptitiously eavesdrop on their users. Windows will remember the user's answer and the user won't see that prompt again (unless they uninstall and reinstall the app). If the user changes their mind, they can launch your app and go to Charms > Settings > Permissions. Microsoft has written more detailed "Guidelines for devices that access personal data", including UI guidelines on how to present this to the user. Incidentally, permission is always implicitly granted to desktop apps, and no permission is needed at all for audio playback.

The way this works is that you construct your own object, in this case icbh, which implements IActivateAudioInterfaceCompletionHandler and IAgileObject. Next you call ActivateAudioInterfaceAsync, passing this object. A short time later, on a different thread, your object's IActivateAudioInterfaceCompletionHandler.ActivateCompleted method will be called. Inside the callback, you get hold of the IAudioClient interface, and then you call IAudioClient.Initialize() on it with your chosen PCM format. If Windows needs to show its permission prompt, the call to Initialize() blocks while the prompt is shown. Afterwards, the call to Initialize() will either succeed (if permission is granted), or fail with an UnauthorizedAccessException (if it isn't). You must call Initialize() from inside your callback, otherwise your app will block indefinitely on the Initialize() call.

The MSDN docs for ActivateAudioInterfaceAsync say that ActivateAudioInterfaceAsync may display a consent prompt the first time it is called. That's incorrect. It's Initialize() that may display the consent prompt; never ActivateAudioInterfaceAsync. They also say that "In Windows 8, the first use of IAudioClient to access the audio device should be on the STA thread. Calls from an MTA thread may result in undefined behavior." That's incorrect. The first use of IAudioClient.Initialize must be inside your IActivateAudioInterfaceCompletionHandler.ActivateCompleted handler, which will have been invoked on a background thread by ActivateAudioInterfaceAsync.

The above code requires you to implement this icbh object yourself. Here's my implementation. I made it implement the "awaiter pattern" with a method called GetAwaiter: this lets us simply Await the icbh, as in the above code. This code shows a typical use of TaskCompletionSource, to turn an API that uses callbacks into a friendlier one that you can await.
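The general callback-to-awaitable technique can be sketched outside .NET as well. The following Python analogy uses an asyncio future where .NET would use TaskCompletionSource, and __await__ where VB would use GetAwaiter. CompletionHandler and activate_async are hypothetical names; activate_async merely simulates an API that calls back on a different thread and has nothing to do with the real ActivateAudioInterfaceAsync.

```python
import asyncio
import threading


class CompletionHandler:
    """Awaitable handler: completes a future when the callback fires."""

    def __init__(self, loop):
        self._loop = loop
        self._future = loop.create_future()

    def activate_completed(self, result, error=None):
        # Runs on a background thread, so hand the result back to the
        # event loop thread instead of touching the future directly.
        if error is not None:
            self._loop.call_soon_threadsafe(self._future.set_exception, error)
        else:
            self._loop.call_soon_threadsafe(self._future.set_result, result)

    def __await__(self):
        # Lets callers write "await handler", like GetAwaiter in .NET.
        return self._future.__await__()


def activate_async(handler):
    """Hypothetical callback-based API: invokes the handler on another thread."""
    worker = threading.Thread(
        target=handler.activate_completed, args=("audio-client",))
    worker.start()


async def main():
    handler = CompletionHandler(asyncio.get_running_loop())
    activate_async(handler)
    return await handler  # resumes once the callback has fired


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The design point is the same in both worlds: the callback does nothing but complete the future, and all follow-up work is written as straight-line code after the await.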

The MSDN docs for IActivateAudioInterfaceCompletionHandler say that the object must be "an agile object (aggregating a free-threaded marshaler)". That's incorrect. The object's IMarshal interface is never even retrieved. All that's required is that it implements IAgileObject.

Picking the audio format

In the above code, as an argument to the IAudioClient.Initialize method, I went straight for CD-quality audio (stereo, 16 bits per sample, 44100Hz). You can only pick formats that the device supports natively, and many devices (including the Surface) don't even support this format...
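For reference, the derived fields of a PCM format follow directly from the three basic numbers. The sketch below is plain Python arithmetic; the function name pcm_format and the dictionary keys are hypothetical, chosen to echo the WAVEFORMATEX field names nBlockAlign and nAvgBytesPerSec.

```python
def pcm_format(channels, bits_per_sample, samples_per_sec):
    """Derive the dependent PCM fields from the three basic parameters."""
    if bits_per_sample % 8 != 0:
        raise ValueError("this sketch assumes whole-byte sample sizes")
    # One "block" (frame) holds one sample for every channel.
    block_align = channels * bits_per_sample // 8
    # Data rate in bytes per second.
    avg_bytes_per_sec = samples_per_sec * block_align
    return {
        "channels": channels,
        "bits_per_sample": bits_per_sample,
        "samples_per_sec": samples_per_sec,
        "block_align": block_align,
        "avg_bytes_per_sec": avg_bytes_per_sec,
    }


# CD-quality audio: stereo, 16 bits per sample, 44100 Hz.
CD_QUALITY = pcm_format(channels=2, bits_per_sample=16, samples_per_sec=44100)
```

For CD quality this gives a block alignment of 4 bytes and a data rate of 176400 bytes per second.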

There are some other ways you can pick an audio format. I generally don't like them, because they return WAVEFORMATEX structures with a bunch of extra OS-specific data at the end. That means you have to keep the IntPtr that's given to you if you want to pass it to Initialize(), and you have to call Marshal.FreeCoTaskMem on it at the end. (Alternatively, what NAudio does is more elegant: it defines its own custom marshaller which is able to marshal that extra data in and out.)
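To see why the trailing data matters, here is a sketch of how a WAVEFORMATEX blob is laid out: a packed 18-byte header whose last field, cbSize, says how many extra bytes follow. This is Python working on a byte string, not interop code; split_wave_format is a hypothetical helper, and the 22 zero bytes in the example are placeholders, not a real extension payload.

```python
import struct

# Packed little-endian WAVEFORMATEX header: wFormatTag, nChannels,
# nSamplesPerSec, nAvgBytesPerSec, nBlockAlign, wBitsPerSample, cbSize.
WAVEFORMATEX = struct.Struct("<HHIIHHH")  # 18 bytes

WAVE_FORMAT_PCM = 0x0001
WAVE_FORMAT_EXTENSIBLE = 0xFFFE


def split_wave_format(blob):
    """Split a format blob into its header fields and its trailing extra bytes."""
    if len(blob) < WAVEFORMATEX.size:
        raise ValueError("blob is shorter than a WAVEFORMATEX header")
    tag, channels, rate, avg_bytes, align, bits, cb_size = \
        WAVEFORMATEX.unpack_from(blob, 0)
    extra = blob[WAVEFORMATEX.size:WAVEFORMATEX.size + cb_size]
    if len(extra) != cb_size:
        raise ValueError("blob is truncated: cbSize promises more data")
    fields = {
        "format_tag": tag,
        "channels": channels,
        "samples_per_sec": rate,
        "avg_bytes_per_sec": avg_bytes,
        "block_align": align,
        "bits_per_sample": bits,
        "cb_size": cb_size,
    }
    return fields, extra


# Example: an extensible-style header announcing 22 extra bytes
# (the zero bytes are placeholders for illustration only).
EXAMPLE_BLOB = WAVEFORMATEX.pack(
    WAVE_FORMAT_EXTENSIBLE, 2, 48000, 384000, 8, 32, 22) + bytes(22)
```

Copying only the 18 fixed bytes into a managed structure would drop the extra bytes, which is why the original pointer (or a marshaller that understands cbSize) has to be kept.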

Recording audio

Here's the code to record audio. In this case I have already allocated a buffer "buf" large enough to hold 10 seconds of audio, and I merely copy into that.
You might instead want to work with smaller buffers that you re-use.
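As a rough sizing sketch (Python, with hypothetical helper names buffer_size_bytes and copy_packet, and assuming the CD-quality format mentioned earlier), the 10-second buffer and the copy-with-offset step look like this:

```python
def buffer_size_bytes(seconds, channels=2, bits_per_sample=16,
                      samples_per_sec=44100):
    """Bytes needed to hold `seconds` of PCM audio in the given format."""
    bytes_per_frame = channels * bits_per_sample // 8
    return seconds * samples_per_sec * bytes_per_frame


def copy_packet(buf, offset, packet):
    """Copy one captured packet into buf at offset; return the new offset.

    Data that would overflow the buffer is dropped, so recording simply
    stops growing once the preallocated buffer is full.
    """
    n = min(len(packet), len(buf) - offset)
    buf[offset:offset + n] = packet[:n]
    return offset + n


# Preallocate once, then fill as packets arrive.
buf = bytearray(buffer_size_bytes(10))
offset = 0
offset = copy_packet(buf, offset, b"\x01\x02\x03\x04")  # one stereo frame
```

Ten seconds of CD-quality audio comes to 1,764,000 bytes, which is small enough to preallocate; for longer recordings the reusable smaller buffers mentioned above avoid holding everything in memory.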

The code is event-based. It works with a Win32 event handle, obtained with the Win32 function CreateEventEx. Every time a new buffer of audio data is available, the event gets set. Because I passed "0" as a flag to CreateEventEx, it was an auto-reset event, i.e., it gets reset every time I successfully wait for it. Here's the small helper function WaitForSingleObjectAsync which lets me use a nice Await syntax:
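The auto-reset behaviour and the "wait without blocking the caller" idea can be modelled as follows. This is a Python analogy, not the Win32 API and not the article's helper: AutoResetEvent and wait_async are hypothetical names, and the blocking wait is simply pushed onto a worker thread so that the coroutine can await it.

```python
import asyncio
import threading


class AutoResetEvent:
    """Event that resets itself each time a wait on it succeeds."""

    def __init__(self):
        self._cond = threading.Condition()
        self._signaled = False

    def set(self):
        with self._cond:
            self._signaled = True
            self._cond.notify()

    def wait(self, timeout=None):
        """Block until signaled or timed out; a successful wait resets the event."""
        with self._cond:
            signaled = self._cond.wait_for(lambda: self._signaled, timeout)
            if signaled:
                self._signaled = False  # the "auto-reset" part
            return signaled


async def wait_async(event, timeout=None):
    """Await the event by running the blocking wait on a worker thread."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, event.wait, timeout)


async def demo():
    event = AutoResetEvent()
    # Simulate the audio engine signalling "a buffer is ready" shortly after.
    threading.Timer(0.05, event.set).start()
    first = await wait_async(event, timeout=2.0)    # consumes the signal
    second = await wait_async(event, timeout=0.05)  # nothing left to consume
    return first, second
```

In demo(), the first wait succeeds and resets the event, so the second wait times out: each signal releases exactly one successful wait.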

P/Invoke interop libraries for WASAPI and IAudioClient

All that's left is a huge P/Invoke interop library. It took me several days to piece all this together. I'm not a P/Invoke expert by any means. I bet there are bugs in the definitions, and I'm sure they don't embody best practice.

Notes

Disclaimer: although I work at Microsoft on the VB/C# language team, this article is strictly a personal amateur effort based on public information and
experimentation - it's not in my professional area of expertise, is written in my own free time not as a representative of Microsoft, and neither Microsoft
nor I make any claims about its correctness.


About the Author

Lucian studied theoretical computer science in Cambridge and Bologna, and then moved into the computer industry. Since 2004 he's been paid to do what he loves -- designing and implementing programming languages! The articles he writes on CodeProject are entirely his own personal hobby work, and do not represent the position or guidance of the company he works for. (He's on the VB/C# language team at Microsoft).

Comments and Discussions

Hi!
In your sample code (good work btw.) you use 44kHz 16bit stereo to record audio.
After porting the sample to C# and playing around with the settings a bit, I found that (at least on my Win7 machine) IAudioClient.Initialize() always fails with AUDCLNT_E_UNSUPPORTED_FORMAT when I try to set anything less than your settings (22kHz stereo or 44kHz mono or 22kHz mono) for recording. IAudioClient.IsFormatSupported() always suggests a hilariously overpowered format (44kHz 32bit stereo) in this case (or, depending on the audio device, even 48kHz) when all I want is to record 22kHz 16bit mono.

With WaveIO it was (and is) no problem to get less-than-top-notch recordings. Do I really have to use maximum quality and then downsample myself? I can hardly believe this.