Full-HD Voice: Understanding the AAC codecs behind a new era in communication

We have grown accustomed to "HD Everywhere", consuming high-fidelity content in most aspects of our lives. State-of-the-art audio and video codecs such as MPEG AAC and H.264 have set our expectations by delivering high-quality rich media at very low bit rates.

These codecs enable high-quality multimedia content for Digital TV, online streaming, media stores such as iTunes, video games, and many other state-of-the-art media services and applications. The only real exception to the omnipresence of high-quality sound is the phone call, which is still largely tied to the limitations of technologies derived from the last century.

With Full-HD Voice, a new era of audio quality for the telecommunications market has begun. Unlike Plain Old Telephone Services (POTS), ISDN and mobile phone calls, Full-HD Voice offers an unsurpassed level of quality, resulting in calls that sound as clear as talking to someone in the same room, or listening to high-quality digital audio.

The current high-quality codec family behind Full-HD Voice is Enhanced Low Delay AAC (AAC-ELD). In addition to the millions of calls already being made today using AAC-ELD, this technology is set to enable many new Full-HD Voice applications, including telepresence at home and mobile rich media telephony.

This paper explains the advantages and opportunities of Full-HD Voice, including how AAC-ELD meets the high quality requirements users expect today. Full-HD Voice already leads the industry with the latest communication advancements and is set to drive future innovations.

The Gap in Quality Expectations
It is no secret that the vast majority of phone calls sound muffled compared to other sources of audio. Calls today have shortcomings that can make conversations difficult to understand, especially in noisy or reverberant environments, when listening to talkers with soft or whispered speech, and when following a conversation in a non-native language or one spoken with an accent.

For example, the low audio bandwidth makes it quite difficult to distinguish between certain consonants such as "f" and "s". Both share a similar low-frequency spectral envelope, but the "s" phoneme is characterized by significant energy in the frequency range around 10 kHz [Figure 1].

Figure 1: Typical spectrum of speech, shown here for the "s" phoneme.

Communication systems based on speech codecs with low audio bandwidth are unsuitable for sharing music, singing and ambient sounds. In addition, delays lead to involuntary interruptions, impacting a natural conversation flow. Many calls require repetition and even "spelling bees" to determine what someone is trying to communicate.

These problems challenge the patience of call participants, cause frequent misunderstandings and frustration, and make phone calls simply exhausting. In addition, they limit phone calls to speech only, shutting out more natural communication options that include multiple talkers, ambient sounds, or music.

The low conversational audio quality stands in contrast to the multimedia capabilities available with the latest smartphones or tablets. Practically all of these devices can play very high-quality audio from MP3 or AAC files, whether through headphones or through external or internal speakers.

In addition, virtually all smartphones and tablets have a camera function that records not only HD (or even Full HD, i.e. 1080p) video but also near-CD-quality audio. Although the built-in microphones can capture audio at very high quality, in most cases only the camcorder function takes full advantage of this.

With all these high-quality capabilities and components used in modern smartphones, tablets and other consumer electronic devices such as TVs, why is it that phone calls still sound like they did in the last century?

The Full Audio Spectrum: Evolution to Full-HD Voice
Most phone calls employ speech codecs, not audio codecs. This simple fact is the basis of many issues. Audio codecs model the human auditory system. However, speech codecs model the human speech system and therefore can only reproduce the human voice reasonably well. Background noise, multiple voices, music and sounds are beyond the capabilities of a speech codec, rendering these sounds highly distorted or even unrecognizable.

Speech codecs are used in many POTS and ISDN calls and in all mobile phone calls. In addition, these calls have an upper frequency limit of 3.4 kHz, also known as "narrowband". Because people are able to hear audio signals up to much higher frequencies (between 14 and 20 kHz at normal listening levels), most phone services simply discard at least three quarters of the audible spectrum. This causes the muffled sound in everyday telephony.

The "HD Voice" services recently introduced by some telephony providers have raised audio bandwidth from 3.4 to around 7 kHz. The speech codecs used in these services provide audibly better quality compared to legacy calls, but still transmit no more than about half of the full audible audio spectrum.
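The bandwidth figures quoted above can be checked with simple arithmetic. The following Python sketch is only an illustration: the function and constant names are hypothetical, and it assumes a linear frequency scale with the transmitted band starting at 0 Hz, which is a simplification rather than a property of any particular codec.

```python
# Upper audio bandwidth limits in kHz, as quoted in the text.
NARROWBAND_KHZ = 3.4                # legacy "narrowband" telephony
HD_VOICE_KHZ = 7.0                  # "HD Voice" services (around 7 kHz)
AUDIBLE_LIMITS_KHZ = (14.0, 20.0)   # range of audible upper limits from the text


def fraction_kept(codec_limit_khz: float, audible_limit_khz: float) -> float:
    """Share of the audible band (0 Hz up to audible_limit_khz) that is transmitted.

    Assumption for illustration: linear frequency scale, band starts at 0 Hz.
    """
    return min(codec_limit_khz, audible_limit_khz) / audible_limit_khz


def summarize() -> dict:
    """Return the kept fraction for each service and each audible limit."""
    services = {"narrowband": NARROWBAND_KHZ, "HD Voice": HD_VOICE_KHZ}
    return {
        (name, audible): fraction_kept(limit, audible)
        for name, limit in services.items()
        for audible in AUDIBLE_LIMITS_KHZ
    }


if __name__ == "__main__":
    for (name, audible), kept in summarize().items():
        print(f"{name:10s} vs {audible:4.0f} kHz hearing limit: "
              f"{kept:.0%} kept, {1 - kept:.0%} discarded")
```

Under these assumptions, narrowband keeps roughly 17 to 24 percent of the audible band, so more than three quarters is discarded, while HD Voice keeps roughly 35 to 50 percent, depending on which audible limit is used.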

Now, "Full-HD Voice" is available, delivering a new level of performance for voice calls and raising the quality to the level experienced in most digital media today.

High-quality audio codecs specifically tuned for communication applications are widely available. This makes Full-HD Voice no longer a future promise, but a reality today. Audio codecs such as AAC-ELD provide the same or even lower coding latencies than speech codecs, and open up the full audio spectrum, with an upper limit of 14 to 20 kHz [Figure 2]. Consequently, Full-HD Voice-enabled audio codecs make optimal use of the multimedia hardware built into phones today. This will bring a stunning new high-quality experience to our everyday communication.